CN114638839B

CN114638839B - Small sample video target segmentation method based on dynamic prototype learning

Info

Publication number: CN114638839B
Application number: CN202210536170.6A
Authority: CN
Inventors: 张天柱; 张哲�; 张勇东; 罗乃淞; 吴枫
Original assignee: University of Science and Technology of China USTC
Current assignee: University of Science and Technology of China USTC
Priority date: 2022-05-18
Filing date: 2022-05-18
Publication date: 2022-09-30
Anticipated expiration: 2042-05-18
Also published as: CN114638839A

Abstract

The invention discloses a small sample video target segmentation method based on dynamic prototype learning, which comprises the following steps: acquiring a video target to be segmented; and processing the video target to be segmented by using a small sample video target segmentation model based on dynamic prototype learning to obtain a video target segmentation result. According to the small sample video target segmentation method, the optimal transmission method is used for self-adaptive learning of the dynamic prototype, noise attention is effectively reduced, and meanwhile, a guiding mode is adopted for matching the multi-level feature map, so that the calculated amount is greatly reduced; the method can fully extract the target information in a small number of support set samples, and remarkably improves the segmentation performance on the video of the challenge set. The invention also discloses electronic equipment, a storage medium and a computer program product for executing the small sample video object segmentation method based on the dynamic prototype learning.

Description

Small sample video target segmentation method based on dynamic prototype learning

Technical Field

The invention relates to the field of computer vision, in particular to a training method of a small sample video target segmentation model and a video target segmentation method.

Background

Video target segmentation is a technology for predicting a foreground target mask in each frame of a video, and has wide application in the aspects of augmented reality, automatic driving, video editing and the like.

The prior art segmentation of video objects is typically based on semi-supervised and unsupervised. The semi-supervision method needs to give target information of a first frame of each video, then carries out dense association on targets in subsequent frames of the video, and the process seriously depends on a large amount of densely segmented and labeled data, so that time and labor are consumed; the unsupervised method has low performance due to the lack of the labeling data, and cannot meet the requirements of practical application. In addition, the above two methods cannot be well generalized to new target classes, and the segmentation capability on classes not seen in the training phase is sharply reduced, which limits the extensibility and practicability of video target recognition.

Disclosure of Invention

In view of the above, it is a primary object of the present invention to provide a small sample video object segmentation method based on dynamic prototype learning, an electronic device, a storage medium and a computer program product, which are intended to at least partially solve at least one of the above-mentioned technical problems.

According to a first aspect of the present invention, there is provided a small sample video object segmentation method based on dynamic prototype learning, including:

acquiring a video target to be segmented;

processing a video target to be segmented by using a small sample video target segmentation model based on dynamic prototype learning to obtain a video target segmentation result, wherein the small sample video target segmentation model based on dynamic prototype learning is obtained by training according to the following method:

processing the video frame images of the challenge set and the video frame images of the support set by utilizing a part of neural network layers of a feature extraction module of a small sample video object segmentation model to obtain low-level features of the challenge video frame and low-level features of the support video frame;

processing the video frame images of the challenge set by using all neural network layers of a feature extraction module of the small sample video object segmentation model to obtain the features of the challenge video frame;

carrying out mask operation on the low-level features of the support video frame to obtain foreground features of the support video frame;

processing foreground characteristics of the support video frame and characteristics of the challenge video frame by utilizing a mining module of a small sample video object segmentation model to obtain a corresponding relation matrix;

processing the low-level features of the support video frames, the low-level features of the inquiry video frames and the corresponding relation matrix by using a guide module of the small sample video target segmentation model to obtain a low-level corresponding relation matrix;

processing the corresponding relation matrix and the low-level corresponding relation matrix by using a segmentation module of the small sample video target segmentation model to obtain a video target segmentation result, and optimizing the small sample video target segmentation model by using a loss function of the small sample video target segmentation model;

and (4) performing feature extraction operation, masking operation, mining operation, guiding operation, segmentation operation and optimization operation in an iterative manner until the value of the loss function meets a preset condition to obtain a trained small sample video target segmentation model.

According to the embodiment of the present invention, the processing of the foreground feature of the support video frame and the feature of the challenge video frame by the mining module of the small sample video object segmentation model to obtain the corresponding relationship matrix includes:

processing foreground characteristics of the support video frame by utilizing a prototype generator of the mining module to obtain dynamic prototype characteristics;

calculating the dynamic prototype characteristics and the foreground characteristics of the support video frame to obtain a support corresponding relation matrix;

calculating the dynamic prototype characteristics and the characteristics of the inquiry video frame to obtain an inquiry corresponding relation matrix;

and operating the support corresponding relation matrix and the inquiry corresponding relation matrix to obtain a corresponding relation matrix.

According to an embodiment of the present invention, the processing of the foreground feature of the support video frame by the prototype generator of the mining module to obtain the dynamic prototype feature includes:

carrying out global average pooling on foreground features of the support video frame to obtain video target prototype features;

calculating foreground characteristics of the support video frame and prototype characteristics of the video target by using a prototype generator to obtain an attention matrix;

processing the attention matrix by using an optimal transmission algorithm to obtain an optimal distribution matrix;

and calculating the foreground characteristics of the support video frame and the optimal distribution matrix, and calculating the calculation result and the video target prototype characteristics to obtain the dynamic prototype characteristics.

According to an embodiment of the present invention, the above-mentioned attention matrix is determined by equation (1):

（1），

wherein the content of the first and second substances,

is the first

The foreground feature vector of each support video frame,

index representing foreground feature vector of support video frame, for length of

The foreground feature vector of the supporting video frame,

is in a range of values

，

Is the first

The characteristics of each prototype are characterized in that,

index representing prototype features for

The characteristics of each prototype are characterized in that,

is in the value range of

，

Is a matrix of support focus forces,

is a support attention force matrix

To (1) a

Go to the first

Column value for indicating the second

Individual prototype features and

similarity of foreground feature vectors of the individual support video frames;

wherein the dynamic prototype feature is determined by equation (2):

（2），

wherein the content of the first and second substances,

is a sequence of foreground feature vectors that support video frames,

is the first

Is characterized by a prototype

The obtained dynamic prototype characteristics are updated according to the dynamic prototype characteristics,

representing the optimized support set attention force matrix,

is the first of the optimized support attention force matrix

A row vector.

According to an embodiment of the present invention, the processing, by the guidance module of the small sample video object segmentation model, the low-level feature of the support video frame, the low-level feature of the inquiry video frame, and the correspondence matrix to obtain the low-level correspondence matrix includes:

selecting a preset row number and a preset column number of the corresponding relation matrix to obtain an intermediate corresponding relation matrix;

calculating the low-level feature of the support video frame and the intermediate corresponding relation matrix to obtain a reconstructed feature matrix;

and performing operation on the reconstructed feature matrix and the low-level features of the inquiry video frame to obtain a low-level corresponding relation matrix.

According to an embodiment of the present invention, the above guidance module is determined by formula (3) and formula (4):

（3），

（4），

wherein the content of the first and second substances,

is a temperature factor, for controlling the degree of smoothing of the output probability distribution,

the length of the modulus of the vector is represented,

is the first

An individual challenge video frame feature vector is generated,

index representing the feature vector of the challenge video frame, for height and width respectively

And

the image of the video frame of the challenge,

is in the value range of

，

Is an allocation matrix of dynamic prototype features and challenge video frame features,

first of the allocation matrix representing dynamic prototype features and challenging video frame features

Go to the first

The value of the column is such that,

an assignment matrix representing the optimized dynamic prototype features and the foreground features of the support video frame,

a correspondence matrix representing the features of the challenge video frame and the foreground features of the support video frame,softmaxrepresenting a normalized exponential function.

According to an embodiment of the present invention, the loss function of the small sample video object segmentation model includes a cross-over ratio loss function and a cross-entropy loss function;

wherein the cross entropy loss function is determined by equation (5):

（5），

wherein, the first and the second end of the pipe are connected with each other,

and

respectively representing the height and width of the incoming challenge video frame image or support video frame image,

which represents the product of the height and the width,

is the result of the real segmentation,

representing the second in the real segmentation result

Go to the first

The value of the column is such that,

is the result of the segmentation predicted by the model,

representation of the segmentation results of model predictions

Go to the first

The value of the column;

wherein the cross-over ratio loss function is determined by equation (6):

（6），

wherein the content of the first and second substances,

a norm of the matrix is represented.

According to a second aspect of the present invention, there is provided an electronic apparatus comprising:

one or more processors;

a storage device for storing one or more programs,

wherein the one or more programs, when executed by the one or more processors, cause the one or more processors to perform the above dynamic prototype learning-based small sample video object segmentation method.

According to a third aspect of the present invention, there is provided a computer-readable storage medium having stored thereon executable instructions that, when executed by a processor, cause the processor to perform the above-mentioned small-sample video object segmentation method based on dynamic prototype learning.

The small sample video target segmentation method based on dynamic prototype learning provided by the invention has the advantages that the optimal transmission method is used for adaptively learning the dynamic prototype, the noise attention is effectively reduced, meanwhile, the multi-level feature maps are matched in a guiding mode, the calculated amount is greatly reduced, meanwhile, the video segmentation method provided by the invention fully extracts the target information in a small number of support set samples, and the segmentation performance on the video of an inquiry set is obviously improved.

Drawings

FIG. 1 is a flow chart of a small sample video object segmentation method based on dynamic prototype learning according to an embodiment of the present invention;

FIG. 2 is a flowchart of a training method of a small sample video object segmentation model based on dynamic prototype learning according to an embodiment of the present invention;

FIG. 3 is a flow chart of obtaining a correspondence matrix according to an embodiment of the present invention;

FIG. 4 is a flow diagram for obtaining dynamic prototype features according to an embodiment of the present invention;

FIG. 5 is a flow diagram of obtaining a low-level correspondence matrix according to an embodiment of the invention;

FIG. 6 is a small sample video object segmentation model framework diagram based on dynamic prototype learning according to an embodiment of the present invention;

fig. 7 schematically illustrates a block diagram of an electronic device adapted to implement a small-sample video object segmentation method of base dynamic prototype learning, in accordance with an embodiment of the present invention.

Detailed Description

In order that the objects, technical solutions and advantages of the present invention will become more apparent, the present invention will be further described in detail with reference to the accompanying drawings in conjunction with the following specific embodiments.

The invention provides a small sample video target segmentation method based on dynamic prototype learning, which aims to reduce the dependence on data, improve the expansibility and the practicability and achieve better video target segmentation performance by using a small amount of data with labels.

In the current method, the method of dense matching by using multi-level features achieves the leading performance. However, dense matching of pixel-by-pixel features introduces a large amount of correspondence noise, and further processing on multiple scales increases the computational effort. The method provided by the invention can learn the target prototype in a self-adaptive manner, realize robust multi-level dense matching in a mode of an intermediate bridge, and effectively alleviate the problems of noise and large calculation amount.

The video segmentation method provided by the invention can be applied to an application system related to video object segmentation; the target in the input video is segmented according to the information provided by a small number of support set images, and the method can be widely applied to scenes such as augmented reality, automatic driving, video editing and the like. In a specific embodiment, the method can be embedded into a mobile device in a software form, and provides a real-time segmentation result of a recorded video; or the method can be installed in a background server to provide the processing result of the video in a large batch.

Fig. 1 is a flowchart of a small sample video object segmentation method based on dynamic prototype learning according to an embodiment of the present invention.

As shown in FIG. 1, the method includes operations S110 to S120.

In operation S110, a video object to be segmented is acquired.

In operation S120, a video object to be segmented is processed by using a small-sample video object segmentation model based on dynamic prototype learning, and a video object segmentation result is obtained.

Fig. 2 is a flowchart of a training method of a small sample video object segmentation model based on dynamic prototype learning according to an embodiment of the present invention.

As shown in FIG. 2, the method includes operations S210 to S270.

In operation S210, the challenge set video frame image and the support set video frame image are processed by using a part of the neural network layer of the feature extraction module of the small sample video object segmentation model, so as to obtain a low-level feature of the challenge video frame and a low-level feature of the support video frame.

The low-level features are processed by a part of neural network layers of the feature extraction module, so that the resolution is high, more detail information is included, but the low-level features are lower in semantic and more in noise. High-level features (or features) as opposed to low-level features, which traverse more layers of the neural network than low-level features, have stronger semantic information, but have a lower resolution and a poorer perception of detail.

In operation S220, all the neural network layers of the feature extraction module of the small sample video object segmentation model are used to process the video frame images of the challenge set, so as to obtain the features of the challenge video frame.

For input support set video frame images and inquiry set video frame images belonging to the same category, the feature extraction module is utilized to perform multi-level feature extraction based on a ResNet-50 network, and then a 1x1 convolutional layer is mapped to a common measurement space.

In operation S230, a mask operation is performed on the low-level features of the support video frame to obtain foreground features of the support video frame.

In operation S240, the mining module of the small sample video object segmentation model is used to process the foreground feature of the support video frame and the feature of the challenge video frame, so as to obtain a corresponding relationship matrix.

In operation S250, the low-level feature of the support video frame, the low-level feature of the challenge video frame, and the corresponding relationship matrix are processed by using the guidance module of the small-sample video object segmentation model to obtain a low-level corresponding relationship matrix.

In operation S260, the correspondence matrix and the low-level correspondence matrix are processed by using a segmentation module of the small-sample video object segmentation model to obtain a video object segmentation result, and the small-sample video object segmentation model is optimized by using a loss function of the small-sample video object segmentation model.

In operation S270, the feature extraction operation, the masking operation, the mining operation, the guiding operation, the segmentation operation, and the optimization operation are performed iteratively until the value of the loss function satisfies the preset condition, so as to obtain a trained small sample video target segmentation model.

According to the training method provided by the invention, a small sample video target segmentation model with reliability, generalization and high efficiency can be obtained by utilizing a dynamic prototype mining module and a multi-level dynamic guide module based on an optimal transmission algorithm; the trained small sample video target segmentation model is used for segmenting the video target, an optimal transmission method can be used for adaptively learning a dynamic prototype, noise attention is effectively reduced, meanwhile, a multi-level characteristic diagram is matched in a guiding mode, the calculated amount is greatly reduced, meanwhile, the video segmentation method provided by the invention fully extracts the target information in a small number of support set samples, and the segmentation performance on the challenge set video is obviously improved.

Fig. 3 is a flowchart of obtaining a correspondence matrix according to an embodiment of the present invention.

As shown in fig. 3, the mining module of the small sample video object segmentation model is used for processing foreground features of the support video frame and features of the inquiry video frame, and obtaining a corresponding relation matrix includes operations S310 to S340.

In operation S310, the foreground features of the support video frame are processed by using a prototype generator of the mining module to obtain dynamic prototype features.

In operation S320, the dynamic prototype feature and the foreground feature of the support video frame are calculated to obtain a support correspondence matrix.

In operation S330, the dynamic prototype feature and the feature of the challenge video frame are operated to obtain a challenge correspondence matrix.

In operation S340, the support correspondence matrix and the challenge correspondence matrix are operated to obtain a correspondence matrix.

In the process of acquiring the corresponding relation matrix, the feature points which support the incidence relation between the foreground features of the video frame and the features of the video frame to be inquired can be fully excavated by utilizing the excavation module of the dynamic prototype based on the optimal transmission algorithm, so that more firm data support is provided for the training of the subsequent model.

FIG. 4 is a flow diagram for obtaining dynamic prototype features according to an embodiment of the present invention.

As shown in fig. 4, processing the foreground feature of the support video frame by using the prototype generator of the mining module to obtain the dynamic prototype feature includes operations S410 to S440.

In operation S410, global average pooling is performed on the foreground features of the support video frame to obtain video target prototype features.

In operation S420, the prototype generator is used to perform operations on the foreground features of the support video frame and the prototype features of the video target, so as to obtain an attention matrix.

In operation S430, the attention matrix is processed using an optimal transfer algorithm to obtain an optimal allocation matrix.

In operation S440, the foreground feature of the supported video frame and the optimal allocation matrix are operated, and the operation result and the video target prototype feature are operated to obtain a dynamic prototype feature.

The processing process for acquiring the dynamic prototype features can effectively reduce the noise attention in the original video frame image, thereby improving the segmentation performance of the trained model.

（1），

wherein the content of the first and second substances,

is the firstiThe foreground feature vectors of each of the support video frames,

The foreground feature vector of the supported video frame,

is in the value range of

，

Is the first

The characteristics of each prototype are characterized in that,

index representing prototype features for

The characteristics of each prototype are characterized in that,

is in the value range of

，

Is a matrix of support concentration attention forces,

is the first to support the attention-focusing force matrix

Go to the first

Column value for indicating the second

Individual prototype characteristics and

wherein the dynamic prototype feature is determined by equation (2):

（2），

wherein the content of the first and second substances,

is a sequence of foreground feature vectors supporting video frames,

is the first

Character of individual prototype

the optimized support set attention matrix is shown, namely the support set attention matrix optimized by using the optimal transmission algorithm,

is the first of the optimized support set attention force matrix

Row vector, representing

The line prototype feature vector pairs support the similarity of the foreground feature vectors of the video frames.

FIG. 5 is a flow diagram of obtaining a low-level correspondence matrix according to an embodiment of the invention.

As shown in fig. 5, the obtaining of the low-level correspondence matrix by using the guidance module of the small sample video object segmentation model to process the low-level features of the support video frame, the low-level features of the inquiry video frame and the correspondence matrix includes operations S510 to S530.

In operation S510, a preset row number and a preset column number of the corresponding relationship matrix are selected to obtain an intermediate corresponding relationship matrix.

In operation S520, the low-level feature of the support video frame and the intermediate correspondence matrix are operated to obtain a reconstructed feature matrix.

In operation S530, the reconstructed feature matrix and the low-level features of the challenge video frame are operated to obtain a low-level corresponding relationship matrix.

（3），

（4），

wherein the content of the first and second substances,

as a temperature factor, useIn controlling the degree of smoothing of the output probability distribution,

the length of the modulus of the vector is represented,

is the first

An individual challenge video frame feature vector is generated,

And

the image of the video frame of the challenge,

is in the value range of

，

a first of the allocation matrices representing the dynamic prototype features and the features of the challenge video frame

Go to the first

The value of the column is such that,

Allocation matrices for dynamic prototype features and supporting video frame features

Then it is determined by:

，

allocation matrix representing dynamic prototype features and supporting video frame features

To (1) a

Go to the first

The value of the column is such that,

is the first

Is characterized by a prototype

Is updated toTo dynamic prototype features.

wherein the cross entropy loss function is determined by equation (5):

（5），

wherein the content of the first and second substances,

and

respectively representing the height and width of the incoming challenge video frame image or the support video frame image,

represents the product of said height and said width,

is the result of the true segmentation of the image,

representing the first in the real segmentation result

Go to the first

The value of the column is such that,

is the result of the segmentation predicted by the model,

representing the result of the segmentation predicted by the model

Go to the first

The value of the column;

wherein the cross-over ratio loss function is determined by equation (6):

（6），

where a norm of the matrix is represented.

Since the segmentation task is similar to the pixel-by-pixel classification task, intensive cross-entropy losses are used as constraints, while, in order to improve the final segmentation result

And label mask

The coincidence degree index of the invention is additionally added with an intersection ratio loss, and finally the loss function of the invention is formed by combining the intersection ratio loss function and the cross entropy loss function according to a certain weight coefficient; the loss function of the small sample video object segmentation model of the invention is shown in formula (7):

（7），

wherein the content of the first and second substances,

and

representing the weight coefficients.

By using the loss function as the constraint of the training method, the training effect of the small sample video target model can be improved, and a small sample video target segmentation model which has robustness and effectively reduces noise and is based on dynamic prototype learning is obtained.

Fig. 6 is a frame diagram of a small sample video object segmentation model based on dynamic prototype learning according to an embodiment of the present invention.

The training process of the model provided by the embodiment of the present invention is further described in detail with reference to fig. 6.

As shown in FIG. 6, the model training framework provided by the present invention comprises a dynamic prototype mining module based on an optimal transmission algorithm and a multi-level dynamic boot module. In a dynamic prototype mining module based on an optimal transmission algorithm, for input supporting set and inquiry set images belonging to the same category, multi-level features are extracted through a ResNet-50 network, and then a 1x1 convolutional layer is mapped to a common measurement space. Will support the feature

Flattening and extracting the inclusion using the corresponding mask

A sequence of foreground feature vectors supporting the video frames

Sending it to a prototype generator to obtain

An object prototype, as shown in equations (8) and (9):

（8），

（9），

GAP（Global Average Polean, GAP, global average pooling) is used to average a sequence of input supporting video frame foreground feature vectors,

representing a target global feature vector, prototype generator

Composed of a full connection layer and an activation function, which can generate prototype features according to the target features input by the current support set

，

Is shown in whichkA generator for generating a prototype of the object,

is composed of

Can be based on the attention moment matrix

Foreground pixel features are assigned to these prototypes as shown in equation (1):

（1），

in order to allocate a group of semantically consistent pixel features to the same prototype, an optimal allocation matrix is obtained based on an optimal transmission theory for adjusting the mapping relationship between the pixel features and the prototype, and this process mainly solves the optimization problem shown in formulas (10) and (11):

（10），

（11），

wherein the content of the first and second substances,

the vector is a vector of all 1 s,

representing the transition matrix to be solved for,

represents the optimal solution of the transition matrix to be solved,

the weighting operation is performed on the attention moment array, and is a weighting matrix, Tr represents the trace operation of the matrix,

which represents a constant coefficient of the constant,

the entropy function of the information is represented,

for transferring matrices

The space of possible solutions of (a) is,

the dimension of expression is

Can ultimately be based on

The updating results in a robust dynamic prototype, as shown in equation (12) and equation (2):

（12）,

（2），

wherein the content of the first and second substances,

representing the operation of element-by-element multiplication between matrices.

The above process can optimize the prototype vector through multiple iterations, and simultaneously purify the distribution matrix of the support set

。

In the multi-level dynamic guiding module, for a video frame of a challenge set to be segmented, a pseudo label can be allocated to each pixel feature by using a self-adaptively generated dynamic prototype, and meanwhile, a huge calculation amount generated in a dense matching process is reduced by using a calculation mode of an intermediate bridge, as shown in formulas (3) and (4):

（3），

（4），

is the temperature factor. High-level features at low resolution may use pairsCorrespondence matrix

And (4) reconstructing the characteristics of the support video frame, and inputting the reconstructed support video frame into a decoder to predict a segmentation result. For the low-level features with high resolution, a guiding method which can suppress noise by using a dynamic prototype and has less calculation amount is used for feature reconstruction. The concrete method is according to

Selecting position indexes of similarity from the support set characteristics to obtain corresponding characteristic vectors

Obtaining dense matching results for low-level features in an indirect guided manner

As shown in equation (13):

（13），

wherein the content of the first and second substances,

is to challenge the low-level features of the video frame

To (1)jThe number of feature vectors is determined by the number of feature vectors,

is selected to correspond to

The product of the two forms the low-level feature vector of the support video frame

Representing low-level feature dense matching results

To (1)jA value.

The small sample video target segmentation model obtained through the training process can realize that a small amount of marked images are input as supports to segment targets of the same category in video frames.

As shown in fig. 7, an electronic device 700 according to an embodiment of the present invention includes a processor 701, which can perform various appropriate actions and processes according to a program stored in a Read Only Memory (ROM) 702 or a program loaded from a storage section 708 into a Random Access Memory (RAM) 703. The processor 701 may include, for example, a general purpose microprocessor (e.g., a CPU), an instruction set processor and/or associated chipset, and/or a special purpose microprocessor (e.g., an Application Specific Integrated Circuit (ASIC)), among others. The processor 701 may also include on-board memory for caching purposes. The processor 701 may comprise a single processing unit or a plurality of processing units for performing the different actions of the method flows according to embodiments of the present invention.

In the RAM 703, various programs and data necessary for the operation of the electronic apparatus 700 are stored. The processor 701, the ROM 702, and the RAM 703 are connected to each other by a bus 704. The processor 701 performs various operations of the method flow according to the embodiment of the present invention by executing programs in the ROM 702 and/or the RAM 703. It is noted that the programs may also be stored in one or more memories other than the ROM 702 and RAM 703. The processor 701 may also perform various operations of method flows according to embodiments of the present invention by executing programs stored in the one or more memories.

Electronic device 700 may also include input/output (I/O) interface 705, which input/output (I/O) interface 705 also connects to bus 704, according to an embodiment of the invention. The electronic device 700 may also include one or more of the following components connected to the I/O interface 705: an input portion 706 including a keyboard, a mouse, and the like; an output section 707 including a display such as a Cathode Ray Tube (CRT), a Liquid Crystal Display (LCD), and the like, and a speaker; a storage section 708 including a hard disk and the like; and a communication section 709 including a network interface card such as a LAN card, a modem, or the like. The communication section 709 performs communication processing via a network such as the internet. A drive 710 is also connected to the I/O interface 705 as needed. A removable medium 711 such as a magnetic disk, an optical disk, a magneto-optical disk, a semiconductor memory, or the like is mounted on the drive 710 as necessary, so that the computer program read out therefrom is mounted in the storage section 708 as necessary.

The present invention also provides a computer-readable storage medium, which may be contained in the apparatus/device/system described in the above embodiments; or may exist separately and not be assembled into the device/apparatus/system. The computer-readable storage medium carries one or more programs which, when executed, implement the method according to an embodiment of the present invention.

According to embodiments of the present invention, the computer readable storage medium may be a non-volatile computer readable storage medium, which may include, for example but is not limited to: a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the present invention, a computer readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. For example, according to embodiments of the invention, a computer-readable storage medium may include the ROM 702 and/or the RAM 703 and/or one or more memories other than the ROM 702 and the RAM 703 described above.

The above-mentioned embodiments are intended to illustrate the objects, technical solutions and advantages of the present invention in further detail, and it should be understood that the above-mentioned embodiments are only exemplary embodiments of the present invention and are not intended to limit the present invention, and any modifications, equivalents, improvements and the like made within the spirit and principle of the present invention should be included in the scope of the present invention.

Claims

1. A small sample video object segmentation method based on dynamic prototype learning comprises the following steps:

acquiring a video target to be segmented;

processing the video target to be segmented by using a small sample video target segmentation model based on dynamic prototype learning to obtain a video target segmentation result, wherein the small sample video target segmentation model based on dynamic prototype learning is obtained by training according to the following method:

processing the video frame images of the challenge set and the video frame images of the support set by utilizing a part of neural network layers of a feature extraction module of the small sample video object segmentation model to obtain low-level features of the challenge video frame and low-level features of the support video frame;

processing the video frame images of the challenge set by using all the neural network layers of the feature extraction module of the small sample video object segmentation model to obtain the features of the challenge video frame;

processing the foreground characteristics of the support video frame and the characteristics of the challenge video frame by utilizing a mining module of the small sample video target segmentation model to obtain a corresponding relation matrix;

processing the low-level features of the support video frame, the low-level features of the challenge video frame and the corresponding relation matrix by utilizing a guide module of the small sample video target segmentation model to obtain a low-level corresponding relation matrix;

processing the corresponding relation matrix and the low-level corresponding relation matrix by utilizing a segmentation module of the small sample video target segmentation model to obtain a video target segmentation result, and optimizing the small sample video target segmentation model by utilizing a loss function of the small sample video target segmentation model;

iteratively performing feature extraction operation, masking operation, mining operation, guiding operation, segmentation operation and optimization operation until the value of the loss function meets a preset condition to obtain a trained small sample video target segmentation model;

wherein, the processing the foreground characteristics of the support video frame and the characteristics of the challenge video frame by the mining module of the small sample video object segmentation model to obtain a corresponding relation matrix comprises:

processing the foreground characteristics of the support video frame by utilizing a prototype generator of the mining module to obtain dynamic prototype characteristics;

calculating the dynamic prototype characteristics and the characteristics of the challenge video frame to obtain a challenge corresponding relation matrix;

2. The method of claim 1, wherein said processing the support video frame foreground features with a prototype generator of the mining module to obtain dynamic prototype features comprises:

performing global average pooling on the foreground features of the support video frames to obtain video target prototype features;

calculating the foreground characteristic of the support video frame and the prototype characteristic of the video target by using the prototype generator to obtain an attention matrix;

and calculating the foreground characteristics of the support video frame and the optimal distribution matrix, and calculating the calculation result and the video target prototype characteristics to obtain dynamic prototype characteristics.

3. The method of claim 2, wherein the attention matrix is determined by equation (1):

（1），

is the first

The foreground feature vectors of each of the support video frames,

an index representing the foreground feature vector of the support video frame, for a length of

The foreground feature vector of the support video frame,

is in the value range of

，

Is the first

The characteristics of each prototype are characterized in that,

an index representing the prototype feature for

An instituteThe prototype feature is described as being a feature of the prototype,

is in the value range of

，

Is a matrix of support focus forces,

is the support attention force matrix

To (1)

Go to the first

Column value for indicating the second

The prototype features and

similarity of foreground feature vectors of the support video frames;

wherein the dynamic prototype feature is determined by equation (2):

（2），

is the support video frame foreground featureThe sequence of the feature vectors is then,

is the first

Character of individual prototype

representing the optimized support set attention force matrix,

is the first of the optimized support attention force matrix

A row vector.

4. The method of claim 1, wherein the processing the support video frame low-level features, the challenge video frame low-level features, and the correspondence matrix with a bootstrap module of the small sample video object segmentation model to obtain a low-level correspondence matrix comprises:

calculating the low-level characteristics of the support video frame and the intermediate corresponding relation matrix to obtain a reconstructed characteristic matrix;

and operating the reconstructed feature matrix and the low-level features of the inquiry video frame to obtain the low-level corresponding relation matrix.

5. The method of claim 1, wherein the guidance module is determined by formula (3) and formula (4):

（3），

（4），

the length of the modulus of the vector is represented,

is the first

An individual challenge video frame feature vector is generated,

an index representing the feature vector of the challenge video frame, for height and width respectively

And

of the challenge video frame image of (a),

is in the value range of

，

Go to the first

The value of the column is such that,

an assignment matrix representing the optimized dynamic prototype features and the support video frame foreground features,

and the softmax represents a normalized exponential function.

6. The method of claim 1, wherein the loss function of the small sample video object segmentation model comprises a cross-over ratio loss function and a cross-entropy loss function;

wherein the cross entropy loss function is determined by equation (5):

（5），

wherein the content of the first and second substances,

and

representing the product of said height and said width,

is the result of the real segmentation,

representing the first in the real segmentation result

Go to the first

The value of the column is such that,

is the result of the segmentation predicted by the model,

representing the result of the segmentation predicted by the model

Go to the first

The value of the column;

wherein the cross-over ratio loss function is determined by equation (6):

（6），

wherein the content of the first and second substances,

representing a norm of the matrix.

7. An electronic device, comprising:

one or more processors;

a storage device for storing one or more programs,

wherein the one or more programs, when executed by the one or more processors, cause the one or more processors to perform the method of any of claims 1-6.

8. A computer readable storage medium having stored thereon executable instructions which, when executed by a processor, cause the processor to perform the method of any one of claims 1 to 6.