CN116824140A - Small sample segmentation method for test scene non-mask supervision - Google Patents

Small sample segmentation method for test scene non-mask supervision

Info

Publication number
CN116824140A
CN116824140A (application CN202310719486.3A)
Authority
CN
China
Prior art keywords
mask
image
query
support
prediction
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310719486.3A
Other languages
Chinese (zh)
Inventor
于云龙 (Yu Yunlong)
陈善娟 (Chen Shanjuan)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zhejiang University ZJU
Original Assignee
Zhejiang University ZJU
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Zhejiang University ZJU filed Critical Zhejiang University ZJU
Priority to CN202310719486.3A priority Critical patent/CN116824140A/en
Publication of CN116824140A publication Critical patent/CN116824140A/en
Pending legal-status Critical Current

Landscapes

  • Image Analysis (AREA)

Abstract

The invention discloses a small sample segmentation method for test scene non-mask supervision, which comprises the following steps: acquiring an image dataset for training a deep neural network model; using image pairs constructed from the image dataset and their masks as supervision signals, training a pre-designed iteratively optimized deep neural network model without masks through cross entropy loss; and outputting, with the trained deep neural network model and without mask supervision, the prediction mask of the image to be segmented. The method suits small sample segmentation tasks where large-scale annotated data are available during training but, at test time, no dense annotation is available beyond another image belonging to the same category as the image to be segmented. It addresses the problem that current small sample segmentation still requires dense annotation of new-class images during inference, and achieves good segmentation performance using only image pairs, particularly for single-target segmentation.

Description

Small sample segmentation method for test scene non-mask supervision
Technical Field
The invention belongs to the field of computer vision, and particularly relates to a test scene non-mask supervision-oriented small sample segmentation method, computer equipment and a storage medium.
Background
Semantic segmentation is one of the important basic tasks in computer vision and is widely applied in areas such as autonomous driving and medical image segmentation. With continuous improvements in network structures and models, semantic segmentation has achieved excellent performance. However, semantic segmentation models are data-driven: they require large-scale image data with dense pixel-level annotations for training and cannot generalize to new categories. Existing semantic segmentation models can only segment target categories seen during training; when images of a new category appear, a large number of densely annotated samples of that category must be collected to retrain the model, which is impractical in real applications and limits further adoption.
Unlike existing machine learning models that rely on big-data training, humans can use previously accumulated knowledge to quickly recognize a new concept from just one or a few samples of a new class. Inspired by how humans learn new knowledge, small sample (few-shot) learning was proposed to narrow the gap between artificial intelligence and human learning. Small sample learning reduces the need for large-scale annotated datasets and has attracted much attention in recent years. In turn, the small sample semantic segmentation task extends the idea of small sample learning to segmentation: it segments new-class targets given only one or a few densely annotated samples, thereby reducing the demand for large-scale densely annotated data in semantic segmentation.
In the small sample segmentation task, one or several densely annotated reference samples of the new class must still be provided when segmenting new-class images at test time. Although the dependence on large-scale data is greatly reduced compared with the traditional semantic segmentation task, a certain amount of annotated data is still required.
Disclosure of Invention
To solve the above problems, the present invention aims to provide a small sample segmentation method for test scene non-mask supervision, which needs no densely annotated sample information during inference and uses only another image containing targets of the same category as the image to be segmented to provide guidance, further reducing the need for annotated data. Compared with densely annotated mask information, such an image is very inexpensive and easy to acquire: a large number of similar images can be obtained through image search to construct the input image pairs. By training the model on existing large-scale densely annotated datasets, it learns the ability to mine the class features shared by the two images. The model can then segment the target using only an image pair containing the same category, which greatly reduces the time- and labor-consuming annotation process, reduces human interaction, and yields stronger generalization.
In order to achieve the above purpose, the invention adopts the following technical scheme:
the small sample segmentation method for test scene non-mask supervision comprises the following steps:
S1, acquiring an image dataset for training a deep neural network model;
S2, using image pairs constructed from the image dataset and their masks as supervision signals, training a pre-designed iteratively optimized deep neural network model without masks through cross entropy loss;
S3, using the trained deep neural network model to output, without mask supervision, a prediction mask of the image to be segmented.
Further, in step S1, the training image dataset is D_train = {(x_s^i, m_s^i, x_q^i, m_q^i)}_{i=1}^N, where N is the number of tasks (episodes) constructed from the training set, x_s^i and m_s^i respectively denote the i-th support image providing category guidance information and its corresponding mask, x_q^i and m_q^i respectively denote the i-th query image to be segmented and its corresponding mask, and the episode class label M_i ∈ {1, …, C}, where C is the total number of categories and each category contains multiple images. The sampled masks are preprocessed in advance: all target-object classes appearing simultaneously in the support and query images are set as foreground, and the rest is set as background.
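The episode preprocessing described above can be sketched in NumPy as follows. This is a minimal illustration under the assumption that label 0 marks background in the raw annotations; the function name is hypothetical, not taken from the patent:

```python
import numpy as np

def binarize_episode_masks(support_mask, query_mask):
    """Set every class that appears in BOTH the support and query masks to
    foreground (1) and everything else to background (0), as required for
    an episode. Assumes label 0 is background in the raw annotations."""
    shared = (set(np.unique(support_mask)) & set(np.unique(query_mask))) - {0}
    support_fg = np.isin(support_mask, list(shared)).astype(np.uint8)
    query_fg = np.isin(query_mask, list(shared)).astype(np.uint8)
    return support_fg, query_fg
```

For example, if class 2 is the only label present in both masks, every pixel of class 2 becomes foreground in both binarized masks and all other labels become background.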
Further, the deep neural network model includes three components: a feature extractor, a mask generation module, and a decoder. The mask generation module and the decoder are the parts requiring training, while the feature extractor is a ResNet-50 pre-trained on the ImageNet dataset.
Further, the step S2 specifically includes:
S21, adopting a batch processing mode when training the deep neural network model: a batch of image pairs is randomly sampled from the image dataset into a sample set B = {(x_s^i, m_s^i, x_q^i, m_q^i)}_{i=1}^{N_bs}, where the batch size N_bs is preset;
S22, inputting the images in the image-pair sample set B into the feature extractor to obtain feature maps of the support image and the query image {F_s^l, F_q^l}_{l=1}^L, where L is the total number of layers over the modules of the feature extractor;
S23, feeding the high-level feature maps of the support image and the query image from the last module of the feature extractor into the mask generation module, which obtains an initial support prediction mask m̂_s using similarity calculation and a cross-attention mechanism; this mask serves the subsequent segmentation task, and its cross entropy loss with respect to the true support mask is calculated;
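The similarity-based part of this module — scoring support locations by their feature similarity to the query image — can be sketched as below. This is a hypothetical NumPy illustration, not the patent's actual implementation: the max-pooling over query locations and the sigmoid squashing are assumptions, and the cross-attention refinement is omitted:

```python
import numpy as np

def initial_support_mask(feat_s, feat_q):
    """Score each support location by its best cosine similarity to any
    query location; locations depicting the class shared by both images
    score high. feat_s, feat_q: flattened (H*W, C) high-level features."""
    fs = feat_s / (np.linalg.norm(feat_s, axis=1, keepdims=True) + 1e-8)
    fq = feat_q / (np.linalg.norm(feat_q, axis=1, keepdims=True) + 1e-8)
    sim = fs @ fq.T                       # (HW_s, HW_q) cosine similarities
    score = sim.max(axis=1)               # best match in the query image
    return 1.0 / (1.0 + np.exp(-score))   # soft mask in (0, 1)
```

A support location whose feature aligns with some query location receives a score near sigmoid(1), while an unmatched location stays near sigmoid(0) = 0.5 before any further refinement.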
S24, inputting the feature maps of the support and query images from the last two modules of the feature extractor into the feature enhancement module in the decoder, which captures global information with a self-attention mechanism and enhances the support and query features;
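A parameter-shared self-attention enhancement of this kind might look like the following sketch, assuming single-head attention with a residual connection; both the weight shapes and the residual are assumptions, not details from the patent:

```python
import numpy as np

def self_attention_enhance(feat, W_q, W_k, W_v):
    """Single-head self attention with a residual connection. The SAME
    weight matrices are applied to both the support and the query features,
    mirroring the parameter sharing described above. feat: (N, C)."""
    q, k, v = feat @ W_q, feat @ W_k, feat @ W_v
    logits = q @ k.T / np.sqrt(q.shape[1])
    w = np.exp(logits - logits.max(axis=1, keepdims=True))
    w /= w.sum(axis=1, keepdims=True)     # row-wise softmax
    return feat + w @ v                   # residual keeps original features
```

Because the same W_q, W_k, W_v are applied to both feature sets, the module is biased toward responses that are consistent across the two images, which is the stated motivation for sharing parameters.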
S25, inputting the initial support prediction mask m̂_s together with the support and query features output by the feature enhancement module into the Transformer structure in the decoder, where the query features serve as Query, the support features as Key, and m̂_s as Value; the similarity between the query and support features is calculated and used to weight m̂_s, yielding an initial prediction mask of the query image, which is fused with the low-level features of the query and support images from the first two modules of the feature extractor, and the query prediction mask M_q is obtained through the decoder;
S26, repeating step S24, then taking the query prediction mask M_q obtained in step S25 as the Value of the Transformer module in the decoder, the support features as Query, and the query features as Key; an initial prediction mask of the support image is calculated, fused with the low-level features of the query and support images from the first two modules of the feature extractor, and the support prediction mask M_s is obtained through the decoder;
S27, repeating steps S24 to S26 for T iterations; after the iterations end, the cross entropy losses between the query and support prediction masks obtained in the last iteration and the true mask values are calculated;
S28, obtaining a total loss function based on the cross entropy losses between the prediction masks and the true masks:
L_total = α·L(m̂_s, m_s) + β·L(M_q, m_q) + γ·L(M_s, m_s)
where α, β, γ are preset weight parameters representing the contributions of the three prediction masks;
S29, training the deep neural network model with the Adam optimizer or a stochastic gradient descent optimizer with momentum and the back propagation algorithm, according to the obtained total loss function.
further, in step S28, the cross entropy loss function L is:
wherein the method comprises the steps ofFor the true label of image i at position j, +.>For the model to be the output prediction of image i at position j, HW is the number of pixels of the whole image.
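A minimal NumPy version of this per-pixel cross entropy, together with the weighted combination from step S28, might read as follows; the epsilon clipping and the default weights are assumptions for numerical safety and illustration, not values from the patent:

```python
import numpy as np

def pixel_bce(y_true, y_pred, eps=1e-7):
    """Binary cross entropy averaged over all HW pixels of one image,
    as in the loss L above. y_true in {0,1}; y_pred are probabilities."""
    p = np.clip(y_pred, eps, 1.0 - eps)   # clip to avoid log(0)
    return float(-np.mean(y_true * np.log(p) + (1 - y_true) * np.log(1 - p)))

def total_loss(l_init, l_query, l_support, alpha=1.0, beta=1.0, gamma=1.0):
    """Weighted sum of the three cross-entropy terms from step S28;
    the default weights are placeholders, not values from the patent."""
    return alpha * l_init + beta * l_query + gamma * l_support
```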
Further, in step S3, the operation without mask supervision during inference is as follows: an arbitrary image containing a target of the same category as the image to be segmented is selected to form an image pair, the image pair is input into the trained deep neural network model for prediction, and the prediction mask M_q of the query image generated in the last iteration is output as the final segmentation result.
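Inference thus reduces to pairing the image with any retrieved same-class image and taking the last iteration's query mask. A hypothetical wrapper could look like the sketch below; the `model` callable and its signature are assumptions for illustration:

```python
import numpy as np

def maskless_inference(model, image_to_segment, same_class_image, T=3):
    """No ground-truth mask is supplied at test time: the support input is
    just any image known to contain the same class (e.g. retrieved by image
    search). `model` returns one query mask per iteration; the last one is
    the final prediction."""
    masks = model(image_to_segment, same_class_image, iterations=T)
    return masks[-1]
```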
Compared with the existing method, the small sample segmentation method for the test scene non-mask supervision has the following contributions:
Firstly, the invention proposes a new scene setting in which only image pairs are used to perform the small sample segmentation task during inference, and provides a small sample segmentation method for test scene non-mask supervision.
Secondly, a mask generation module is designed, which generates an initial support prediction mask from the high-level semantic features of the support and query images, providing the segmentation task with category information about the target to be segmented. A feature enhancement module is proposed, which learns global features with a self-attention mechanism; its parameters are shared between the support and query features, so the category information of targets present in both images is mined more effectively. An alternating iterative optimization module is proposed: by alternating the positions at which the query and support features enter the Transformer module, the query and support prediction masks are optimized in turn, with each iteration refining the prediction mask output by the previous one.
Finally, the invention needs no densely annotated reference sample for segmentation during inference; it feeds only easily obtained images into the network for prediction, greatly reducing the segmentation task's demand for densely annotated samples, achieving competitive performance on single-target segmentation, and exhibiting a certain generalization capability.
The small sample segmentation method for test scene non-mask supervision has good application value: the whole process is end-to-end and fully automatic, requiring no labeling data beyond the training data. For example, in domains with few annotated samples, samples can first be coarsely annotated with the method proposed herein and then corrected manually; and images of categories without semantic labels can be segmented as long as images of the same category can be retrieved.
Drawings
Fig. 1 is a flow chart of a small sample segmentation method for test scene non-mask supervision.
Detailed Description
The present invention will be described in further detail with reference to the drawings and examples, in order to make the objects, technical solutions and advantages of the present invention more apparent. It should be understood that the specific embodiments described herein are for purposes of illustration only and are not intended to limit the scope of the invention.
On the contrary, the invention is intended to cover any alternatives, modifications, equivalents, and variations as may be included within the spirit and scope of the invention as defined by the appended claims. Further, in the following detailed description of the present invention, certain specific details are set forth in order to provide a better understanding of the present invention. The present invention will be fully understood by those skilled in the art without the details described herein.
Referring to fig. 1, in a preferred embodiment of the present invention, a small sample segmentation method for test scene maskless supervision includes the following steps:
First, an image dataset for training the deep neural network model is acquired. The image dataset is D = {(x_s^i, m_s^i, x_q^i, m_q^i)}_{i=1}^N, where N is the number of tasks (episodes) constructed from the image dataset, x_s^i and m_s^i respectively denote the i-th support image providing category guidance information (support) and its corresponding mask, x_q^i and m_q^i respectively denote the i-th query image to be segmented (query) and its corresponding mask, and the episode class label M_i ∈ {1, …, C}, where C is the total number of categories and each category contains multiple images. The sampled masks are processed so that all target-object classes appearing simultaneously in the support image and the query image are set as foreground, and the rest is set as background.
Secondly, using the image pairs constructed from the image dataset and their masks as supervision signals, a pre-designed iteratively optimized deep neural network model is trained without masks through cross entropy loss. The deep neural network model comprises three components: a feature extractor, a mask generation module, and a decoder. The feature extractor is an ImageNet-pre-trained ResNet-50 whose network parameters are frozen during training; the mask generation module and the decoder are the parts to be trained. The training specifically comprises the following steps:
In the first step, a batch processing mode is adopted when training the deep neural network model: a batch of image pairs is randomly sampled from the image dataset into a sample set B = {(x_s^i, m_s^i, x_q^i, m_q^i)}_{i=1}^{N_bs}, where the batch size N_bs is preset.
In the second step, the images in the image-pair sample set B are input into the feature extractor to obtain feature maps of the support image and the query image {F_s^l, F_q^l}_{l=1}^L, where L is the total number of layers over the modules of the feature extractor.
In the third step, the high-level feature maps of the support image and the query image from the last module of the feature extractor pass through the mask generation module, which obtains an initial support prediction mask m̂_s using similarity calculation and a cross-attention mechanism; the mask is used for the subsequent segmentation task, and its cross entropy loss with respect to the true support mask is calculated.
In the fourth step, the feature maps of the support and query images from the last two modules of the feature extractor are input into the feature enhancement module in the decoder, which captures global information with a self-attention mechanism and enhances the support and query features.
In the fifth step, the initial support prediction mask m̂_s and the support and query features output by the feature enhancement module are input into the Transformer module, where the query features serve as Query, the support features as Key, and m̂_s as Value; the similarity between the query and support features is calculated and used to weight m̂_s, yielding an initial prediction mask of the query image, which is fused with the low-level features of the query and support images from the first two modules of the feature extractor, and the query prediction mask M_q is obtained through the decoder.
In the sixth step, the fourth step is repeated; the query prediction mask M_q obtained in the fifth step is then taken as the Value of the Transformer module, the support features as Query, and the query features as Key; an initial prediction mask of the support image is calculated, fused with the low-level features of the query and support images from the first two modules of the feature extractor, and the support prediction mask M_s is obtained through the decoder.
In the seventh step, the fourth, fifth, and sixth steps are iterated T times; after the iterations end, the cross entropy losses between the query and support prediction masks obtained in the last iteration and the true mask values are calculated.
In the eighth step, a total loss function is obtained based on the cross entropy losses between the prediction masks and the true masks:
L_total = α·L(m̂_s, m_s) + β·L(M_q, m_q) + γ·L(M_s, m_s)
where α, β, γ are preset weight parameters representing the contributions of the three prediction masks.
In the ninth step, the deep neural network model is trained with the Adam optimizer or a stochastic gradient descent optimizer with momentum and the back propagation algorithm, according to the obtained total loss function.
Finally, the trained deep neural network model is used, without mask supervision, to output the prediction mask of the image to be segmented. During inference, an arbitrary image belonging to the same category as the target to be segmented is selected to form an image pair, the image pair is input into the trained deep neural network model for prediction, and the prediction mask M_q of the query image generated in the last iteration is output as the final segmentation result.
Through the above technical scheme, the embodiment of the invention realizes a small sample segmentation method for test scene non-mask supervision. The invention needs no densely annotated reference sample for segmentation during inference; it feeds only easily obtained images into the network for prediction, greatly reducing the segmentation task's demand for densely annotated samples, while retaining a certain generalization capability.
The embodiment of the invention also provides a computer, which comprises a memory and a processor, wherein the memory and the processor are in communication connection, the memory stores computer instructions, and the processor executes the computer instructions, so that the steps of the small sample segmentation method for test scene non-mask supervision shown in the embodiment are executed.
An embodiment of the present invention further provides a computer readable storage medium storing computer instructions which, when run on a computer, cause the computer to execute the steps of the small sample segmentation method for test scene non-mask supervision shown in the above embodiment.
It will be appreciated by those skilled in the art that the modules or steps of the invention may be implemented in a general purpose computing device, they may be concentrated on a single computing device, or distributed across a network of computing devices, they may alternatively be implemented in program code executable by computing devices, so that they may be stored in a memory device for execution by computing devices, and in some cases, the steps shown or described may be performed in a different order than that shown or described herein, or they may be separately fabricated into individual integrated circuit modules, or multiple modules or steps within them may be fabricated into a single integrated circuit module. Thus, the present invention is not limited to any specific combination of hardware and software.
The foregoing description of the preferred embodiments of the invention is not intended to be limiting, but rather is intended to cover all modifications, equivalents, and alternatives falling within the spirit and principles of the invention.

Claims (7)

1. The small sample segmentation method for test scene non-mask supervision, characterized by comprising the following steps:
S1, acquiring an image dataset for training a deep neural network model;
S2, using image pairs constructed from the image dataset and their masks as supervision signals, training a pre-designed iteratively optimized deep neural network model without masks through cross entropy loss;
S3, using the trained deep neural network model to output, without mask supervision, a prediction mask of the image to be segmented.
2. The small sample segmentation method for test scene non-mask supervision according to claim 1, wherein in step S1, the image dataset is D = {(x_s^i, m_s^i, x_q^i, m_q^i)}_{i=1}^N, where N is the number of tasks constructed from the image dataset, x_s^i and m_s^i respectively denote the i-th support image providing category guidance information and its corresponding mask, x_q^i and m_q^i respectively denote the i-th query image to be segmented and its corresponding mask, and the episode class label M_i ∈ {1, …, C}, where C is the total number of categories and each category contains multiple images.
3. The small sample segmentation method for test scene-oriented non-mask supervision according to claim 2, further comprising processing the obtained mask to set all kinds of target objects simultaneously appearing in the support image and the query image as foreground and the rest as background in step S1.
4. The small sample segmentation method for test scene non-mask supervision according to claim 3, wherein the deep neural network model includes three components: a feature extractor, a mask generation module, and a decoder; the mask generation module and the decoder are the parts requiring training, and the feature extractor is a ResNet-50 pre-trained on the ImageNet dataset.
5. The small sample segmentation method for test scene-oriented non-mask supervision as recited in claim 4, wherein step S2 specifically includes:
S21, adopting a batch processing mode when training the deep neural network model: a batch of image pairs is randomly sampled from the image dataset into a sample set B = {(x_s^i, m_s^i, x_q^i, m_q^i)}_{i=1}^{N_bs}, where the batch size N_bs is preset;
S22, inputting the image pairs in the image-pair sample set B into the feature extractor to obtain feature maps of the support image and the query image {F_s^l, F_q^l}_{l=1}^L, where L is the total number of layers over the modules of the feature extractor;
S23, passing the high-level semantic feature maps of the support image and the query image from the last module of the feature extractor through the mask generation module, which obtains an initial support prediction mask m̂_s using similarity calculation and a cross-attention mechanism, for the subsequent segmentation task, and calculating the cross entropy loss between the support prediction mask and the true support mask value;
S24, inputting the feature maps of the support and query features from the last two modules of the feature extractor into the feature enhancement module in the decoder, capturing global information with a self-attention mechanism and enhancing the support and query features;
S25, inputting the initial support prediction mask m̂_s together with the support and query features output by the feature enhancement module into the Transformer module in the decoder, where the query features serve as Query, the support features as Key, and m̂_s as Value; the similarity between the query and support features is calculated and used to weight m̂_s, yielding an initial prediction mask of the query image, which is fused with the low-level features of the query and support images from the first two modules of the feature extractor, and the query prediction mask M_q is obtained through the decoder;
S26, repeating step S24, then taking the query prediction mask M_q obtained in step S25 as the Value of the Transformer module in the decoder, the support features as Query, and the query features as Key; an initial prediction mask of the support image is calculated, fused with the low-level features of the query and support images from the first two modules of the feature extractor, and the support prediction mask M_s is obtained through the decoder;
S27, repeating steps S24 to S26 for T iterations; after the iterations end, the cross entropy losses between the query and support prediction masks obtained in the last iteration and the true mask values are calculated;
S28, obtaining a total loss function based on the cross entropy losses between the prediction masks and the true masks:
L_total = α·L(m̂_s, m_s) + β·L(M_q, m_q) + γ·L(M_s, m_s)
where α, β, γ are preset weight parameters representing the contributions of the three prediction masks;
S29, training the deep neural network model with the Adam optimizer or a stochastic gradient descent optimizer with momentum and the back propagation algorithm, according to the obtained total loss function.
6. The small sample segmentation method for test scene non-mask supervision according to claim 5, wherein in step S28, the cross entropy loss function L is:
L = −(1/HW) · Σ_{j=1}^{HW} [ y_i^j · log p_i^j + (1 − y_i^j) · log(1 − p_i^j) ]
where y_i^j is the true label of image i at position j and p_i^j is the model's output prediction for image i at position j.
7. The small sample segmentation method for test scene non-mask supervision according to any one of claims 1-6, wherein in step S3, the operation without mask supervision during inference is as follows: an arbitrary image belonging to the same category as the object to be segmented is selected to form an image pair, the image pair is input into the trained deep neural network model for prediction, and the prediction mask M_q of the query image to be segmented generated in the last iteration is output as the final segmentation prediction result.
CN202310719486.3A 2023-06-16 2023-06-16 Small sample segmentation method for test scene non-mask supervision Pending CN116824140A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310719486.3A CN116824140A (en) 2023-06-16 2023-06-16 Small sample segmentation method for test scene non-mask supervision

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310719486.3A CN116824140A (en) 2023-06-16 2023-06-16 Small sample segmentation method for test scene non-mask supervision

Publications (1)

Publication Number Publication Date
CN116824140A true CN116824140A (en) 2023-09-29

Family

ID=88119675

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310719486.3A Pending CN116824140A (en) 2023-06-16 2023-06-16 Small sample segmentation method for test scene non-mask supervision

Country Status (1)

Country Link
CN (1) CN116824140A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117593530A (en) * 2024-01-19 2024-02-23 杭州灵西机器人智能科技有限公司 Dense carton segmentation method and system
CN117593530B (en) * 2024-01-19 2024-06-04 杭州灵西机器人智能科技有限公司 Dense carton segmentation method and system

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination