CN112801105B - Two-stage zero sample image semantic segmentation method - Google Patents

Two-stage zero sample image semantic segmentation method

Info

Publication number
CN112801105B
CN112801105B (application CN202110093474.5A)
Authority
CN
China
Prior art keywords
image
semantic
segmentation
stage
edge
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202110093474.5A
Other languages
Chinese (zh)
Other versions
CN112801105A (en)
Inventor
刘亚洁 (Liu Yajie)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zhejiang Lab
Original Assignee
Zhejiang Lab
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Zhejiang Lab filed Critical Zhejiang Lab
Priority to CN202110093474.5A priority Critical patent/CN112801105B/en
Publication of CN112801105A publication Critical patent/CN112801105A/en
Application granted granted Critical
Publication of CN112801105B publication Critical patent/CN112801105B/en
Legal status: Active

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00 Arrangements for image or video recognition or understanding
    • G06V 10/20 Image preprocessing
    • G06V 10/26 Segmentation of patterns in the image field; Cutting or merging of image elements to establish the pattern region, e.g. clustering-based techniques; Detection of occlusion
    • G06V 10/267 Segmentation of patterns in the image field; Cutting or merging of image elements to establish the pattern region, e.g. clustering-based techniques; Detection of occlusion by performing operations on regions, e.g. growing, shrinking or watersheds
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/045 Combinations of networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Biomedical Technology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Multimedia (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a two-stage zero-sample image semantic segmentation method comprising a category-independent foreground-background image segmentation module and a zero-sample target classification module. The category-independent foreground-background segmentation adopts a two-stage image segmentation framework based on Mask R-CNN, assisted by inner and outer edge discriminators and an edge self-supervision module that improve the accuracy of foreground-background segmentation. The zero-sample target classification module is based on the CADA-VAE algorithm and uses DeepInversion to reversely generate visual features, which reduces the domain distance between visual features and semantic features and improves zero-sample target classification accuracy. After training on known targets, the method obtains good image segmentation performance on unknown targets, greatly reduces the need for samples and tedious manual labeling, lowers labeling costs in professional fields such as medicine, and greatly improves the performance of image semantic segmentation in zero-sample and few-sample scenarios.

Description

Two-stage zero sample image semantic segmentation method
Technical Field
The invention relates to the field of deep learning image segmentation, in particular to a two-stage zero-sample (zero-shot) image semantic segmentation (ZSS) method.
Background
With the development of computer vision and image technology, deep learning, by virtue of its high performance, has been widely applied in fields such as image classification, image detection, and image segmentation, and has rapidly reached the state of the art in each of them. Image semantic segmentation, one of the basic computer vision problems (alongside image classification and object recognition/detection), is widely used in autonomous driving, medical imaging, industrial inspection, and other fields. However, current fully supervised semantic segmentation methods rely heavily on dense pixel-level semantic labels. Pixel-level semantic labels are expensive in labor and time to acquire, and in professional fields such as medical imaging, where the labeling threshold is high, the labeling cost can be prohibitive. To reduce labeling cost, algorithms based on weak labels (such as image-level labels and bounding-box labels) and few labels (such as few-shot learning) have attracted extensive attention and research. The zero-sample segmentation problem, which is both more practically significant and more challenging, has not yet received extensive attention and research.
Current zero-sample target segmentation methods are one-stage, DeepLab-style methods that predict class semantic information at the pixel level. Such methods have two major problems: 1) the holistic information of the target is not used, so different parts of one object may be predicted as different categories; 2) pixel-level prediction introduces noise into the prediction mask, i.e., irregular noise regions may be predicted on the background.
Disclosure of Invention
In order to overcome the defects of the prior art and improve the performance of zero-sample target segmentation, the invention adopts the following technical scheme:
a two-stage zero sample image semantic segmentation method comprises the following steps:
s1, based on Mask-RCNN two-stage irrelevant foreground and background image segmentation, based on Mask-RCNN two-stage image segmentation frame, the classification branch of the second stage is changed into only distinguishing the two classes of the front and background, after the image passes through RPN, the image is sent to the second stage to be classified into the front and background, fine adjustment of the detection frame and segmentation of the foreground, after the image passes through Mask-RCNN, the foreground detection frame and the foreground Mask of an object irrelevant to the class are obtained, and because the classification branch does not distinguish the object class, the method can be ensured to obtain the detection frame and the foreground Mask of an unknown class when the method is tested after training on a known class;
s2, zero sample target classification is carried out based on CADA-VAE, automatic coding and decoding of a visual characteristic domain and a semantic characteristic domain are respectively carried out by adopting a variational self-encoder method, the visual characteristic and the semantic characteristic are converted into a common hidden variable characteristic space, high reconstruction accuracy of the visual characteristic and the semantic characteristic is guaranteed, a hidden variable characteristic with strong characterization capability is obtained, cross-domain alignment of the visual characteristic domain and the semantic characteristic domain is guaranteed, the domain distance between the visual characteristic domain and the semantic characteristic domain is reduced by adding cross-domain coding and decoding supervision, an unknown class can be connected with the visual characteristic through the semantic characteristic at high accuracy, then a classifier is trained based on the hidden variable characteristic converted by the unknown class semantic characteristic, an encoder E and a decoder D are given, and the loss of cross-alignment is as follows:
$\mathcal{L}_{CA} = \sum_{i}\sum_{j \neq i} \big| x_j - D_j\big(E_i(x_i)\big) \big|$
where x represents the visual or semantic features of the input and i, j represent different domains.
Further, an edge self-supervision module and an inner/outer edge discriminator module are added to the Mask R-CNN image segmentation branch in step S1 to assist image foreground segmentation.
Furthermore, the edge self-supervision module is embodied as an equivariance constraint: the input image is affine-transformed and fed into the foreground-background classification network to obtain a segmentation result, and this result should be the same as the result obtained by applying the same affine transformation to the segmentation of the original input image. This module can effectively suppress noise in the segmentation result and guarantee its consistency. Given the foreground-background classification network $F_\theta$ and an affine transformation matrix $A$, the constraint is $F_\theta(Ax) \approx A\,F_\theta(x)$, and the edge self-supervision loss is defined as:

$\mathcal{L}_{e} = \big\lVert w' \odot \big(F_\theta(Ax) - A\,F_\theta(x)\big) \big\rVert_1$

where $x$ denotes the input picture to be segmented and $w'$ denotes the weight matrix of $F_\theta(Ax)$.
Further, the inner/outer edge discriminator module is divided into an inner edge discriminator and an outer edge discriminator: the inner edge discriminator judges whether the edge lies inside the object, and the outer edge discriminator judges whether the segmented edge contains image background. During training, the annotation mask is dilated to obtain a simulated outer edge and eroded to obtain a simulated inner edge, and the inner/outer edge discriminators judge whether a given edge is an inner or outer edge. A generative-adversarial training scheme between generator and discriminators assists the generator to produce higher-precision edges, which assists image foreground segmentation and yields higher segmentation accuracy.
Further, the discriminator adopts a multilayer perceptron.
Further, in step S2, DeepInversion is used to reversely generate visual features that assist zero-sample target classification: DeepInversion reversely generates a visual feature map from a trained model, and this feature map is added as a visual feature to the CADA-VAE zero-sample target classification method, so as to align the semantic features of unknown classes with visual features, reduce the domain distance between the visual features and the semantic features of unknown classes, and improve classification accuracy.
Further, DeepInversion incorporates a teacher network and a student network, i.e., knowledge distillation, and supervises a KL-divergence loss on the obtained features, which increases the diversity of the generated images.
Further, supervision of the running mean and running variance of each BN layer in the trained model is added, together with supervision of the generated image, i.e., the two-norm and variance of the visual feature map reversely generated from the open-source model, which increases the realism of the generated image. Let $l$ index the layers of the network and let $\mu_l$ and $\sigma_l^2$ denote the mean and variance of layer $l$; the BN regularization is:

$R_{\mathrm{feature}}(\hat{x}) = \sum_l \big\lVert \mu_l(\hat{x}) - \mathbb{E}\big[\mu_l(x)\mid X\big] \big\rVert_2 + \sum_l \big\lVert \sigma_l^2(\hat{x}) - \mathbb{E}\big[\sigma_l^2(x)\mid X\big] \big\rVert_2$

where $\mathbb{E}$ denotes expectation, $X$ denotes the data distribution, $x$ denotes an image before synthesis, and $\hat{x}$ denotes the synthesized image.
Furthermore, the first stage simultaneously obtains the circumscribed rectangle (bounding box) of the object, and the visual features are obtained by passing the content of the bounding box through a network layer.
Furthermore, the semantic features are semantic word vectors or attribute vectors: the semantic word vectors are obtained from trained NLP models such as BERT, and the attribute vectors are provided by existing data sets.
The invention has the advantages and beneficial effects that:
the method completely avoids expensive manpower and time cost consumed by sample labeling in the fully supervised semantic segmentation method, can be quickly applied to various fields, and particularly promotes the related methods in professional fields to be quickly improved.
Drawings
Fig. 1 is a framework diagram of the present invention.
Detailed Description
The following describes in detail embodiments of the present invention with reference to the drawings. It should be understood that the detailed description and specific examples, while indicating the present invention, are given by way of illustration and explanation only, not limitation.
The invention largely avoids and suppresses the problems of current methods, ultimately improves the performance of zero-sample target segmentation, promotes research progress in this field, and accelerates applications in scientific research and engineering. A brand-new two-stage segmentation algorithm is adopted to transfer knowledge from semantic segmentation of known classes (classes for which semantic labels are available) to semantic segmentation of unknown classes (classes for which semantic labels are unavailable). Mask R-CNN works in two stages: the first stage scans the image and generates proposals (i.e., regions likely to contain an object), and the second stage classifies the proposals and generates detection boxes and masks.
A two-stage zero sample image semantic segmentation method comprises the following steps:
1) Mask R-CNN-based two-stage category-independent foreground-background image segmentation
The two-stage image segmentation framework based on Mask R-CNN changes the second-stage classification branch so that it distinguishes only foreground and background; after passing through the RPN (Region Proposal Network), the image is sent to the second stage for foreground/background classification, detection-box refinement, and foreground segmentation. After passing through Mask R-CNN, a category-independent foreground detection box and foreground mask of the object are obtained; because the classification branch does not distinguish object categories, the method is guaranteed to obtain detection boxes and foreground masks for unknown categories when tested after training on known categories.
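As an illustration of this first stage, the following minimal sketch (not the patent's code; the torchvision model and predictor names are assumptions of this example) configures a Mask R-CNN whose box and mask heads predict only two categories, background and a generic class-agnostic foreground:

```python
# Hypothetical sketch: class-agnostic Mask R-CNN whose second-stage heads
# distinguish only background (0) from generic foreground (1).
import torchvision
from torchvision.models.detection.faster_rcnn import FastRCNNPredictor
from torchvision.models.detection.mask_rcnn import MaskRCNNPredictor

def build_class_agnostic_mask_rcnn():
    model = torchvision.models.detection.maskrcnn_resnet50_fpn(weights="DEFAULT")
    num_classes = 2  # background + class-agnostic foreground
    in_features = model.roi_heads.box_predictor.cls_score.in_features
    model.roi_heads.box_predictor = FastRCNNPredictor(in_features, num_classes)
    in_channels_mask = model.roi_heads.mask_predictor.conv5_mask.in_channels
    model.roi_heads.mask_predictor = MaskRCNNPredictor(in_channels_mask, 256, num_classes)
    return model
```

Trained this way on known classes, the detector outputs foreground boxes and masks without committing to any object category, which is what allows it to generalize to unknown classes at test time.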
2) Edge self-supervision and inner and outer edge discriminator assisted image foreground segmentation
An edge self-supervision module and an inner/outer edge discriminator module are added to the Mask R-CNN image segmentation branch to assist image foreground segmentation.
The edge self-supervision module is embodied as an equivariance constraint. The input image is affine-transformed and fed into the foreground-background classification network to obtain a segmentation result, which should be the same as the result obtained by applying the same affine transformation to the segmentation of the original input image. This module can effectively suppress noise in the segmentation result and guarantee its consistency. Given a segmentation network $F_\theta$ and a set of affine transformation matrices $A$, the constraint is $F_\theta(Ax) \approx A\,F_\theta(x)$, and the edge self-supervision loss is defined as:

$\mathcal{L}_{e} = \big\lVert w' \odot \big(F_\theta(Ax) - A\,F_\theta(x)\big) \big\rVert_1$

where $x$ denotes the input picture to be segmented and $w'$ denotes the weight matrix of $F_\theta(Ax)$.
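A minimal sketch of this equivariance term is given below, assuming a foreground-probability network; the exact weighting scheme $w'$ is not specified in the text, so deriving the per-pixel weights from the prediction on the transformed image is an assumption of this example:

```python
# Hedged sketch of the edge self-supervision (equivariance) loss.
import torch
import torch.nn.functional as F

def edge_equivariance_loss(seg_net, image, theta):
    """image: (B, 3, H, W); theta: (B, 2, 3) affine matrices."""
    grid = F.affine_grid(theta, image.size(), align_corners=False)
    image_aug = F.grid_sample(image, grid, align_corners=False)

    pred = seg_net(image)          # (B, 1, H, W) foreground probability of original image
    pred_aug = seg_net(image_aug)  # segmentation of the affine-transformed image

    # Apply the same affine transform to the prediction on the original image.
    grid_pred = F.affine_grid(theta, pred.size(), align_corners=False)
    pred_warped = F.grid_sample(pred, grid_pred, align_corners=False)

    # Assumed per-pixel weight matrix w', taken from the transformed-image prediction.
    w_prime = pred_aug.detach().clamp(min=1e-3)
    return (w_prime * (pred_aug - pred_warped).abs()).mean()
```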
The inner/outer edge discriminator module is divided into an inner edge discriminator and an outer edge discriminator. The inner edge discriminator mainly judges whether the object edge lies inside the object, and the outer edge discriminator mainly judges whether the segmented edge contains image background. During training, the annotation mask is dilated to obtain a simulated outer edge and eroded to obtain a simulated inner edge, and the inner/outer edge discriminators judge whether a given edge is an inner or outer edge. A generative-adversarial training scheme assists the generator to produce higher-precision edges, which assists image foreground segmentation and yields higher segmentation accuracy. The discriminators can be multilayer perceptrons.
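The simulated inner and outer edges can be produced by simple morphology on the ground-truth mask, and the discriminator can be a small MLP; a hedged sketch follows (all module names and the pooling-based morphology are illustrative choices, not the patent's exact implementation):

```python
# Hedged sketch: simulate inner/outer edge bands from a ground-truth mask and
# score them with a small MLP discriminator (inner vs. outer).
import torch
import torch.nn as nn
import torch.nn.functional as F

def simulate_edges(mask, k=5):
    """mask: (B, 1, H, W) binary ground-truth mask, float tensor."""
    pad = k // 2
    dilated = F.max_pool2d(mask, k, stride=1, padding=pad)   # morphological dilation
    eroded = -F.max_pool2d(-mask, k, stride=1, padding=pad)  # morphological erosion
    outer_edge = dilated - mask   # band just outside the object (contains background)
    inner_edge = mask - eroded    # band just inside the object
    return inner_edge, outer_edge

class EdgeDiscriminator(nn.Module):
    """MLP that classifies a flattened edge band as inner or outer."""
    def __init__(self, in_dim, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1))

    def forward(self, edge_band):
        return self.net(edge_band.flatten(1))  # raw logit for adversarial training
```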
3) CADA-VAE-based zero-sample target classification method
For CADA-VAE-based zero-sample target classification, a variational autoencoder is used to encode and decode the visual feature domain and the semantic feature domain respectively, converting visual features and semantic features into a common latent-variable feature space; both visual and semantic features can thus be reconstructed with high accuracy, and latent features with strong representation capability are obtained.
In the first stage, the circumscribed rectangle (bounding box) of the object can be obtained at the same time, and visual features are obtained by passing the content of the bounding box through a network layer. The semantic features are semantic word vectors or attribute vectors: semantic word vectors can be obtained from trained NLP models such as BERT, and attribute vectors can also be provided directly by existing data sets.
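For concreteness, a hedged sketch of how such visual and semantic features could be obtained is shown below; the backbone, BERT checkpoint, input size, and pooling choices are assumptions of this example, not specified by the patent:

```python
# Hedged sketch: visual features from the detected-box content via a CNN backbone,
# semantic word vectors for class names via a BERT-style encoder.
import torch
import torchvision
from transformers import AutoTokenizer, AutoModel

backbone = torchvision.models.resnet50(weights="DEFAULT")
backbone.fc = torch.nn.Identity()   # expose the 2048-d pooled feature
backbone.eval()

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
bert = AutoModel.from_pretrained("bert-base-uncased").eval()

@torch.no_grad()
def visual_feature(image, box):
    """image: (3, H, W) tensor in [0, 1]; box: (x1, y1, x2, y2)."""
    x1, y1, x2, y2 = [int(v) for v in box]
    crop = image[:, y1:y2, x1:x2].unsqueeze(0)
    crop = torch.nn.functional.interpolate(crop, size=(224, 224), mode="bilinear")
    return backbone(crop).squeeze(0)        # (2048,) visual feature

@torch.no_grad()
def semantic_feature(class_name):
    tokens = tokenizer(class_name, return_tensors="pt")
    return bert(**tokens).last_hidden_state.mean(dim=1).squeeze(0)   # (768,) word vector
```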
Then, to guarantee cross-domain alignment of the visual and semantic feature domains, cross-domain encode-decode supervision is added to reduce the domain distance between them, so that an unknown class can be linked to visual features through its semantic features with high accuracy; a classifier is then trained on the latent features converted from unknown-class semantic features. Given an encoder E and a decoder D, the cross-alignment loss is:
$\mathcal{L}_{CA} = \sum_{i}\sum_{j \neq i} \big| x_j - D_j\big(E_i(x_i)\big) \big|$
where x represents the visual or semantic features of the input and i, j represent different domains.
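A minimal sketch of this cross-alignment term for two domains (visual and semantic), assuming each encoder returns the latent mean and log-variance as in CADA-VAE, could look as follows:

```python
# Hedged sketch of the cross-alignment loss across feature domains.
import torch

def cross_alignment_loss(enc, dec, x):
    """enc/dec: dicts of per-domain encoder/decoder modules; x: dict of feature batches.
    Each encoder is assumed to return (mu, logvar); the latent mean is cross-decoded."""
    domains = list(x.keys())   # e.g. ["visual", "semantic"]
    loss = 0.0
    for i in domains:
        mu_i, _logvar_i = enc[i](x[i])
        for j in domains:
            if j == i:
                continue
            # Reconstruct domain j's features from domain i's latent code (L1 error).
            loss = loss + (x[j] - dec[j](mu_i)).abs().sum(dim=1).mean()
    return loss
```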
4) Zero-sample object classification assisted by DeepInversion reverse generation of visual features
DeepInversion uses an open-source model trained on ImageNet to reversely generate a visual feature map. Adopting the idea of knowledge distillation, it adds supervision on the running mean and running variance of each Batch Norm (BN) layer of the trained model, together with supervision on the two-norm and variance of the generated image (i.e., the visual feature map reversely generated from the open-source model), which increases the realism of the generated image. Let $l$ index the layers of the network and let $\mu_l$ and $\sigma_l^2$ denote the mean and variance of layer $l$; the BN regularization is:

$R_{\mathrm{feature}}(\hat{x}) = \sum_l \big\lVert \mu_l(\hat{x}) - \mathbb{E}\big[\mu_l(x)\mid X\big] \big\rVert_2 + \sum_l \big\lVert \sigma_l^2(\hat{x}) - \mathbb{E}\big[\sigma_l^2(x)\mid X\big] \big\rVert_2$

where $\mathbb{E}$ denotes expectation, $X$ denotes the data distribution, $x$ denotes an image before synthesis, and $\hat{x}$ denotes the synthesized image.
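A hedged sketch of this BN-statistics term is shown below (the hook mechanics and layer selection are assumptions of this example); it matches the batch statistics of the synthesized images against each BN layer's stored running statistics:

```python
# Hedged sketch of the DeepInversion-style BN feature regularizer.
import torch
import torch.nn as nn

def bn_feature_loss(model, synthesized):
    losses, handles = [], []

    def hook(module, inputs, output):
        feat = inputs[0]
        mean = feat.mean(dim=(0, 2, 3))
        var = feat.var(dim=(0, 2, 3), unbiased=False)
        losses.append(torch.norm(mean - module.running_mean, 2)
                      + torch.norm(var - module.running_var, 2))

    for m in model.modules():
        if isinstance(m, nn.BatchNorm2d):
            handles.append(m.register_forward_hook(hook))

    model(synthesized)          # forward pass populates `losses` via the hooks
    for h in handles:
        h.remove()
    return sum(losses)
```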
DeepInversion also incorporates a teacher network and a student network, i.e., knowledge distillation, and supervises a KL-divergence loss on the obtained features to increase the diversity of the generated images. The visual feature map reversely generated by DeepInversion is then added as a visual feature to the CADA-VAE zero-sample target classification method, aligning the semantic features of unknown classes with visual features and reducing the domain distance between the visual features and semantic features of unknown classes, thereby improving classification accuracy.
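As a hedged illustration of the teacher-student term, a disagreement loss on synthesized images can be written as follows; the exact divergence and sign convention are assumptions of this example:

```python
# Hedged sketch: teacher-student disagreement term used when synthesizing images,
# expressed as a negative KL divergence so that minimizing it encourages diversity.
import torch.nn.functional as F

def diversity_loss(teacher, student, synthesized):
    t_prob = F.softmax(teacher(synthesized), dim=1)
    s_logprob = F.log_softmax(student(synthesized), dim=1)
    kl = F.kl_div(s_logprob, t_prob, reduction="batchmean")
    return -kl   # maximize teacher-student disagreement on the synthesized batch
```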
The above examples are only intended to illustrate the technical solution of the present invention, not to limit it; although the present invention has been described in detail with reference to the foregoing embodiments, those of ordinary skill in the art should understand that the technical solutions described in the foregoing embodiments may still be modified, or some or all of the technical features may be equivalently replaced, and such modifications or substitutions do not depart in essence from the scope of the embodiments of the present invention.

Claims (10)

1. A two-stage zero sample image semantic segmentation method is characterized by comprising the following steps:
s1, based on two-stage classification irrelevant foreground and background image segmentation of Mask-RCNN, based on two-stage image segmentation frame of Mask-RCNN, the classification branch of the second stage is changed into only distinguishing two types of the foreground and the background, after the image passes through RPN, the image is sent to the second stage to be classified into the foreground and the background, fine adjustment of a detection frame and segmentation of the foreground, and after the image passes through the Mask-RCNN, a foreground detection frame and a foreground Mask of an object irrelevant to the classification are obtained;
s2, performing zero sample target classification based on CADA-VAE, firstly, respectively performing automatic coding and decoding of a visual characteristic domain and a semantic characteristic domain, converting the visual characteristic and the semantic characteristic into a common hidden variable characteristic space, then reducing the domain distance between the visual characteristic domain and the semantic characteristic domain by adding cross-domain coding and decoding supervision, then training a classifier based on the hidden variable characteristic converted by the unknown semantic characteristic, and giving a coder E and a decoder D, wherein the loss of cross alignment is as follows:
$\mathcal{L}_{CA} = \sum_{i}\sum_{j \neq i} \big| x_j - D_j\big(E_i(x_i)\big) \big|$
where x represents the visual or semantic features of the input and i, j represent different domains.
2. The method according to claim 1, wherein an edge self-supervision module and an inner/outer edge discriminator module are added to the Mask R-CNN image segmentation branch in step S1.
3. The two-stage zero-sample image semantic segmentation method as claimed in claim 2, wherein the edge self-supervision module is embodied as an equivariance constraint, i.e., the input image is affine-transformed and sent into the foreground-background classification network to obtain an image segmentation result that is the same as the result obtained by applying the same affine transformation to the segmentation of the original input image; given the foreground-background classification network $F_\theta$, the affine transformation matrix $A$, and the prediction $F_\theta(x)$, the edge self-supervision loss is defined as:

$\mathcal{L}_{e} = \big\lVert w' \odot \big(F_\theta(Ax) - A\,F_\theta(x)\big) \big\rVert_1$

wherein $x$ denotes the input picture to be segmented and $w'$ denotes the weight matrix of $F_\theta(Ax)$.
4. The two-stage zero-sample image semantic segmentation method as claimed in claim 2, wherein the inner/outer edge discriminator module is divided into an inner edge discriminator and an outer edge discriminator; the inner edge discriminator judges whether the object edge lies inside the object, and the outer edge discriminator judges whether the segmented edge contains image background; during training, the annotation mask is dilated to obtain a simulated outer edge and eroded to obtain a simulated inner edge; the inner/outer edge discriminators judge whether a given edge is an inner or outer edge, and a generation-discrimination adversarial training scheme assists the generator to produce higher-precision edges.
5. The method as claimed in claim 4, wherein the discriminator employs a multi-layer perceptron.
6. The two-stage zero-sample image semantic segmentation method as claimed in claim 1, wherein in step S2, DeepInversion-generated visual features are adopted to assist zero-sample object classification; the visual feature map reversely generated by DeepInversion from a trained model is added to the CADA-VAE zero-sample object classification method as the visual feature.
7. The two-stage zero-sample image semantic segmentation method as claimed in claim 6, wherein DeepInversion incorporates a teacher network and a student network, i.e., knowledge distillation, to supervise a KL-divergence loss on the obtained features.
8. The method as claimed in claim 6, wherein supervision of the running mean and running variance of the BN layers in the trained model is added, together with supervision of the generated image, i.e., the two-norm and variance of the visual feature map reversely generated from the open-source model; let $l$ index the layers of the network and let $\mu_l$ and $\sigma_l^2$ denote the mean and variance respectively; the BN regularization is:

$R_{\mathrm{feature}}(\hat{x}) = \sum_l \big\lVert \mu_l(\hat{x}) - \mathbb{E}\big[\mu_l(x)\mid X\big] \big\rVert_2 + \sum_l \big\lVert \sigma_l^2(\hat{x}) - \mathbb{E}\big[\sigma_l^2(x)\mid X\big] \big\rVert_2$

wherein $\mathbb{E}$ denotes expectation, $X$ denotes the data distribution, $x$ denotes an image before synthesis, and $\hat{x}$ denotes the synthesized image.
9. The two-stage zero-sample image semantic segmentation method as claimed in claim 1, wherein the first stage simultaneously obtains a circumscribed rectangle frame of the object, and the content of the circumscribed rectangle frame passes through a network layer to obtain the visual features.
10. The two-stage zero-sample image semantic segmentation method as claimed in claim 1, wherein the semantic features are semantic word vectors or attribute vectors, the semantic word vectors are obtained by training of an NLP model, and the attribute vectors are obtained by an existing data set.
CN202110093474.5A 2021-01-22 2021-01-22 Two-stage zero sample image semantic segmentation method Active CN112801105B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110093474.5A CN112801105B (en) 2021-01-22 2021-01-22 Two-stage zero sample image semantic segmentation method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110093474.5A CN112801105B (en) 2021-01-22 2021-01-22 Two-stage zero sample image semantic segmentation method

Publications (2)

Publication Number Publication Date
CN112801105A CN112801105A (en) 2021-05-14
CN112801105B (en) 2022-07-08

Family

ID=75811523

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110093474.5A Active CN112801105B (en) 2021-01-22 2021-01-22 Two-stage zero sample image semantic segmentation method

Country Status (1)

Country Link
CN (1) CN112801105B (en)

Families Citing this family (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113177612B (en) * 2021-05-24 2022-09-13 同济大学 Agricultural pest image identification method based on CNN few samples
CN113255829B (en) * 2021-06-17 2021-12-07 中国科学院自动化研究所 Zero sample image target detection method and device based on deep learning
CN113610173B (en) * 2021-08-13 2022-10-04 天津大学 Knowledge distillation-based multi-span domain few-sample classification method
CN114580425B (en) * 2022-05-06 2022-09-09 阿里巴巴(中国)有限公司 Named entity recognition method and device, electronic equipment and storage medium
CN117036790B (en) * 2023-07-25 2024-03-22 中国科学院空天信息创新研究院 Instance segmentation multi-classification method under small sample condition
CN116977796B (en) * 2023-09-25 2024-02-23 中国科学技术大学 Zero sample image recognition method, system, equipment and storage medium
CN117541882B (en) * 2024-01-05 2024-04-19 南京信息工程大学 Instance-based multi-view vision fusion transduction type zero sample classification method

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109993197B (en) * 2018-12-07 2023-04-28 天津大学 Zero sample multi-label classification method based on depth end-to-end example differentiation
CN112017182B (en) * 2020-10-22 2021-01-19 北京中鼎高科自动化技术有限公司 Industrial-grade intelligent surface defect detection method
CN112115951B (en) * 2020-11-19 2021-03-09 之江实验室 RGB-D image semantic segmentation method based on spatial relationship

Also Published As

Publication number Publication date
CN112801105A (en) 2021-05-14

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant