CN116597151B - Unsupervised semantic segmentation method based on fine-grained feature grouping - Google Patents

Unsupervised semantic segmentation method based on fine-grained feature grouping

Info

Publication number
CN116597151B
CN116597151B (application number CN202310871120.8A)
Authority
CN
China
Prior art keywords
segmentation
feature
image
unsupervised
semantic
Prior art date
Legal status
Active
Application number
CN202310871120.8A
Other languages
Chinese (zh)
Other versions
CN116597151A (en)
Inventor
于潇丹
Current Assignee
Nanjing Yaxin Software Co., Ltd.
Original Assignee
Nanjing Yaxin Software Co., Ltd.
Priority date
Filing date
Publication date
Application filed by Nanjing Yaxin Software Co., Ltd.
Priority to CN202310871120.8A
Publication of CN116597151A
Application granted
Publication of CN116597151B
Legal status: Active


Classifications

    • G06V 10/26: Segmentation of patterns in the image field; cutting or merging of image elements to establish the pattern region, e.g. clustering-based techniques; detection of occlusion
    • G06N 3/0455: Auto-encoder networks; encoder-decoder networks
    • G06N 3/0464: Convolutional networks [CNN, ConvNet]
    • G06N 3/047: Probabilistic or stochastic networks
    • G06N 3/048: Activation functions
    • G06N 3/088: Non-supervised learning, e.g. competitive learning
    • G06V 10/52: Scale-space analysis, e.g. wavelet analysis
    • G06V 10/763: Non-hierarchical clustering techniques, e.g. based on statistics of modelling distributions
    • G06V 10/764: Classification, e.g. of video objects, using pattern recognition or machine learning
    • G06V 10/771: Feature selection, e.g. selecting representative features from a multi-dimensional feature space
    • G06V 10/806: Fusion of extracted features at the sensor, preprocessing, feature-extraction or classification level
    • G06V 10/82: Image or video recognition or understanding using neural networks
    • G06V 20/70: Labelling scene content, e.g. deriving syntactic or semantic representations
    • Y02T 10/40: Engine management systems (climate change mitigation technologies related to transportation)

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • Software Systems (AREA)
  • General Health & Medical Sciences (AREA)
  • Computing Systems (AREA)
  • Artificial Intelligence (AREA)
  • Health & Medical Sciences (AREA)
  • Multimedia (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Medical Informatics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Databases & Information Systems (AREA)
  • Molecular Biology (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Probability & Statistics with Applications (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses an unsupervised semantic segmentation method based on fine-grained feature grouping, which uses a capsule network to screen the features extracted by a convolutional neural network, retaining important information, reducing interference from irrelevant information, and improving the overall segmentation effect. In addition, superpixel segmentation is used to pre-segment the image, and the boundaries of the pre-segmented regions guide the fine-grained grouping of subsequent pixel features, alleviating poor segmentation where boundaries are blurred. According to the invention, the subsequent high-level semantic information is grouped into fine-grained features along the segmentation boundaries and used as the input of a Capsule layer, which reduces the number of capsules; center differencing within each superpixel block highlights the detail of segmentation edges inside the block; and fusing the difference map with the multi-scale feature map as the joint input of the capsule layer improves the network's ability to express segmentation boundaries.

Description

Unsupervised semantic segmentation method based on fine-grained feature grouping
Technical Field
The invention belongs to the technical field of computer vision, and particularly relates to an unsupervised semantic segmentation method based on fine-grained feature grouping.
Background
Image semantic segmentation is an important task in the field of computer vision. It aims to classify every pixel in an image so that regions of interest are highlighted, and is widely applied in scenarios such as face recognition, license plate recognition, satellite image analysis, autonomous driving, human-computer interaction, and video processing. Most conventional semantic segmentation methods are supervised: target regions must be labeled at the pixel level in advance, and the large amount of training data required often costs considerable manpower and time. Moreover, because the label set is fixed, the learned model is limited to the few labeled categories and cannot generalize to unknown ones. Unsupervised semantic segmentation addresses this problem well, achieving end-to-end semantic segmentation without data annotation. However, its segmentation quality still falls far short of supervised methods, so improving the accuracy of unsupervised semantic segmentation has become an important research direction.
The main approaches of existing published unsupervised semantic segmentation methods are as follows:
A Chinese patent application with publication number CN 202110600887 (hereinafter referred to as patent 1) provides an unsupervised semantic segmentation method and system for large-scale data. First, a plurality of images to be segmented are acquired; the acquired images are input into a segmentation network model to obtain semantic segmentation results. The segmentation network model is trained in an unsupervised manner as follows: representation learning based on a pixel attention mechanism is performed on the training images to obtain image representations; the representations are clustered to obtain pseudo labels; and the segmentation network model is trained with these pseudo labels. Through the pixel attention and pixel alignment mechanisms, foreground saliency information generated by an unsupervised method supervises the learning of the pixel attention mechanism, improving the efficiency and precision of semantic segmentation.
The paper "PiCIE: Unsupervised Semantic Segmentation using Invariance and Equivariance in Clustering" (Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2021, pp. 16794-16804; hereinafter referred to as paper 1) proposes an unsupervised semantic segmentation method that incorporates photometric invariance and geometric equivariance into a deep convolutional framework to learn high-level semantic concepts without extra hyper-parameters or pre-processing. The algorithm flow is as follows: photometric and geometric transformations are applied to the input data set and a convolutional neural network extracts a feature vector for each pixel; the feature vectors are then clustered with k-means to obtain K cluster centers and a label for each pixel. After all training images are clustered, every pixel has a corresponding pseudo label, and these pseudo labels finally supervise training until a stable result is obtained.
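For concreteness, the following is a minimal Python (PyTorch) sketch of the invariance and equivariance constraints just described; the helper callables (feature_net, photometric_jitter, geometric_warp) are hypothetical placeholders, not paper 1's actual implementation.

```python
import torch
import torch.nn.functional as F

def invariance_equivariance_loss(feature_net, image, photometric_jitter, geometric_warp):
    """Two photometric views of a pixel should give the same feature (invariance);
    warping then embedding should equal embedding then warping (equivariance)."""
    v1 = feature_net(photometric_jitter(image))      # view 1: color-jittered input
    v2 = feature_net(photometric_jitter(image))      # view 2: independent jitter
    invariance = F.mse_loss(v1, v2)                  # same position -> same feature

    warped_then_embedded = feature_net(geometric_warp(image))
    embedded_then_warped = geometric_warp(feature_net(image))
    equivariance = F.mse_loss(warped_then_embedded, embedded_then_warped)
    return invariance + equivariance
```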
Unsupervised semantic segmentation is a pixel-level classification problem. Conventional methods depend heavily on the semantic consistency of the segmentation target, and the feature clustering they adopt suits single-label, object-centric images; on multi-label, scene-centric images, or images where the target object is small, the segmentation effect is poor. Meanwhile, the classification network and the clustering algorithm iterate simultaneously throughout the unsupervised training process, and without effective constraints and training strategies the training effect is hard to guarantee.
Among common unsupervised semantic segmentation methods, end-to-end approaches usually learn the clustering function by enforcing consistency between the cluster assignments of pixels across augmented views of an image. However, these methods tend to latch onto low-level image cues such as color or texture, and the clustering depends strongly on the network initialization, both of which make training difficult. Another, bottom-up, line of work exploits low-level or mid-level visual priors such as edge detection or saliency estimation to find image regions that likely share the same semantics, and uses those regions to learn pixel embeddings that capture semantic information. The image regions act as a regularizer, removing the dependence of segmentation on network initialization; the pixel embeddings are then clustered, e.g. with k-means, to obtain the segmentation. Although bottom-up methods achieve better results, they have drawbacks: the reliance on hand-crafted priors (e.g. edges or saliency) to group pixels limits their applicability. Saliency estimation, for instance, only works for object-centric images, and some works additionally require markers to identify suitable image regions.
Patent 1 performs pixel-level representation learning and clustering of images with a pixel attention mechanism and SwAV, and uses the resulting pseudo labels of all pixels to guide training of the segmentation model (a modified DeepLabv3+). Although this is more fine-grained than learning image-level representations, in end-to-end learning pixel-level clustering easily focuses directly on low-level image features (such as color and contrast) while ignoring higher-level semantic information.
In paper 1, the image is transformed based on photometric invariance and geometric equivariance. Photometric invariance means that when the illumination of an image fluctuates slightly, pixels at the same position should still receive the same label; that is, the feature representations obtained after two different photometric transformations of a pixel should be identical. Geometric equivariance means that when a picture is enlarged or reduced, the labeling result should be the correspondingly enlarged or reduced version of the original result. By constraining invariance and equivariance, and using a convolutional neural network for feature representation, the overall segmentation effect is improved. The scheme uses ResNet-18 for feature representation; because convolutional neural networks are insensitive to small changes in illumination, position and the like, such photometric and geometric transformations must be constructed in advance to constrain the clustering. In addition, as the network deepens, the meaning of the layer-by-layer features extracted by the convolutional neural network becomes unclear, which is unfavorable for subsequent clustering.
Disclosure of Invention
In order to solve the above problems, the invention discloses an unsupervised semantic segmentation method based on fine-grained feature grouping, which uses a capsule network to screen the features extracted by a convolutional neural network, retaining important information, reducing interference from irrelevant information, and improving the overall segmentation effect. In addition, superpixel segmentation pre-segments the picture, and the boundaries of the pre-segmented regions guide the fine-grained grouping of subsequent pixel features, alleviating poor segmentation where boundaries are blurred.
The "capsules" in the capsule network represent various features of a particular entity in the image, such as position, size, orientation, speed, hue, texture, etc., as a single logical unit. Then, using a protocol routing algorithm, when a capsule passes its own learned and predicted data to a higher level capsule, if the predictions agree, the higher level capsule becomes active, a process called dynamic routing. With the continued iteration of the routing mechanism, various capsules can be trained into units that learn different ideas. Meanwhile, the capsule network requires the model to learn feature variables in the capsule, and the valuable information is reserved to the maximum extent, so that the obtained pixel features are more easily clustered compared with CNN by using the model as a network for representing the image features.
In order to achieve the above purpose, the technical scheme of the invention is as follows:
An unsupervised semantic segmentation method based on fine-grained feature grouping comprises the following steps:
Step 1, obtaining an image to be segmented;
Step 2, inputting the image into an unsupervised semantic segmentation model to obtain a semantic segmentation result, specifically comprising the following steps:
Step 2-1, extracting feature maps, comprising:
the Encoder module in the unsupervised semantic segmentation model performs multi-scale feature extraction on the image with a DCNN with ASPP to obtain a multi-scale feature map;
a 7x7 convolution is applied to the original image to obtain a low-level semantic feature map;
Step 2-2, using superpixel segmentation as a prior, performing fine-grained feature grouping and fusion on the feature maps with an FFG-Capsule module, specifically:
superpixel segmentation is performed on the original input image with the SLIC method to obtain segmentation boundaries based on low-level semantic information (see the sketch following these steps);
taking the segmentation boundaries as a prior, the low-level semantic feature map is grouped into fine-grained features according to the superpixel block boundaries, each feature block forming one group; center differencing is performed within each feature block to obtain a center difference map;
the multi-scale feature map is likewise grouped according to the superpixel block boundaries;
the grouped multi-scale feature map and the center difference map are input together into a Capsule layer for feature screening to obtain a semantic feature map;
Step 2-3, sending the semantic feature map output by the FFG-Capsule module to a Decoder module to obtain the segmentation result.
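As referenced in step 2-2, the following is a minimal sketch of the superpixel pre-segmentation using the scikit-image SLIC implementation; the file name is a placeholder, and n_segments=256 follows the preset number given in the embodiment below.

```python
from skimage.io import imread
from skimage.segmentation import slic, find_boundaries

image = imread("input.jpg")   # placeholder path; any H x W x 3 RGB image
segments = slic(image, n_segments=256, compactness=10, start_label=0)
boundaries = find_boundaries(segments, mode="thick")  # boundary prior for grouping
print(segments.shape, int(segments.max()) + 1, "superpixels")
```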
Furthermore, the unsupervised semantic segmentation model is trained in an unsupervised manner.
Further, during unsupervised training, pixel-level features of a plurality of images are clustered with k-means based on the feature map output by the FFG-Capsule module to obtain a cluster map for each image, which serves as the network's pseudo label and participates in training the unsupervised semantic segmentation model.
Further, the model training process further comprises an Auxiliary Decoder module: the capsule vector with the largest norm among the Capsule vectors output by the FFG-Capsule module is input into the Auxiliary Decoder module, the image is reconstructed through two fully connected layers, and training of the network is supervised based on the difference between the reconstructed image and the original image.
Further, the Decoder module includes a 3x3 convolutional layer and a softmax module.
The beneficial effects of the invention are as follows:
according to the invention, the ultra-pixel segmentation is carried out on the original image by using the slec method, so that a segmentation boundary based on low-level semantic information is obtained, and the subsequent high-level semantic information is subjected to fine-granularity feature grouping according to the segmentation boundary and is used as the input of a Capsule layer, so that the number of capsules can be effectively reduced.
In the prior art, ResNet-50 downsamples image features with max pooling, which may lose part of the image information. The SLIC algorithm segments the image according to shallow semantic information, so over-segmentation can occur; and as the network structure deepens, the resulting high-level semantic information lacks clear supervision during clustering, so the segmentation effect may be poor. To solve this problem, the invention takes the SLIC segmentation result based on low-level semantic information as a prior, fuses it into the whole network, and performs center differencing within each superpixel block to highlight the detail of segmentation edges inside the block. The difference map is fused with the multi-scale feature map as the input of the capsule layer; the capsule layer selects representative multi-scale features and merges the segmentation boundaries, improving the network's expression of segmentation boundaries and the segmentation of region boundaries.
According to the invention, capsuleNet decoder is used as an auxiliary network, so that the training of the segmentation network is supervised, and the model segmentation effect is improved.
Drawings
Fig. 1 is a schematic diagram of the model architecture of the unsupervised semantic segmentation method based on fine-grained feature grouping provided by the invention.
Description of the embodiments
The technical scheme provided by the present invention will be described in detail with reference to the following specific examples, and it should be understood that the following specific examples are only for illustrating the present invention and are not intended to limit the scope of the present invention.
The invention provides an unsupervised semantic segmentation method based on fine-grained feature grouping; the overall implementation architecture is shown in Fig. 1, and the method comprises the following steps:
and step 1, acquiring an image to be segmented.
And 2, inputting the image into an unsupervised semantic segmentation model to obtain a semantic segmentation result.
The overall framework of the unsupervised semantic segmentation model is based on DeepLabv3+, with modified Encoder and Decoder modules and an added auxiliary network. The model is trained in an unsupervised manner.
The network structure of the unsupervised semantic segmentation model is shown in Fig. 1 and consists of an Encoder, a Decoder and an Auxiliary Decoder. The Encoder performs multi-scale feature extraction on the image using the DCNN with ASPP (Atrous Spatial Pyramid Pooling) from DeepLabv3+ to obtain a multi-scale feature map; the backbone network is a ResNet-50. For the multi-scale feature map, the original DeepLabv3+ simply concatenates the branches and applies a 1x1 convolution for multi-scale fusion. To reduce background interference, make the obtained pixel features more representative, and improve the accuracy of subsequent clustering, the invention instead fuses and screens the features during multi-scale fusion with an FFG-Capsule module (Fine-grained Feature Grouping Capsule module, abbreviated FFG-Capsule).
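The following is a minimal PyTorch sketch of an ASPP block of the kind the Encoder borrows from DeepLabv3+; the channel sizes and dilation rates are the common DeepLabv3+ defaults, assumed here rather than specified by the patent.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ASPP(nn.Module):
    """Atrous Spatial Pyramid Pooling: parallel dilated convolutions plus
    global pooling, concatenated and projected to one multi-scale feature map."""
    def __init__(self, in_ch=2048, out_ch=256, rates=(6, 12, 18)):
        super().__init__()
        self.branches = nn.ModuleList(
            [nn.Conv2d(in_ch, out_ch, 1, bias=False)]
            + [nn.Conv2d(in_ch, out_ch, 3, padding=r, dilation=r, bias=False) for r in rates]
        )
        self.pool = nn.Sequential(nn.AdaptiveAvgPool2d(1), nn.Conv2d(in_ch, out_ch, 1, bias=False))
        self.project = nn.Conv2d(out_ch * (len(rates) + 2), out_ch, 1, bias=False)

    def forward(self, x):
        feats = [branch(x) for branch in self.branches]
        pooled = F.interpolate(self.pool(x), size=x.shape[2:], mode="bilinear", align_corners=False)
        return self.project(torch.cat(feats + [pooled], dim=1))
```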
The input of the FFG-Capsule module consists of two parts: one is low-level semantic information, i.e. the feature map U1 obtained by the 7x7 convolution of the original image (the first convolutional layer of ResNet-50); the other is the feature map output by the ASPP module. The second layer of the original ResNet-50 would be a max-pooling layer, which captures global information and reduces feature dimensionality but also loses image information, in particular blurring the edge information needed for semantic segmentation. To improve segmentation at region boundaries, this patent fuses the processed feature map U1 with the feature map output by the ASPP module. U1 is processed as follows: first, superpixel segmentation is performed on the original input image with the SLIC (simple linear iterative clustering) algorithm to obtain segmentation blocks and boundaries, which are taken as a prior; segmentation boundaries derived from low-level semantics typically contain the final segmentation boundaries. After SLIC segmentation, the image is divided into superpixel blocks (a preset number of 256 segments) and segmentation boundaries based on low-level semantic information are obtained. The feature map U1 is then grouped into fine-grained features according to the superpixel blocks, each feature block forming one group, and a center difference (similar to CDC) is computed within each feature block to enhance boundary information. Finally, the multi-scale feature map from ASPP is grouped by the same superpixel boundaries, and together with the center difference map is fed to a Capsule layer for feature screening, yielding a more representative semantic feature map.
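The sketch below shows one plausible reading of the fine-grained grouping and center differencing just described: pixel features are grouped by SLIC superpixel id, and each pixel's feature is differenced against the mean (center) of its group. The exact differencing formula is not given in the patent, so this is an assumption.

```python
import torch

def center_difference(features, segments):
    """features: (C, H, W) float feature map; segments: (H, W) long superpixel ids.
    Returns each pixel's feature minus the mean feature (center) of its superpixel."""
    c, h, w = features.shape
    flat = features.reshape(c, -1)                                    # (C, H*W)
    seg = segments.reshape(-1)                                        # (H*W,)
    n = int(seg.max()) + 1
    sums = torch.zeros(c, n, device=flat.device).index_add_(1, seg, flat)
    counts = torch.zeros(n, device=flat.device, dtype=flat.dtype).index_add_(
        0, seg, torch.ones_like(seg, dtype=flat.dtype))
    centers = sums / counts.clamp(min=1)                              # per-superpixel mean
    return (flat - centers[:, seg]).reshape(c, h, w)                  # center difference map
```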
The feature map output by the FFG-Capsule module is sent to the Decoder module, where a 3x3 convolution and a softmax activation function produce a segmentation map of the same size as the input image. During unsupervised training, pixel-level features of a plurality of images are obtained and clustered with k-means to give a cluster map for each image, which participates in training as the network's pseudo label.
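A minimal sketch of this pseudo-label step, using scikit-learn's k-means; the cluster count k is an assumption, as the patent does not fix it.

```python
import numpy as np
from sklearn.cluster import KMeans

def make_pseudo_labels(feature_maps, k=20):   # k is an assumed cluster count
    """feature_maps: list of (C, H, W) numpy arrays output by the FFG-Capsule module.
    Clusters all pixel features jointly; returns one (H, W) pseudo-label map per image."""
    pixels = np.concatenate([f.reshape(f.shape[0], -1).T for f in feature_maps])  # (N, C)
    labels = KMeans(n_clusters=k, n_init=10).fit(pixels).labels_
    maps, start = [], 0
    for f in feature_maps:
        n = f.shape[1] * f.shape[2]
        maps.append(labels[start:start + n].reshape(f.shape[1], f.shape[2]))
        start += n
    return maps
```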
In addition, to improve the segmentation effect, this patent adds an Auxiliary Decoder module after the Encoder, similar to the decoder in CapsNet. The capsule vector with the largest norm among the Capsule vectors output by the FFG-Capsule module is taken as input, the image is reconstructed through two fully connected layers, and the difference between the reconstruction and the original image supervises the training of the network. This module participates only in training, not in inference.
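A minimal PyTorch sketch of the Auxiliary Decoder described above: the max-norm capsule vector is reconstructed into an image through two fully connected layers. The hidden width and output resolution are assumptions.

```python
import torch
import torch.nn as nn

class AuxiliaryDecoder(nn.Module):
    """Reconstructs the input image from the max-norm capsule vector through
    two fully connected layers; used only during training."""
    def __init__(self, capsule_dim=16, hidden=512, out_hw=(64, 64)):
        super().__init__()
        self.out_hw = out_hw
        self.fc = nn.Sequential(
            nn.Linear(capsule_dim, hidden), nn.ReLU(inplace=True),
            nn.Linear(hidden, 3 * out_hw[0] * out_hw[1]), nn.Sigmoid(),
        )

    def forward(self, capsules):
        """capsules: (batch, n_capsules, capsule_dim)."""
        idx = capsules.norm(dim=-1).argmax(dim=1)                  # max-norm capsule per sample
        best = capsules[torch.arange(capsules.size(0)), idx]       # (batch, capsule_dim)
        return self.fc(best).view(-1, 3, *self.out_hw)             # reconstructed image
```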
The invention uses two loss functions in model training: for the segmentation network, the cross-entropy loss between the segmentation map and the pseudo labels; for the auxiliary network, the Euclidean distance between the original image and the reconstructed image.
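A minimal sketch of the two losses combined; the relative weighting of the two terms is an assumption, as the patent does not specify it.

```python
import torch
import torch.nn.functional as F

def total_loss(seg_logits, pseudo_labels, recon, original, recon_weight=1.0):
    """seg_logits: (B, K, H, W); pseudo_labels: (B, H, W) long k-means labels;
    recon / original: (B, 3, h, w) images for the auxiliary branch."""
    seg_loss = F.cross_entropy(seg_logits, pseudo_labels)                   # segmentation network
    recon_loss = ((recon - original) ** 2).flatten(1).sum(1).sqrt().mean()  # Euclidean distance
    return seg_loss + recon_weight * recon_loss
```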
It should be noted that the foregoing merely illustrates the technical idea of the present invention and is not intended to limit the scope of the present invention, and that a person skilled in the art may make several improvements and modifications without departing from the principles of the present invention, which fall within the scope of the claims of the present invention.

Claims (5)

1. An unsupervised semantic segmentation method based on fine-grained feature grouping, characterized by comprising the following steps:
step 1, obtaining an image to be segmented;
step 2, inputting the image into an unsupervised semantic segmentation model to obtain a semantic segmentation result, specifically comprising the following steps:
step 2-1, extracting feature maps, comprising:
an Encoder module in the unsupervised semantic segmentation model performs multi-scale feature extraction on the image with a DCNN with ASPP to obtain a multi-scale feature map;
a 7x7 convolution is applied to the original image to obtain a low-level semantic feature map;
step 2-2, using superpixel segmentation as a prior, performing fine-grained feature grouping and fusion on the feature maps with an FFG-Capsule module, specifically:
superpixel segmentation is performed on the original input image with the SLIC method to obtain segmentation boundaries based on low-level semantic information;
taking the segmentation boundaries as a prior, the low-level semantic feature map is grouped into fine-grained features according to the superpixel block boundaries, each feature block forming one group; center differencing is performed within each feature block to obtain a center difference map;
the multi-scale feature map is grouped according to the superpixel block boundaries;
the grouped multi-scale feature map and the center difference map are input together into a Capsule layer for feature screening to obtain a semantic feature map;
step 2-3, sending the semantic feature map output by the FFG-Capsule module to a Decoder module to obtain the segmentation result.
2. The unsupervised semantic segmentation method based on fine-grained feature grouping according to claim 1, characterized in that the unsupervised semantic segmentation model is trained in an unsupervised manner.
3. The unsupervised semantic segmentation method based on fine-grained feature grouping according to claim 2, characterized in that during unsupervised training, pixel-level features of a plurality of images are clustered with k-means based on the feature map output by the FFG-Capsule module to obtain a cluster map for each image, and the cluster map serves as the network's pseudo label and participates in training the unsupervised semantic segmentation model.
4. The unsupervised semantic segmentation method based on fine-grained feature grouping according to claim 1 or 2, characterized in that the model training process further comprises an Auxiliary Decoder module, wherein the capsule vector with the largest norm among the Capsule vectors output by the FFG-Capsule module is input into the Auxiliary Decoder module, the image is reconstructed through two fully connected layers, and training of the network is supervised based on the difference between the reconstructed image and the original image.
5. The unsupervised semantic segmentation method based on fine-grained feature grouping according to claim 1, characterized in that the Decoder module comprises a 3x3 convolutional layer and a softmax module.
CN202310871120.8A 2023-07-17 2023-07-17 Unsupervised semantic segmentation method based on fine-grained feature grouping Active CN116597151B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310871120.8A 2023-07-17 2023-07-17 Unsupervised semantic segmentation method based on fine-grained feature grouping (granted as CN116597151B)

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310871120.8A 2023-07-17 2023-07-17 Unsupervised semantic segmentation method based on fine-grained feature grouping (granted as CN116597151B)

Publications (2)

Publication Number Publication Date
CN116597151A (en) 2023-08-15
CN116597151B (en) 2023-09-26

Family

ID=87611990

Family Applications (1)

Application Number Priority Date Filing Date Title
CN202310871120.8A (granted as CN116597151B, Active) 2023-07-17 2023-07-17 Unsupervised semantic segmentation method based on fine-grained feature grouping

Country Status (1)

Country Link
CN (1) CN116597151B (en)

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112488132A * 2020-12-18 2021-03-12 Guizhou University Fine-grained image classification method based on semantic feature enhancement
CN113111916A * 2021-03-15 2021-07-13 Institute of Computing Technology, Chinese Academy of Sciences Medical image semantic segmentation method and system based on weak supervision
CN113160246A * 2021-04-14 2021-07-23 Institute of Optics and Electronics, Chinese Academy of Sciences Image semantic segmentation method based on deep supervision
CN115482387A * 2022-09-28 2022-12-16 Shandong Juxiang Machinery Co., Ltd. Weakly supervised image semantic segmentation method and system based on multi-scale class prototypes


Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
"A Coarse-to-fine Capsule Network for Fine-grained Image Categorization";zhongqi等;《ResearchGate》;第1-10页 *
"基于卷积神经网络的零件识别及姿态估计";李昌明;《中国优秀硕士论文电子期刊网》;第33-43页 *

Also Published As

Publication number Publication date
CN116597151A (en) 2023-08-15

Similar Documents

Publication Publication Date Title
Hafiz et al. A survey on instance segmentation: state of the art
Lim et al. Learning multi-scale features for foreground segmentation
WO2022000426A1 (en) Method and system for segmenting moving target on basis of twin deep neural network
Spencer et al. Defeat-net: General monocular depth via simultaneous unsupervised representation learning
Pandey et al. Hybrid deep neural network with adaptive galactic swarm optimization for text extraction from scene images
Zhang et al. Self-supervised visual representation learning from hierarchical grouping
CN110929593B (en) Real-time significance pedestrian detection method based on detail discrimination
Zhou et al. Multi-scale context for scene labeling via flexible segmentation graph
Le et al. Deeply Supervised 3D Recurrent FCN for Salient Object Detection in Videos.
Gong et al. Advanced image and video processing using MATLAB
Geng et al. Using deep learning in infrared images to enable human gesture recognition for autonomous vehicles
Li et al. ComNet: Combinational neural network for object detection in UAV-borne thermal images
Hurtado et al. Semantic scene segmentation for robotics
Perreault et al. FFAVOD: Feature fusion architecture for video object detection
CN113221770A (en) Cross-domain pedestrian re-identification method and system based on multi-feature hybrid learning
Maggiolo et al. Improving maps from CNNs trained with sparse, scribbled ground truths using fully connected CRFs
Liang et al. Cross-scene foreground segmentation with supervised and unsupervised model communication
Tsutsui et al. Distantly supervised road segmentation
Muzammul et al. A survey on deep domain adaptation and tiny object detection challenges, techniques and datasets
Sravani et al. Robust detection of video text using an efficient hybrid method via key frame extraction and text localization
CN116597151B (en) Unsupervised semantic segmentation method based on fine-grained feature grouping
Patel et al. A novel approach for detecting number plate based on overlapping window and region clustering for Indian conditions
Girisha et al. Semantic segmentation with enhanced temporal smoothness using crf in aerial videos
Seth et al. State of the art techniques to advance deep networks for semantic segmentation: A systematic review
Moussaoui et al. Enhancing automated vehicle identification by integrating YOLO v8 and OCR techniques for high-precision license plate detection and recognition

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant