CN113554657A - Super-pixel segmentation method and system based on attention mechanism and convolutional neural network - Google Patents


Info

Publication number
CN113554657A
CN113554657A
Authority
CN
China
Prior art keywords: superpixel, pixel, segmentation, image
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202110943140.2A
Other languages
Chinese (zh)
Inventor
王晶晶
栾振业
于子舒
任金雯
张立人
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shandong Aowande Information Technology Co ltd
Original Assignee
Shandong Aowande Information Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shandong Aowande Information Technology Co ltd
Priority to CN202110943140.2A
Publication of CN113554657A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 7/00 Image analysis
    • G06T 7/10 Segmentation; Edge detection
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/08 Learning methods


Abstract

The present disclosure provides a superpixel segmentation method and system based on an attention mechanism and a convolutional neural network. First, an SE-Net attention module is introduced into a convolutional neural network to construct a superpixel segmentation model. Second, the resulting squeeze-and-excitation network is trained end to end. The superpixels for a particular task are then learned with a flexible loss function. Finally, superpixel segmentation of an image can be performed by the trained network, which greatly reduces dimensionality and eliminates some abnormal pixels. A better segmentation result can be obtained by this superpixel segmentation algorithm based on an attention mechanism and a convolutional neural network, providing a method with clear advantages for the field of image superpixel segmentation.

Description

Super-pixel segmentation method and system based on attention mechanism and convolutional neural network
Technical Field
The disclosure belongs to the technical field of image processing, and particularly relates to a superpixel segmentation method and system based on an attention mechanism and a convolutional neural network.
Background
The statements in this section merely provide background information related to the present disclosure and may not necessarily constitute prior art.
Superpixel segmentation is an important pre-processing step in computer image processing. In computer vision, a superpixel is a larger element, formed from pixels with similar characteristics, that is more representative than the individual pixels. This new element becomes the basic unit for other image processing algorithms. It not only greatly reduces dimensionality but also eliminates some abnormal pixels. Extensive experimental analysis shows that superpixel segmentation methods based on deep learning outperform existing superpixel algorithms on traditional segmentation benchmarks and can also learn superpixels for other tasks. Furthermore, a deep learning network can be easily integrated into a downstream deep network, thereby improving performance. Currently, owing to their representativeness and computational efficiency, superpixels are widely used in computer vision algorithms such as object detection, semantic segmentation, saliency estimation, optical flow estimation, depth estimation, and tracking.
The inventor finds that the existing super pixel segmentation method has the following defects:
(1) Most gradient-based superpixel methods start from an initial clustering of pixels and iteratively update the clusters by gradient changes until certain criteria are met to form the superpixels.
(2) SLIC (Simple Linear Iterative Clustering) is a superpixel segmentation algorithm based on K-means clustering that can control the number and compactness of superpixel blocks. However, this method only considers the color and coordinate relationship between each pixel and the seed points, and does not consider the relationship between pixels and boundaries, so its fit to image boundaries is poor.
(3) DBSCAN (Density-Based Spatial Clustering of Applications with Noise) is a superpixel segmentation algorithm based on density clustering. Because the DBSCAN clustering algorithm can find clusters of arbitrary shape, it has good segmentation potential for objects with complex, irregular shapes; however, it does not consider the spatial relationship between pixels and seed points, so the resulting superpixels are irregular in shape.
Disclosure of Invention
To solve the above problems, the present disclosure provides a superpixel segmentation method and system based on an attention mechanism and a convolutional neural network.
According to a first aspect of the embodiments of the present disclosure, there is provided a superpixel segmentation method based on an attention mechanism and a convolutional neural network, including:
acquiring an image to be subjected to superpixel segmentation;
inputting the image into a pre-trained super-pixel segmentation model to obtain a predicted super-pixel association diagram, and determining an image super-pixel segmentation result based on the super-pixel association diagram;
the superpixel segmentation model adopts an encoder-decoder design, the encoder comprises a plurality of convolution layers, and the image passes through the convolution layers at different levels of the encoder to generate feature maps of different scales; the decoder comprises a plurality of deconvolution layers, and the feature maps generated by the convolution layers at different levels of the encoder are passed to the deconvolution layers at the corresponding levels of the decoder through skip connections; meanwhile, an attention module is arranged before the input of each deconvolution layer in the decoder.
Further, the attention module adopts an SE-Net attention module which comprises a squeezing operation and an excitation operation, wherein the squeezing operation generates a channel descriptor by gathering the feature maps in a spatial dimension; the excitation operation uses a gating mechanism, and takes the embedding of the global distribution generated by the channel descriptor as an input to obtain a set of modulation weights of each channel.
Further, the determining a super-pixel segmentation result of the image based on the super-pixel correlation map specifically includes: obtaining a predicted superpixel association graph through the superpixel segmentation model, wherein the superpixel association graph determines the probability of each pixel being allocated to different grid units on the basis of a soft association graph instead of actual hard pixel allocation; the superpixel segmentation result is obtained by assigning each pixel to a grid cell having the highest probability.
Further, the loss function adopted in the super-pixel segmentation model training process comprises two parts, wherein the first part is used for combining pixels with similar attributes; the second part is used to enforce constraints on the superpixel to remain compact in space.
According to a second aspect of the embodiments of the present disclosure, there is provided a superpixel segmentation system based on an attention mechanism and a convolutional neural network, including:
a data acquisition unit for acquiring an image to be subjected to superpixel segmentation;
a superpixel segmentation unit, which is used for inputting the image into a pre-trained superpixel segmentation model, obtaining a predicted superpixel association diagram, and determining an image superpixel segmentation result based on the superpixel association diagram;
the superpixel segmentation model adopts an encoder-decoder design, the encoder comprises a plurality of convolution layers, and the image passes through the convolution layers at different levels of the encoder to generate feature maps of different scales; the decoder comprises a plurality of deconvolution layers, and the feature maps generated by the convolution layers at different levels of the encoder are passed to the deconvolution layers at the corresponding levels of the decoder through skip connections; meanwhile, an attention module is arranged before the input of each deconvolution layer in the decoder.
According to a third aspect of the embodiments of the present disclosure, there is provided an electronic device, including a memory, a processor, and a computer program stored in the memory and runnable on the processor, wherein the processor implements the superpixel segmentation method based on an attention mechanism and a convolutional neural network when executing the program.
According to a fourth aspect of the embodiments of the present disclosure, there is provided a non-transitory computer readable storage medium having stored thereon a computer program which, when executed by a processor, implements the method for superpixel segmentation based on an attention mechanism and a convolutional neural network.
Compared with the prior art, the beneficial effect of this disclosure is:
(1) the scheme of the disclosure provides a superpixel segmentation method based on an attention mechanism and a convolutional neural network, and the scheme is characterized in that a superpixel segmentation model is constructed by introducing an SE-Net attention module into the convolutional neural network, the scheme utilizes extrusion and excitation networks generated by SE-Net to carry out end-to-end training, and utilizes a flexible loss function to learn superpixels of a specific task; and finally, the superpixel segmentation of the image can be carried out through the trained network, the dimensionality is greatly reduced, and some abnormal pixels are eliminated.
(2) Compared with a traditional FCN, the disclosed scheme adds an attention module to the convolutional neural network, which better models the dependencies between channels and adaptively adjusts the feature response value of each channel. Adding an attention module to the network introduces only a small computational overhead but can greatly improve network performance.
(3) The scheme of the disclosure can effectively improve the efficiency and accuracy of superpixel segmentation by finding the association scores between image pixels and regular grid cells, directly predicting these scores with the squeeze-and-excitation network, and obtaining the final superpixel segmentation result by assigning each pixel to the regular grid cell with the highest probability.
Advantages of additional aspects of the disclosure will be set forth in part in the description which follows, and in part will be obvious from the description, or may be learned by practice of the disclosure.
Drawings
The accompanying drawings, which are included to provide a further understanding of the disclosure, illustrate embodiments of the disclosure and together with the description serve to explain the disclosure and are not to limit the disclosure.
FIG. 1 is a schematic diagram illustrating an overall network structure of a superpixel segmentation method based on an attention mechanism and a convolutional neural network according to a first embodiment of the present disclosure;
FIG. 2 is a schematic diagram of a super-pixel squeeze and excitation network structure according to a first embodiment of the disclosure;
FIG. 3 is a schematic diagram illustrating a configuration of a squeeze and fire module according to a first embodiment of the present disclosure;
FIG. 4 is a diagram illustrating the result of superpixel segmentation on the BSDS500 dataset according to the first embodiment of the disclosure;
fig. 5 is a diagram illustrating a super-pixel segmentation result on the NYUv2 data set according to the first embodiment of the present disclosure.
Detailed Description
The present disclosure is further described with reference to the following drawings and examples.
It should be noted that the following detailed description is exemplary and is intended to provide further explanation of the disclosure. Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this disclosure belongs.
It is noted that the terminology used herein is for the purpose of describing particular embodiments only and is not intended to be limiting of example embodiments according to the present disclosure. As used herein, the singular forms "a", "an" and "the" are intended to include the plural forms as well, and it should be understood that when the terms "comprises" and/or "comprising" are used in this specification, they specify the presence of stated features, steps, operations, devices, components, and/or combinations thereof, unless the context clearly indicates otherwise.
The embodiments and features of the embodiments in the present disclosure may be combined with each other without conflict.
The first embodiment is as follows:
the present embodiment is directed to a superpixel segmentation method based on an attention mechanism and a convolutional neural network.
A superpixel segmentation method based on an attention mechanism and a convolutional neural network comprises the following steps:
acquiring an image to be subjected to superpixel segmentation;
inputting the image into a pre-trained super-pixel segmentation model to obtain a predicted super-pixel association diagram, and determining an image super-pixel segmentation result based on the super-pixel association diagram;
the superpixel segmentation model adopts an encoder-decoder design, the encoder comprises a plurality of convolution layers, and the image passes through the convolution layers at different levels of the encoder to generate feature maps of different scales; the decoder comprises a plurality of deconvolution layers, and the feature maps generated by the convolution layers at different levels of the encoder are passed to the deconvolution layers at the corresponding levels of the decoder through skip connections; meanwhile, an attention module is arranged before the input of each deconvolution layer in the decoder.
Further, the attention module adopts an SE-Net (Squeeze-and-Excitation Networks) attention module which comprises a squeezing operation and an Excitation operation, wherein the squeezing operation generates a channel descriptor by gathering the feature maps in a spatial dimension; the excitation operation uses a gating mechanism, and takes the embedding of the global distribution generated by the channel descriptor as an input to obtain a set of modulation weights of each channel.
Further, the determining a super-pixel segmentation result of the image based on the super-pixel correlation map specifically includes: obtaining a predicted superpixel association graph through the superpixel segmentation model, wherein the superpixel association graph determines the probability of each pixel being allocated to different grid units on the basis of a soft association graph instead of actual hard pixel allocation; the superpixel segmentation result is obtained by assigning each pixel to a grid cell having the highest probability.
Further, the loss function adopted in the super-pixel segmentation model training process comprises two parts, wherein the first part is used for combining pixels with similar attributes; the second part is used to enforce constraints on the superpixel to remain compact in space.
Further, the predetermined loss function is specifically expressed as follows:
L(Q) = \sum_{p} \left( \mathrm{dist}(f(p), f'(p)) + \frac{m}{s} \lVert p - p' \rVert_2 \right)
wherein p is a certain pixel point in the image, p 'is a certain pixel point in the reconstructed image, f (p) is the characteristic of the pixel point, f' (p) is the characteristic of the pixel point in the reconstructed image, dist () represents the difference between the reconstructed characteristic and the original characteristic and the difference between the reconstructed position and the original position, m is the weight for balancing the two items, and s is the sampling interval of the superpixel.
Specifically, for ease of understanding, the embodiments of the present disclosure are described in detail below with reference to the accompanying drawings:
as shown in fig. 1, the scheme of the present disclosure employs an encoder-decoder design to predict a superpixel association graph Q through a superpixel segmentation model with a skip connection and an attention module. The encoder takes a color image as input and generates a high-level feature map through a convolutional network. The decoder then gradually upsamples the feature map by deconvolution. An attention module is added before each deconvolution to pay more attention to the feature weight of each super pixel, so that the segmentation accuracy is improved and the final prediction is carried out.
For any given transformation that maps an input to a feature map, such as a convolution operation, the scheme described in this disclosure places a corresponding attention module to perform feature recalibration before each convolution operation. This embodiment adopts an SE-Net attention module comprising a squeeze operation and an excitation operation. The input features of the convolution operation first undergo the squeeze operation, which aggregates the feature maps over the spatial dimensions to generate a channel descriptor; the role of this descriptor is to produce an embedding of the global distribution of channel feature responses, so that information from the network's global receptive field can be used by all its layers. The squeeze operation is followed by an excitation operation in the form of a simple gating mechanism that takes the embedding produced by the squeeze operation as input and produces a set of per-channel modulation weights. These weights are applied to the feature map to produce the output of the attention module, which can be fed directly into subsequent layers of the network. We can then write the output as U = [u_1, u_2, \ldots, u_C], where:
u_c = v_c * X = \sum_{s=1}^{C'} v_c^{s} * x^{s}

where * denotes convolution, v_c = [v_c^1, v_c^2, \ldots, v_c^{C'}], X = [x^1, x^2, \ldots, x^{C'}], and u_c \in R^{H \times W}. Each v_c^{s} is a two-dimensional spatial kernel that acts on the corresponding channel of X. Here X is the input, v_c is the c-th convolution kernel, v_c^1 is its first channel parameter, v_c^2 its second, and v_c^{C'} its C'-th; C is the number of convolution kernels and C' the number of channels of the input feature map; u_1, u_2, and u_c are the outputs of convolution kernels 1, 2, and c; H and W are the height and width of the output feature map.
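The convolution and squeeze-and-excitation steps above can be traced numerically. This is a minimal NumPy illustration with random inputs and random, untrained gating weights (the tiny tensor sizes and the shapes of the two gating layers are assumptions for illustration, not the patent's configuration):

```python
import numpy as np

def conv2d_single(x, k):
    """Valid 2D cross-correlation of one channel x with one 2D kernel k."""
    kh, kw = k.shape
    h, w = x.shape
    out = np.zeros((h - kh + 1, w - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = (x[i:i + kh, j:j + kw] * k).sum()
    return out

rng = np.random.default_rng(1)
C_in, C_out, H, W, K = 2, 3, 6, 6, 3
X = rng.standard_normal((C_in, H, W))            # input with C' = 2 channels
V = rng.standard_normal((C_out, C_in, K, K))     # C = 3 kernels, each with C' channels

# u_c = v_c * X: sum of per-channel 2D convolutions over the input channels.
U = np.stack([sum(conv2d_single(X[s], V[c, s]) for s in range(C_in))
              for c in range(C_out)])

# Squeeze: aggregate each u_c over the spatial dimensions into a channel descriptor z.
z = U.mean(axis=(1, 2))

# Excitation: a small gating network maps z to per-channel weights in (0, 1).
W1 = rng.standard_normal((C_out, C_out)) * 0.1
W2 = rng.standard_normal((C_out, C_out)) * 0.1
weights = 1.0 / (1.0 + np.exp(-(W2 @ np.maximum(W1 @ z, 0.0))))

# Recalibration: rescale each channel of U by its learned weight.
U_recal = U * weights[:, None, None]
```

Because the excitation ends in a sigmoid, every modulation weight lies strictly between 0 and 1, so channels are attenuated rather than inverted.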
Further, the disclosed scheme uses a soft association map Q \in R^{H \times W \times |N_p|} (i.e., a superpixel association map that represents the probability that each pixel belongs to each candidate superpixel) instead of a hard pixel assignment G. For example:
\sum_{s \in N_p} q_s(p) = 1
where s is the center of a superpixel, N_p is the set of superpixels around pixel p, and q_s(p) denotes the probability that pixel p is assigned to superpixel s \in N_p. Finally, the superpixel segmentation is obtained by assigning each pixel to the grid cell with the highest probability: s^* = \arg\max_s q_s(p). Unlike methods based on a fully convolutional neural network, the present method adds an attention mechanism to the original network structure, which effectively improves segmentation accuracy.
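A minimal sketch of converting a soft association map Q into a hard segmentation via s^* = argmax_s q_s(p). A random softmax output stands in for the network's prediction, and the image and neighbourhood sizes are made up for illustration:

```python
import numpy as np

rng = np.random.default_rng(2)
H, W, Np = 4, 4, 9   # 9 candidate grid cells per pixel, e.g. a 3x3 neighbourhood

# Softmax over the candidate superpixels yields the soft association map Q.
logits = rng.standard_normal((H, W, Np))
Q = np.exp(logits) / np.exp(logits).sum(axis=-1, keepdims=True)

# Hard segmentation: each pixel goes to its highest-probability grid cell.
labels = Q.argmax(axis=-1)
```

Each pixel's row of Q sums to one, so argmax picks a valid candidate index in [0, Np).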
Further, the scheme of the present disclosure offers good flexibility in the choice of loss function. In general, we denote by f(p) the pixel property that we wish the superpixel to retain; in this embodiment, f(p) comprises a 3-dimensional CIELab color vector and an N-dimensional one-hot semantic label vector, where N is the number of classes. We further use the image coordinates p = [x, y]^T to indicate the location of a pixel. Given the predicted superpixel association map Q, we can compute the center of any superpixel, with u_s as its attribute vector and l_s its position vector, as follows:
u_s = \frac{\sum_p f(p)\, q_s(p)}{\sum_p q_s(p)} \quad (2)

l_s = \frac{\sum_p p\, q_s(p)}{\sum_p q_s(p)} \quad (3)
where N_p is the set of superpixels surrounding pixel p, and q_s(p) is the network-predicted probability that p is associated with superpixel s. In equations (2) and (3), each sum runs over all pixels that may be assigned to s. The reconstructed property and position of any pixel p are then given by:
f'(p) = \sum_{s \in N_p} u_s\, q_s(p), \qquad p' = \sum_{s \in N_p} l_s\, q_s(p)
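The centre and reconstruction computations described above can be sketched as follows. For simplicity this toy example associates every pixel with every superpixel rather than only a local neighbourhood N_p, and uses random features and associations in place of real network outputs:

```python
import numpy as np

rng = np.random.default_rng(3)
H, W, S = 4, 4, 4   # 16 pixels, 4 superpixels (global association for simplicity)

f = rng.standard_normal((H * W, 3))  # per-pixel attributes f(p), e.g. CIELab colour
pos = np.stack(np.meshgrid(np.arange(H), np.arange(W), indexing="ij"),
               axis=-1).reshape(-1, 2).astype(float)  # pixel coordinates p = [x, y]^T

# Soft associations q_s(p): softmax rows over the S superpixels.
logits = rng.standard_normal((H * W, S))
Q = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)

# Eq. (2)/(3): probability-weighted averages give each superpixel's centre.
u_s = (Q.T @ f) / Q.sum(axis=0)[:, None]    # attribute vector per superpixel
l_s = (Q.T @ pos) / Q.sum(axis=0)[:, None]  # position vector per superpixel

# Reconstruction: f'(p) = sum_s u_s q_s(p),  p' = sum_s l_s q_s(p).
f_rec = Q @ u_s
p_rec = Q @ l_s
```

The reconstruction is a soft mixture of the centres, so f_rec and p_rec have the same shapes as the original per-pixel attributes and coordinates.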
finally, our general expression of the loss function has two terms. The first term encourages trained models to combine pixels with similar attributes, and the second term enforces that superpixels are spatially compact.
L(Q) = \sum_p \left( \mathrm{dist}(f(p), f'(p)) + \frac{m}{s} \lVert p - p' \rVert_2 \right)
The superpixel segmentation model is trained based on the determined loss function, and superpixel segmentation of an image is realized with the trained model.
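A toy computation of the two-term loss under stated assumptions: random features and associations in place of network outputs, squared Euclidean distance as one possible choice of dist(.,.), and made-up values for the balance weight m and sampling interval s:

```python
import numpy as np

rng = np.random.default_rng(4)
H, W, S = 4, 4, 4
m, s_interval = 0.1, 2.0  # assumed balance weight m and superpixel sampling interval s

f = rng.standard_normal((H * W, 3))
pos = np.stack(np.meshgrid(np.arange(H), np.arange(W), indexing="ij"),
               axis=-1).reshape(-1, 2).astype(float)
logits = rng.standard_normal((H * W, S))
Q = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)

# Superpixel centres and per-pixel reconstructions, as in eqs. (2)-(4).
centers_f = (Q.T @ f) / Q.sum(axis=0)[:, None]
centers_p = (Q.T @ pos) / Q.sum(axis=0)[:, None]
f_rec, p_rec = Q @ centers_f, Q @ centers_p

# First term pulls pixels with similar attributes together;
# second term keeps superpixels spatially compact.
dist_term = np.square(f - f_rec).sum(axis=1)      # dist(f(p), f'(p))
compact_term = np.linalg.norm(pos - p_rec, axis=1)  # ||p - p'||_2
loss = (dist_term + (m / s_interval) * compact_term).sum()
```

Both terms are non-negative, so the loss is bounded below by zero; minimizing it over the association logits is what the end-to-end training performs.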
Further, to verify the advantages of the present disclosure for image superpixel segmentation, the scheme described in this embodiment was subjected to extensive superpixel segmentation experiments on the BSDS500 and NYUv2 datasets. The model is first trained on the training set, then validated on the validation set, and finally evaluated on the test set. The experimental results are shown in fig. 4 and fig. 5, respectively. As can be seen from fig. 4 and fig. 5, the attention-mechanism-based convolutional neural network superpixel segmentation method established in the present disclosure achieves a good effect on image superpixel segmentation, which not only greatly reduces image dimensionality but also eliminates some abnormal pixels. The superpixel segmentation method based on the attention mechanism is effective, improves the computational efficiency of downstream tasks such as object detection, semantic segmentation, saliency estimation, optical flow estimation, depth estimation, and tracking, and has practical value.
The second embodiment is as follows:
the present embodiment is directed to a superpixel segmentation system based on attention mechanism and convolutional neural network.
A superpixel segmentation system based on an attention mechanism and a convolutional neural network, comprising:
a data acquisition unit for acquiring an image to be subjected to superpixel segmentation;
a superpixel segmentation unit, which is used for inputting the image into a pre-trained superpixel segmentation model, obtaining a predicted superpixel association diagram, and determining an image superpixel segmentation result based on the superpixel association diagram;
the superpixel segmentation model adopts an encoder-decoder design, the encoder comprises a plurality of convolution layers, and the image passes through the convolution layers at different levels of the encoder to generate feature maps of different scales; the decoder comprises a plurality of deconvolution layers, and the feature maps generated by the convolution layers at different levels of the encoder are passed to the deconvolution layers at the corresponding levels of the decoder through skip connections; meanwhile, an attention module is arranged before the input of each deconvolution layer in the decoder.
Further, the attention module adopts an SE-Net attention module which comprises a squeezing operation and an excitation operation, wherein the squeezing operation generates a channel descriptor by gathering the feature maps in a spatial dimension; the excitation operation uses a gating mechanism, and takes the embedding of the global distribution generated by the channel descriptor as an input to obtain a set of modulation weights of each channel.
Further, the determining a super-pixel segmentation result of the image based on the super-pixel correlation map specifically includes: obtaining a predicted superpixel association graph through the superpixel segmentation model, wherein the superpixel association graph determines the probability of each pixel being allocated to different grid units on the basis of a soft association graph instead of actual hard pixel allocation; the superpixel segmentation result is obtained by assigning each pixel to a grid cell having the highest probability.

In further embodiments, there is also provided:
an electronic device comprising a memory, a processor, and computer instructions stored in the memory and executable on the processor; when executed by the processor, the computer instructions perform the method of the first embodiment. For brevity, no further description is provided herein.
It should be understood that in this embodiment, the processor may be a central processing unit (CPU), or another general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, and so on. A general-purpose processor may be a microprocessor, or the processor may be any conventional processor.
The memory may include both read-only memory and random access memory, and may provide instructions and data to the processor, and a portion of the memory may also include non-volatile random access memory. For example, the memory may also store device type information.
A computer readable storage medium storing computer instructions which, when executed by a processor, perform the method of embodiment one.
The method of the first embodiment may be implemented directly by a hardware processor, or by a combination of hardware and software modules in the processor. The software modules may be located in RAM, flash memory, ROM, PROM, EPROM, registers, or other storage media well known in the art. The storage medium is located in a memory, and the processor reads the information in the memory and completes the steps of the method in combination with its hardware. To avoid repetition, details are not described here.
Those of ordinary skill in the art will appreciate that the various illustrative units and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware or as combinations of computer software and electronic hardware. Whether such functionality is implemented as hardware or software depends upon the particular application and the design constraints imposed on the implementation. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present disclosure.
The superpixel segmentation method and system based on the attention mechanism and the convolutional neural network described above are practical and have broad application prospects.
The above description is only a preferred embodiment of the present disclosure and is not intended to limit the present disclosure, and various modifications and changes may be made to the present disclosure by those skilled in the art. Any modification, equivalent replacement, improvement and the like made within the spirit and principle of the present disclosure should be included in the protection scope of the present disclosure.

Claims (10)

1. A superpixel segmentation method based on an attention mechanism and a convolutional neural network is characterized by comprising the following steps:
acquiring an image to be subjected to superpixel segmentation;
inputting the image into a pre-trained super-pixel segmentation model to obtain a predicted super-pixel association diagram, and determining an image super-pixel segmentation result based on the super-pixel association diagram;
the superpixel segmentation model adopts an encoder-decoder design, the encoder comprises a plurality of convolution layers, and the image passes through the convolution layers at different levels of the encoder to generate feature maps of different scales; the decoder comprises a plurality of deconvolution layers, and the feature maps generated by the convolution layers at different levels of the encoder are passed to the deconvolution layers at the corresponding levels of the decoder through skip connections; meanwhile, an attention module is arranged before the input of each deconvolution layer in the decoder.
2. The method of claim 1, wherein the attention module employs an SE-Net attention module comprising a squeeze operation and an excitation operation, the squeeze operation generating a channel descriptor by clustering feature maps in spatial dimensions; the excitation operation uses a gating mechanism, and takes the embedding of the global distribution generated by the channel descriptor as an input to obtain a set of modulation weights of each channel.
3. The method for superpixel segmentation based on an attention mechanism and a convolutional neural network as claimed in claim 1, wherein said determining an image superpixel segmentation result based on said superpixel association map specifically comprises: obtaining a predicted superpixel association graph through the superpixel segmentation model, wherein the superpixel association graph determines the probability of each pixel being allocated to different grid units on the basis of a soft association graph instead of actual hard pixel allocation; the superpixel segmentation result is obtained by assigning each pixel to a grid cell having the highest probability.
4. The superpixel segmentation method based on an attention mechanism and a convolutional neural network as claimed in claim 1, wherein the loss function used in training the superpixel segmentation model comprises two parts: the first part encourages pixels with similar properties to be grouped together; the second part enforces a constraint that keeps superpixels spatially compact.
5. The method of claim 4, wherein the loss function is expressed as follows:

$$L = \sum_{p}\Big[\,\operatorname{dist}\big(f(p),\,f'(p)\big) + \frac{m}{s}\,\operatorname{dist}\big(p,\,p'\big)\Big]$$

wherein p is a pixel in the image, p' is the corresponding pixel position in the reconstructed image, f(p) is the feature of pixel p, f'(p) is the reconstructed feature, dist(·,·) measures the difference between the reconstructed and original features and between the reconstructed and original positions, m is a weight balancing the two terms, and s is the superpixel sampling interval.
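The two-part loss of claim 5 can be sketched numerically. This is a hedged sketch under assumptions: dist() is taken as the L2 distance for both the feature and position terms (one common choice; the patent text does not fix it), and the reconstructions `f_rec`/`pos_rec` are toy perturbations standing in for what the association map would produce.

```python
import numpy as np

rng = np.random.default_rng(2)

def superpixel_loss(f, f_rec, pos, pos_rec, m=0.1, s=4):
    """Two-part superpixel loss.

    First term: feature reconstruction error, pulling pixels with similar
    properties into the same superpixel. Second term: spatial
    reconstruction error scaled by m/s, keeping superpixels compact.
    """
    feat_term = np.linalg.norm(f - f_rec, axis=0).sum()   # sum over pixels
    pos_term = np.linalg.norm(pos - pos_rec, axis=0).sum()
    return feat_term + (m / s) * pos_term

D, H, W = 3, 8, 8
f = rng.standard_normal((D, H, W))                 # per-pixel features
f_rec = f + 0.01 * rng.standard_normal((D, H, W))  # toy reconstructed features
ys, xs = np.mgrid[0:H, 0:W]
pos = np.stack([ys, xs]).astype(float)             # per-pixel positions
pos_rec = pos + 0.5                                # toy reconstructed positions
loss = superpixel_loss(f, f_rec, pos, pos_rec)
print(loss > 0.0)                                  # True
```

A larger m tightens the compactness constraint relative to feature similarity, while a larger sampling interval s relaxes it, consistent with the balancing role the claim assigns to m and s.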
6. A superpixel segmentation system based on an attention mechanism and a convolutional neural network, comprising:
a data acquisition unit for acquiring an image to be subjected to superpixel segmentation;
a superpixel segmentation unit for inputting the image into a pre-trained superpixel segmentation model to obtain a predicted superpixel association map, and determining an image superpixel segmentation result based on the superpixel association map;
the superpixel segmentation model adopts an encoder-decoder design: the encoder comprises a plurality of convolutional layers, and the image passes through convolutional layers at different levels of the encoder to generate feature maps of different scales; the decoder comprises a plurality of deconvolution layers, and the feature maps generated by the convolutional layers at different levels of the encoder are passed through skip connections to the deconvolution layers at the corresponding levels of the decoder; in addition, an attention module is placed before the input of each deconvolution layer in the decoder.
7. The system of claim 6, wherein the attention module is an SE-Net attention module comprising a squeeze operation and an excitation operation; the squeeze operation generates a channel descriptor by aggregating the feature maps across their spatial dimensions; the excitation operation uses a gating mechanism that takes the embedding of the global distribution produced by the channel descriptor as input and outputs a set of per-channel modulation weights.
8. The superpixel segmentation system based on an attention mechanism and a convolutional neural network as claimed in claim 6, wherein determining the image superpixel segmentation result based on the superpixel association map specifically comprises: obtaining the predicted superpixel association map from the superpixel segmentation model, wherein the association map is a soft assignment rather than an actual hard pixel assignment, giving the probability of each pixel being allocated to each of the different grid cells; the superpixel segmentation result is obtained by assigning each pixel to the grid cell with the highest probability.
9. An electronic device comprising a memory, a processor, and a computer program stored on the memory and executable on the processor, wherein the processor, when executing the program, implements the superpixel segmentation method based on an attention mechanism and a convolutional neural network as claimed in any one of claims 1-5.
10. A non-transitory computer-readable storage medium having a computer program stored thereon, wherein the program, when executed by a processor, implements the superpixel segmentation method based on an attention mechanism and a convolutional neural network as claimed in any one of claims 1-5.
CN202110943140.2A 2021-08-17 2021-08-17 Super-pixel segmentation method and system based on attention mechanism and convolutional neural network Pending CN113554657A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110943140.2A CN113554657A (en) 2021-08-17 2021-08-17 Super-pixel segmentation method and system based on attention mechanism and convolutional neural network

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110943140.2A CN113554657A (en) 2021-08-17 2021-08-17 Super-pixel segmentation method and system based on attention mechanism and convolutional neural network

Publications (1)

Publication Number Publication Date
CN113554657A true CN113554657A (en) 2021-10-26

Family

ID=78133926

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110943140.2A Pending CN113554657A (en) 2021-08-17 2021-08-17 Super-pixel segmentation method and system based on attention mechanism and convolutional neural network

Country Status (1)

Country Link
CN (1) CN113554657A (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114037715A (en) * 2021-11-09 2022-02-11 北京字节跳动网络技术有限公司 Image segmentation method, device, equipment and storage medium
CN115886839A (en) * 2022-12-19 2023-04-04 广州华见智能科技有限公司 Diagnosis system based on brain wave analysis
CN117349545A (en) * 2023-12-04 2024-01-05 中国电子科技集团公司第五十四研究所 Target space-time distribution prediction method based on environment constraint grid


Similar Documents

Publication Publication Date Title
WO2020177651A1 (en) Image segmentation method and image processing device
CN113554657A (en) Super-pixel segmentation method and system based on attention mechanism and convolutional neural network
CN111860398B (en) Remote sensing image target detection method and system and terminal equipment
CN112529146B (en) Neural network model training method and device
Kong et al. Pixel-wise attentional gating for scene parsing
Kong et al. Pixel-wise attentional gating for parsimonious pixel labeling
CN111046939A (en) CNN (CNN) class activation graph generation method based on attention
CN112232355B (en) Image segmentation network processing method, image segmentation device and computer equipment
JP2023523029A (en) Image recognition model generation method, apparatus, computer equipment and storage medium
CN114332578A (en) Image anomaly detection model training method, image anomaly detection method and device
WO2023116632A1 (en) Video instance segmentation method and apparatus based on spatio-temporal memory information
US10733498B1 (en) Parametric mathematical function approximation in integrated circuits
CN113179421B (en) Video cover selection method and device, computer equipment and storage medium
CN109523546A (en) A kind of method and device of Lung neoplasm analysis
US20210350230A1 (en) Data dividing method and processor for convolution operation
Saltori et al. Gipso: Geometrically informed propagation for online adaptation in 3d lidar segmentation
WO2020062299A1 (en) Neural network processor, data processing method and related device
CN114022359A (en) Image super-resolution model training method and device, storage medium and equipment
CN114067389A (en) Facial expression classification method and electronic equipment
CN115457492A (en) Target detection method and device, computer equipment and storage medium
Wang et al. Superpixel segmentation with squeeze-and-excitation networks
CN113158970B (en) Action identification method and system based on fast and slow dual-flow graph convolutional neural network
CN109711315A (en) A kind of method and device of Lung neoplasm analysis
CN112598663B (en) Grain pest detection method and device based on visual saliency
CN115082840A (en) Action video classification method and device based on data combination and channel correlation

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination