CN114359554A - Image semantic segmentation method based on multi-receptive-field context semantic information - Google Patents

Image semantic segmentation method based on multi-receptive-field context semantic information Download PDF

Info

Publication number
CN114359554A
CN114359554A · Application CN202111413182.1A
Authority
CN
China
Prior art keywords
image
receptive
feature
semantic information
convolution
Prior art date
Legal status: Pending (the legal status is an assumption, not a legal conclusion; Google has not performed a legal analysis.)
Application number
CN202111413182.1A
Other languages
Chinese (zh)
Inventor
刘亮亮
常靖
Current Assignee: Henan Agricultural University (the listed assignees may be inaccurate; Google has not performed a legal analysis.)
Original Assignee
Henan Agricultural University
Priority date
Filing date
Publication date
Application filed by Henan Agricultural University
Priority application: CN202111413182.1A
Publication: CN114359554A (pending)

Landscapes

  • Image Analysis (AREA)

Abstract

The invention discloses an image semantic segmentation method based on multi-receptive-field context semantic information, comprising the following steps: step 1, converting an input image into a pixel matrix through a convolution operation; step 2, converting the same pixel matrix into multiple feature maps carrying multi-receptive-field context semantic information, using dilated convolutions with different dilation rates; step 3, performing feature extraction and downsampling on these feature maps through transformer encoders in different subnetworks to obtain several downsampled feature maps with different receptive fields; step 4, upsampling the downsampled feature maps step by step through a decoder to obtain feature maps of the same size and dimension and generate the final feature-fusion map; and step 5, completing the image segmentation prediction from the feature-fusion map through a convolutional neural network. The method can be applied effectively to image semantic segmentation: deep low-resolution features and fine-grained features are not lost, memory consumption is low, the effect is marked, and the method is easy to popularize.

Description

Image semantic segmentation method based on multi-receptive-field context semantic information
Technical Field
The invention belongs to the technical field of image recognition, and particularly relates to an image semantic segmentation method based on multi-receptive-field context semantic information.
Background
Image semantic segmentation is the basis of image analysis and of many applications, such as object recognition in autonomous driving systems, unmanned-aerial-vehicle applications, and wearable devices. An image is composed of pixels, and semantic segmentation, as the name implies, groups or segments pixels according to the differences in semantic meaning they express. The goal of semantic segmentation is to label the class to which each pixel of the image belongs. Thus, semantic segmentation identifies target regions in an image at the pixel level, i.e., it annotates the object class of every pixel in the image.
Many segmentation methods have emerged during the development of image segmentation, such as simple pixel-level thresholding methods, pixel-clustering-based segmentation methods, and graph-partitioning segmentation methods, but these methods struggle to meet current demands for high-precision segmentation. With the successive proposal of convolutional-neural-network-based semantic segmentation methods, represented by the fully convolutional network (FCN), the architectures of advanced image semantic segmentation models are now almost all built on convolutional networks and generally follow one pattern: the network is divided into an encoder and a decoder; the encoder is usually an image classification network, also called the backbone, pre-trained on a large corpus (e.g., ImageNet); the decoder aggregates the features from the encoder and converts them into the final feature map for prediction. Previous work on segmentation architectures has generally focused on the decoder and its aggregation strategy, but in practice the size of the image features and the backbone architecture are critical to the whole model, since information lost in the encoder cannot easily be recovered in the decoder. In addition, existing models pay little attention to acquiring diverse feature information or to improving the extraction and screening of features in the encoder.
In the prior art, an encoder gradually downsamples the input image through operations such as convolution and pooling while extracting feature information; downsampling gradually enlarges the model's receptive field and abstracts low-level features into high-level features. However, downsampling has significant drawbacks, especially in pixel-level prediction tasks: low-resolution features and fine-grained features are lost in the deeper layers of the model, and such lost information is difficult to recover in the decoder. Although pixel feature resolution and granularity may matter little for tasks such as image classification, they are vital for pixel-based segmentation. Ideally, the model should minimize the loss of feature information during downsampling, i.e., keep the resolution of the output feature map equal or close to the resolution of the input image, with no apparent difference between input and output.
In addition, in the prior art, convolution and nonlinear modules together form the basic computing unit of an image analysis network. Convolution is a linear operator with a limited receptive field, and the limited receptive field and limited expressive power of a single convolution require stacking into a very deep architecture to obtain a sufficiently wide context and sufficiently strong representational capacity. However, this produces many intermediate representations and consumes a large amount of memory. To keep memory consumption at a level feasible on existing computer architectures, the intermediate representations have to be downsampled.
Disclosure of Invention
The technical problem to be solved by the present invention is to provide an image semantic segmentation method based on multi-receptive-field context semantic information that overcomes the above deficiencies in the prior art. The method has simple steps, a reasonable design, and is convenient to implement, and can be applied effectively to image semantic segmentation. It adopts an encoder-decoder structure; uses multi-receptive-field dilated convolutions to obtain features of different resolutions from the same image; uses the transformer as the basic computational building block of the encoder, reassembling image-like feature representations at various resolutions; and uses a convolutional decoder to progressively combine these representations into the final pixel prediction, completing the identification and segmentation of the target regions. At the same time, the intermediate representations are reduced, deep low-resolution features and fine-grained features are not lost, memory consumption is low, the effect is marked, and the method is easy to popularize.
In order to solve the technical problems, the invention adopts the technical scheme that: an image semantic segmentation method based on multi-receptive field context semantic information comprises the following steps:
Step 1: convert an input image into a pixel matrix through a convolution operation;
Step 2: convert the same pixel matrix into multiple feature maps carrying multi-receptive-field context semantic information, using dilated convolutions with different dilation rates;
Step 3: perform feature extraction and downsampling on the feature maps through transformer encoders in different subnetworks, obtaining several downsampled feature maps with different receptive fields;
Step 4: upsample the downsampled feature maps step by step through a decoder to obtain feature maps of the same size and dimension, and generate the final feature-fusion map;
Step 5: complete the image segmentation prediction from the feature-fusion map through a convolutional neural network.
In the above image semantic segmentation method based on multi-receptive-field context semantic information, the output of the dilated convolution in step 2 is:

y_i = Σ_{j=0}^{m-1} w[j] · x_{i + r·j}

where y_i denotes the i-th output of the dilated convolution, the dilated-convolution kernel has size k × k with dilation rate r, x_i is the i-th input feature map of the dilated convolution placed before the transformer subnetwork, and m is the length of the filter matrix w[k] with kernel size k × k.
In the above image semantic segmentation method based on multi-receptive-field context semantic information, the transformer in step 3 comprises a multi-head self-attention model and a multilayer perceptron model, and the inputs of both models are normalized.
In the above image semantic segmentation method based on multi-receptive-field context semantic information, the output of the multi-head self-attention model is:

Y_out = concat[y_1, y_2, …, y_i, …, y_{h-1}]
y_i = Attention(qW_i^q, jW_i^j, dW_i^d)

where Y_out ∈ R^{q×j×d}, concat[·] denotes the concatenation operation, i ∈ [1, h-1], h is the number of self-attention heads, each head has its own set of learnable weight matrices (W_i^q, W_i^j, W_i^d), W ∈ R^{q×j×d} is the projection weight matrix, and q, j, d are the three dimensions of the first feature map.
In the above image semantic segmentation method based on multi-receptive-field context semantic information, a reshaping layer is arranged at the output of the multilayer perceptron model; the reshaping layer is a reshape layer used to change the dimensions of the input data.
In the above image semantic segmentation method based on multi-receptive-field context semantic information, the step-by-step upsampling performed by the decoder in step 4 to obtain feature maps of the same size and dimension specifically comprises feature fusion between the encoder and the decoder, and fusion of the decoder outputs.
In the above image semantic segmentation method based on multi-receptive-field context semantic information, the feature fusion between the encoder and the decoder specifically comprises: fusing the feature maps between corresponding layers of the encoder and the decoder through a skip-connection operation; the skip connections reduce gradient vanishing and network degradation.
In the above image semantic segmentation method based on multi-receptive-field context semantic information, the fusion of the decoder outputs specifically comprises: the decoder progressively upsamples the feature maps from the different encoders layer by layer, outputs features of the same size and dimension, and then fuses the output feature maps through a concat() operation.
In the above image semantic segmentation method based on multi-receptive-field context semantic information, in step 5 the convolutional neural network uses a softmax activation function to complete the image segmentation prediction.
Compared with the prior art, the invention has the following advantages:
1. The method has simple steps, a reasonable design, and is convenient to implement.
2. The invention adopts an encoder-decoder framework: multi-receptive-field dilated convolutions obtain features of different sizes and resolutions from the same image; the transformer serves as the basic computational building block of the encoder, reassembling image-like feature representations at various resolutions; and a convolutional decoder progressively combines these representations into the final pixel prediction to complete the identification and segmentation of the target regions, while reducing the intermediate representations.
3. After computing the initial image embedding, the transformer abandons explicit downsampling and keeps a representation of constant dimension through all processing stages, so that fine-grained pixel predictions remain consistent with the global prediction.
4. The invention has marked advantages in accuracy and training efficiency. Architecturally, it is a simple and extensible framework: the multi-receptive-field spatial features and the transformer encoders can be added or removed according to the actual situation. Compared with other semantic segmentation methods, the introduction of multiple receptive fields and the transformer improves the generalization of the model.
5. The method can be applied effectively to image semantic segmentation: deep low-resolution features and fine-grained features are not lost, memory consumption is low, the effect is marked, and the method is easy to popularize.
In conclusion, the method has simple steps, a reasonable design, and is convenient to implement; it can be applied effectively to image semantic segmentation, does not lose deep low-resolution or fine-grained features, consumes little memory, has a marked effect, and is easy to popularize.
The technical solution of the present invention is further described in detail by the accompanying drawings and embodiments.
Drawings
FIG. 1 is a flow chart of the method of the present invention.
Detailed Description
As shown in FIG. 1, the image semantic segmentation method based on multi-receptive-field context semantic information of the present invention comprises the following steps:
Step 1: convert an input image into a pixel matrix through a convolution operation;
Step 2: convert the same pixel matrix into multiple feature maps carrying multi-receptive-field context semantic information, using dilated convolutions with different dilation rates;
the output of the dilation convolution is:
Figure BDA0003375049520000051
wherein, yiRepresents the ith output of the expansion convolution, the convolution kernel of the expansion convolution has the size of k x k, the expansion rate is r, xiFor the ith input signature mapping of the dilated convolution before the converter subnetwork, m is the filter matrix w k with convolution kernel size k x k]Length of (d).
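As a minimal illustration (not the patent's implementation), the one-dimensional form of this formula can be sketched in NumPy; the input signal, filter taps, and dilation rates below are toy assumptions:

```python
import numpy as np

def dilated_conv1d(x, w, r):
    """Dilated 1-D convolution: y[i] = sum_j w[j] * x[i + r*j]."""
    m = len(w)                       # filter length m
    span = r * (m - 1) + 1           # pixels the dilated kernel spans
    n_out = len(x) - span + 1        # keep only valid output positions
    return np.array([sum(w[j] * x[i + r * j] for j in range(m))
                     for i in range(n_out)])

x = np.arange(10, dtype=float)       # toy input feature row: 0, 1, ..., 9
w = np.array([1.0, 1.0, 1.0])        # m = 3 taps
y1 = dilated_conv1d(x, w, r=1)       # ordinary convolution, spans 3 pixels
y2 = dilated_conv1d(x, w, r=2)       # dilation 2, spans 5 pixels
```

With the same three-tap filter, rate 2 sums every second input pixel, so the first output grows from x[0]+x[1]+x[2] = 3 to x[0]+x[2]+x[4] = 6 while the number of valid outputs shrinks.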
For example, in a target segmentation neural network, prediction generally has to be performed on the last feature map, and how many pixels of the original image can be mapped to one point on that feature map, i.e., the receptive field, determines the upper limit of the size the network can detect. Downsampling guarantees the extraction of image feature information, but it makes small targets hard to detect; yet every pixel feature in the image contributes to the segmentation result, and undetected objects degrade it. In neural networks, convolution kernels filter the context semantic information of pixels from the input image. If the parameters of a convolution kernel are fixed, the size of the extracted feature information is fixed and the obtained information is limited, which easily causes the model to focus only on segmentation targets of a certain class or size and thus biases its segmentation performance. In addition, the receptive field of a convolution kernel is limited and attends only to local features. To alleviate these problems of conventional convolution, the invention generates different receptive-field features from different kernel sizes: it adopts dilated convolution, whose dilation rate can be set, as the initialization convolution for multi-receptive-field feature acquisition, and uses multi-receptive-field convolutions to acquire feature information at different spatial scales from the same image, thereby providing richer features and better decision support for the segmentation model.
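The growth of the receptive field with the dilation rate, which motivates the multi-branch design above, follows the standard formula k + (k-1)(r-1). A small sketch (the rates 1, 2, 4 are illustrative assumptions, not values fixed by the patent):

```python
def effective_kernel(k, r):
    """Side length of the region a k x k kernel covers at dilation rate r."""
    return k + (k - 1) * (r - 1)

rates = [1, 2, 4]                              # one illustrative rate per branch
fields = {r: effective_kernel(3, r) for r in rates}
# the same 3x3 kernel covers 3, 5, and 9 pixels per side at rates 1, 2, 4,
# so each subnetwork branch sees the image at a different spatial scale
```

The parameter count stays that of a 3×3 kernel in every branch; only the spacing of the taps changes, which is why multiple receptive fields come essentially for free.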
Step 3: perform feature extraction and downsampling on the feature maps carrying multi-receptive-field context semantic information through transformer encoders in different subnetworks, obtaining several downsampled feature maps with different receptive fields;
the converter comprises a multi-head self-attention mechanism model and a multilayer perceptron model, and the input of the multi-head self-attention mechanism model and the input of the multilayer perceptron model are normalized;
the output of the multi-head self-attention mechanism model is as follows:
Yout=concat[y1,y2,...yi...yh-1]
yi=Attentation(qWi q,jWi j,dWi d)
wherein, Yout∈Rq*j*d,concat[]Denotes a join operation, i ∈ [1, h-1 ]]H is the self-attention block number, each block has its own set of learnable weight matrices (W)i q,Wi j,Wi d) W is the weight matrix of the projection, W is the Rq*j*dQ, j, d are the three-dimensional dimensions of the first feature map;
in the task of image analysis, the goal of the single-head self-attention mechanism is to extract the interaction relation among all pixels by performing global context feature coding on each pixel. However, the single-headed self-attention mechanism is limited in that it focuses only on one specific location. In the present invention, a multi-head attention Mechanism (MHA) is used as a component of the proposed converter. Multiple-head attention is one of the attention mechanisms, and multiple independent parallel attentions can be simultaneously performed on different important positions. In particular, in a multi-headed attention mechanism, different randomly initialized mapping matrices can map input vectors to different subspaces, which helps the model analyze the input sequence from different angles. In addition, the multi-head attention mechanism can connect multiple self-attentions, and the total computational consumption is reduced by reducing dimensionality.
A reshaping layer is arranged at the output of the multilayer perceptron model; the reshaping layer is a reshape layer used to change the dimensions of the input data.
So that the output feature vector of the global branch has the same dimension as the output feature vector of the bottleneck layer in the local branch, the invention adds a reshape layer behind the multilayer perceptron model; it changes the dimensions of the input data while keeping the content unchanged.
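The transformer block described above (normalized inputs to the self-attention and perceptron sublayers, followed by the reshape layer) can be wired up as below; the zero-valued stand-in sublayers and the 4×4×8 map size are assumptions used only to check the wiring, not the patent's configuration:

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    mu = x.mean(-1, keepdims=True)
    sd = x.std(-1, keepdims=True)
    return (x - mu) / (sd + eps)

def transformer_block(tokens, attn, mlp):
    """Pre-norm block: inputs to both sublayers are normalized first."""
    x = tokens + attn(layer_norm(tokens))   # multi-head self-attention sublayer
    x = x + mlp(layer_norm(x))              # multilayer-perceptron sublayer
    return x

h = w = 4
tokens = np.random.default_rng(1).standard_normal((h * w, 8))
zero_sublayer = lambda z: np.zeros_like(z)  # no-op stand-ins for the sublayers
out = transformer_block(tokens, zero_sublayer, zero_sublayer)
fmap = out.reshape(h, w, 8)                 # reshape layer: token rows -> feature map
```

The final reshape changes only the layout, from a (16, 8) token matrix to a (4, 4, 8) image-like map, leaving every value untouched, which is exactly the "dimension changed, content unchanged" behavior described above.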
Step 4: upsample the downsampled feature maps step by step through a decoder to obtain feature maps of the same size and dimension, and generate the final feature-fusion map;
the method comprises the steps of feature fusion between an encoder and a decoder and decoder output feature fusion; the specific process of feature fusion between the encoder and the decoder comprises the following steps: fusing the characteristic diagrams between corresponding layers of the encoder and the decoder through skip-connection operation, and reducing gradient elimination and network degradation through skip-connection jumping connection; the specific process of the decoder output feature fusion comprises the following steps: the decoder performs layer-by-layer and continuous up-sampling on the feature maps from different encoders, outputs features with the same size and dimension, and then fuses the output feature maps through a concat () method.
Skip connections effectively reduce the problems of gradient vanishing and network degradation and make training easier: during backpropagation, deep gradients are transmitted back to the shallow layers more easily. Thanks to this structure, the number of layers of the neural network can be set more freely, feature reuse is improved, and the problem of feature loss during the encoder's downsampling is alleviated.
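A hedged sketch of one decoder step with a skip connection, followed by concat() fusion of the decoder outputs; the nearest-neighbour upsampling, additive skip fusion, and map sizes are illustrative assumptions, not details fixed by the patent:

```python
import numpy as np

def upsample2x(fmap):
    """Nearest-neighbour 2x upsampling of an (H, W, C) feature map."""
    return fmap.repeat(2, axis=0).repeat(2, axis=1)

def decode_step(deep, skip):
    """One decoder step: upsample deep features, add the encoder skip map."""
    return upsample2x(deep) + skip          # skip-connection fusion

deep = np.ones((4, 4, 3))                   # low-resolution decoder features
skip = np.full((8, 8, 3), 0.5)              # matching encoder features
fused = decode_step(deep, skip)             # (8, 8, 3), every value 1.5

# decoder-output fusion: concatenate same-sized maps from each subnetwork
branch_a, branch_b = fused, fused
fusion_map = np.concatenate([branch_a, branch_b], axis=-1)   # (8, 8, 6)
```

The skip path adds the encoder map at matching resolution (recovering detail lost in downsampling), while the final concatenation stacks the subnetwork outputs along the channel axis to form the feature-fusion map.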
Step 5: complete the image segmentation prediction from the feature-fusion map through a convolutional neural network;
The convolutional neural network uses a softmax activation function to complete the image segmentation prediction.
To verify the validity of the method, the invention is evaluated on the object segmentation task of the PASCAL VOC 2012 data set, whose main purpose is to establish a challenge for the recognition of visual objects in real scenes. The PASCAL VOC 2012 data set contains 20 object classes and 1 background class; 1464 images are used for training, 1449 for validation, and 1456 for testing, and the segmentation objects in each image of the data set are labeled. Some of the test results are shown in Table 1 (mIoU is the mean intersection over union, i.e., the average segmentation accuracy).
TABLE 1. PASCAL VOC 2012 data set test results

Model           bird   bottle  bus    car    chair  cow    mbike  plant  mIoU (%)
DSNA            76.8   53.9    80.6   67.7   21.0   70.2   74.8   45.2   60.1
SRA-CFM         69.5   65.6    81.0   69.2   30.0   68.7   71.7   50.4   61.8
SegNet          78.0   61.2    82.5   77.5   29.9   68.7   74.0   56.3   66.7
DGCM            79.8   69.8    86.7   81.2   30.8   70.5   78.0   56.7   69.8
GCRF            83.3   71.8    89.0   82.7   31.1   79.5   80.5   58.9   73.2
DPN             78.4   72.3    89.3   83.5   31.7   79.9   79.8   59.5   74.1
The invention   85.9   76.2    90.2   84.1   33.8   78.9   81.6   58.6   74.7
In addition, experiments were performed on the Cityscapes data set to further verify the generalization of the method. Cityscapes is a data set for semantic urban scene understanding with 5000 images taken in 50 urban environments and high-quality pixel-level labels for 19 semantic classes; the training set contains 2975 images, the validation set 500 images, and the test set 1525 images. Some of the experimental results are shown in Table 2.
TABLE 2. Cityscapes data set test results

Model           mIoU (%)
ICNet           67.7
BisNet          69.0
LRR             70.0
DeepLab         71.4
The invention   72.3
The test results show that the method effectively improves object segmentation accuracy. Compared with traditional single-scale segmentation methods, it has marked advantages in accuracy and training efficiency. Architecturally, it is a simple and extensible framework: the multi-receptive-field spatial features and the transformer encoders can be added or removed according to the actual situation. Compared with other semantic segmentation methods, the introduction of multiple receptive fields and the transformer improves the generalization of the model.
As will be appreciated by one skilled in the art, embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
The present application is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the application. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
The foregoing descriptions of specific exemplary embodiments of the present invention have been presented for purposes of illustration and description. It is not intended to limit the invention to the precise form disclosed, and obviously many modifications and variations are possible in light of the above teaching. The exemplary embodiments were chosen and described in order to explain certain principles of the invention and its practical application to enable one skilled in the art to make and use various exemplary embodiments of the invention and various alternatives and modifications as are suited to the particular use contemplated. It is intended that the scope of the invention be defined by the claims and their equivalents.

Claims (9)

1. An image semantic segmentation method based on multi-receptive field context semantic information is characterized by comprising the following steps:
Step 1: convert an input image into a pixel matrix through a convolution operation;
Step 2: convert the same pixel matrix into multiple feature maps carrying multi-receptive-field context semantic information, using dilated convolutions with different dilation rates;
Step 3: perform feature extraction and downsampling on the feature maps through transformer encoders in different subnetworks, obtaining several downsampled feature maps with different receptive fields;
Step 4: upsample the downsampled feature maps step by step through a decoder to obtain feature maps of the same size and dimension, and generate the final feature-fusion map;
Step 5: complete the image segmentation prediction from the feature-fusion map through a convolutional neural network.
2. The image semantic segmentation method based on multi-receptive-field context semantic information according to claim 1, wherein the output of the dilated convolution in step 2 is:

y_i = Σ_{j=0}^{m-1} w[j] · x_{i + r·j}

where y_i denotes the i-th output of the dilated convolution, the dilated-convolution kernel has size k × k with dilation rate r, x_i is the i-th input feature map of the dilated convolution placed before the transformer subnetwork, and m is the length of the filter matrix w[k] with kernel size k × k.
3. The image semantic segmentation method based on multi-receptive-field context semantic information according to claim 1, wherein the transformer in step 3 comprises a multi-head self-attention model and a multilayer perceptron model, and the inputs of both models are normalized.
4. The image semantic segmentation method based on multi-receptive-field context semantic information according to claim 3, wherein the output of the multi-head self-attention model is:

Y_out = concat[y_1, y_2, …, y_i, …, y_{h-1}]
y_i = Attention(qW_i^q, jW_i^j, dW_i^d)

where Y_out ∈ R^{q×j×d}, concat[·] denotes the concatenation operation, i ∈ [1, h-1], h is the number of self-attention heads, each head has its own set of learnable weight matrices (W_i^q, W_i^j, W_i^d), W ∈ R^{q×j×d} is the projection weight matrix, and q, j, d are the three dimensions of the first feature map.
5. The image semantic segmentation method based on multi-receptive-field context semantic information according to claim 3, wherein a reshaping layer is arranged at the output of the multilayer perceptron model, the reshaping layer being a reshape layer used to change the dimensions of the input data.
6. The image semantic segmentation method based on multi-receptive-field context semantic information according to claim 1, wherein the step-by-step upsampling performed by the decoder in step 4 to obtain feature maps of the same size and dimension specifically comprises feature fusion between the encoder and the decoder, and fusion of the decoder outputs.
7. The image semantic segmentation method based on multi-receptive-field context semantic information according to claim 6, wherein the feature fusion between the encoder and the decoder specifically comprises: fusing the feature maps between corresponding layers of the encoder and the decoder through a skip-connection operation, the skip connections reducing gradient vanishing and network degradation.
8. The image semantic segmentation method based on multi-receptive-field context semantic information according to claim 6, wherein the fusion of the decoder outputs specifically comprises: the decoder progressively upsamples the feature maps from the different encoders layer by layer, outputs features of the same size and dimension, and then fuses the output feature maps through a concat() operation.
9. The image semantic segmentation method based on multi-receptive-field context semantic information according to claim 1, wherein in step 5 the convolutional neural network uses a softmax activation function to complete the image segmentation prediction.
CN202111413182.1A 2021-11-25 2021-11-25 Image semantic segmentation method based on multi-receptive-field context semantic information Pending CN114359554A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111413182.1A CN114359554A (en) 2021-11-25 2021-11-25 Image semantic segmentation method based on multi-receptive-field context semantic information

Publications (1)

Publication Number Publication Date
CN114359554A true CN114359554A (en) 2022-04-15

Family

ID=81096445

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111413182.1A Pending CN114359554A (en) 2021-11-25 2021-11-25 Image semantic segmentation method based on multi-receptive-field context semantic information

Country Status (1)

Country Link
CN (1) CN114359554A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116003126A (en) * 2023-03-22 2023-04-25 珠海和泽科技有限公司 Preparation method and system of electrostatic chuck surface ceramic material
WO2023236576A1 (en) * 2022-06-06 2023-12-14 京东科技控股股份有限公司 Image feature processing method and apparatus, product, medium, and device

Citations (2)

Publication number Priority date Publication date Assignee Title
CN112287940A (en) * 2020-10-30 2021-01-29 西安工程大学 Semantic segmentation method of attention mechanism based on deep learning
CN113688813A (en) * 2021-10-27 2021-11-23 长沙理工大学 Multi-scale feature fusion remote sensing image segmentation method, device, equipment and storage


Non-Patent Citations (1)

Title
LIANGLIANG LIU ET AL.: "Multi-Receptive-Field CNN for Semantic Segmentation of Medical Images", 《IEEE JOURNAL OF BIOMEDICAL AND HEALTH INFORMATICS》, vol. 24, no. 11, pages 3215 - 3225, XP011818004, DOI: 10.1109/JBHI.2020.3016306 *


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination