CN112580649B - Semantic segmentation method based on regional context relation module - Google Patents

Semantic segmentation method based on regional context relation module

Info

Publication number
CN112580649B
CN112580649B (application CN202011478891.3A)
Authority
CN
China
Prior art keywords
region
feature
module
model
image
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202011478891.3A
Other languages
Chinese (zh)
Other versions
CN112580649A (en)
Inventor
刘明皓
杜江
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Chongqing University of Post and Telecommunications
Original Assignee
Chongqing University of Post and Telecommunications
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Chongqing University of Post and Telecommunications filed Critical Chongqing University of Post and Telecommunications
Priority to CN202011478891.3A priority Critical patent/CN112580649B/en
Publication of CN112580649A publication Critical patent/CN112580649A/en
Application granted granted Critical
Publication of CN112580649B publication Critical patent/CN112580649B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/20 Image preprocessing
    • G06V10/26 Segmentation of patterns in the image field; Cutting or merging of image elements to establish the pattern region, e.g. clustering-based techniques; Detection of occlusion
    • G06V10/267 Segmentation of patterns in the image field by performing operations on regions, e.g. growing, shrinking or watersheds
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G06F18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00 Image analysis
    • G06T7/10 Segmentation; Edge detection
    • G06T7/11 Region-based segmentation
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00 Indexing scheme for image analysis or image enhancement
    • G06T2207/10 Image acquisition modality
    • G06T2207/10004 Still image; Photographic image
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00 Indexing scheme for image analysis or image enhancement
    • G06T2207/20 Special algorithmic details
    • G06T2207/20081 Training; Learning
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00 Indexing scheme for image analysis or image enhancement
    • G06T2207/20 Special algorithmic details
    • G06T2207/20084 Artificial neural networks [ANN]

Abstract

The invention relates to a semantic segmentation method based on a regional context relation module, and belongs to the field of remote sensing image processing. The method comprises the following steps: S1: enhancing the remote sensing image; S2: constructing an RC-Module; S3: establishing a remote sensing image semantic segmentation model, RC-Net, based on the RC-Module; S4: MIOU test and evaluation. The RC-Module is a derivative of the attention mechanism within the semantic segmentation model: by learning the context relation around each region, it guides the model to learn the adjacency relations among regions, increases the amount of information available for classification from a statistical point of view, and thereby improves the classification precision of semantic segmentation. Meanwhile, the RC-Module is a plug-and-play module that can be combined with any existing semantic segmentation model to improve its precision.

Description

Semantic segmentation method based on regional context relation module
Technical Field
The invention belongs to the field of remote sensing image processing, and relates to a semantic segmentation method based on a regional context module.
Background
Semantic segmentation is a dense prediction task that performs pixel-level prediction for every pixel of an image. In the remote sensing field, where images offer high resolution and are readily obtainable, effective semantic segmentation models are urgently needed, and the attention mechanism can guide the learning process of a segmentation model in a targeted way so that the model learns a more precise feature representation of the image. Long et al. first applied the fully convolutional technique to semantic segmentation in 2015 and greatly improved the accuracy of the whole field; their practice of discarding the fully connected layer has been studied and adopted by later models ever since. Around 2015, U-Net, FCN, Seg-Net, Deconv-Net, DeeplabV1 and Parse-Net appeared in rapid succession, continuously broadening the applicability of semantic segmentation; many new techniques emerged during this period, such as the skip connections of Olaf et al., the unpooling of Vijay et al., the deconvolution of Hyeonwoo et al., and the atrous (dilated) convolution of the Deeplab series, which became the most widely adopted of these. In 2017, the authors of Non-Local successfully applied the idea of the attention mechanism from NLP to the field of semantic segmentation. The appearance of the attention mechanism provided a new line of research for semantic segmentation models, and many models have since been derived on its basis.
When a common semantic segmentation model performs prediction on an image, the boundaries of predicted regions are often blurred, disordered, or even misclassified, because the model does not learn the context relations between regions at the stage of learning and extracting image features; the attention mechanism can effectively address this problem. Many segmentation models in this field are derived from the attention mechanism: for example, CCNet designs a criss-cross attention module that emphasizes features along criss-cross paths, guiding the model to learn long-range feature relations to a certain extent; OCRNet designs an object context module so that the model learns an object-enhanced feature map. However, these models still do not learn the context relations between regions, so disordered and wrong classifications at region boundaries remain; a module that guides the model to learn regional context relations can effectively resolve this situation.
The present method enhances the segmentation precision of the model by taking the context relations among regions into account. Using the attention mechanism, it combines feature enhancement with attention-guided model learning and proposes a remote sensing image semantic segmentation method based on a region context relation module, which improves model precision to a certain extent and lets the model classify the boundaries between regions more accurately. The attention design trains quickly and occupies little extra memory. The RC-Module is plug-and-play: it can be combined with any semantic segmentation model as a regional context feature enhancement module, thereby improving model accuracy. From the viewpoint of information flow, Hengshuang Zhao designed a point-wise spatial attention module that considers the context relations between individual points, guiding the model to learn the influence relations among all pixels in an image; however, it does not consider the integrity of regions, treats each pixel independently, and easily produces salt-and-pepper prediction results. OCR-Net enhances object features so that the model learns object characteristics in a guided way, taking the features within an object's range as the object feature; this alleviates the salt-and-pepper effect but does not consider the context relations among regions. The remote sensing image semantic segmentation method based on the RC-Module performs semantic segmentation on remote sensing images and effectively learns the regional context characteristics of the image.
Disclosure of Invention
The invention provides a semantic segmentation method based on a region context relation module, comprising the following steps:
s1: enhancing the remote sensing image;
s2: constructing an RC-Module;
s3: establishing a remote sensing image semantic segmentation model RC-Net based on the RC-Module;
s4: MIOU test and evaluation.
Optionally, the S1 specifically includes:
s11: randomly crop the pictures to generate an additional data set equal in size to the original data set, add it to the original data set, and train the model on the combined set;
s12: select an image enhancement mode according to the characteristics of each category in the data set; during color jitter, if the image contains grassland, bare land, or other objects highly sensitive to color, reduce the color jitter range to 0.01; set the saturation, hue, and contrast jitter ranges of the image to 0.2 each; generate the same number of images as in S11 to replace the original data set;
s13: randomly flip the data set horizontally and vertically to generate the same number of images as in S12;
s14: randomly rotate the data set within a limited rotation range of 30 degrees to generate the same number of images as in S13;
s15: add Gaussian noise and salt-and-pepper noise to each image from S14 (a sketch of this five-step pipeline follows).
Optionally, the S2 specifically includes:
A semantic segmentation framework contains a feature extractor, i.e., a backbone composed of a series of convolution and pooling operations. The image passes through the backbone and its features are aggregated into a feature map P. The first step of the region context module is to generate a coarse region map R_soft on the basis of the feature P, calculated as:
r_i = f_i(x), i ∈ (0, K); R_soft = {r_0, r_1, …, r_(K-1)}
where x represents the original image, K represents the number of classes, f represents a convolution operation, and r represents the coarse region feature of the corresponding class.
On the basis of R_soft, the RC-Module uses the theory of the self-attention mechanism to design an autocorrelation module that computes the inter-region correlation W_ij:
W_ij = exp(r_i · r_j) / Σ_(k=0)^(K-1) exp(r_i · r_k)
where w_ij represents the influence factor of the j-th region on the i-th region.
At the same time, the pixel features P and the coarse region map R_soft are integrated to obtain the feature of each region, feature_soft-region:
(feature_soft-region)_i = unsqueeze(-1)(R_T(R_soft) · R_T'(P))_i, i ∈ (0, K)
where unsqueeze denotes adding a new dimension at a specified position, and R_T is an abbreviation for reshape and transpose; feature_soft-region is an N × C × K × 1 feature map, where N is the number of pictures, C the number of feature channels, and K the number of regions.
Taking the inter-region correlation W_ij as weights, the original coarse region feature_soft-region undergoes feature enhancement of regional relevance to obtain the region feature feature_R with enhanced regional context:
feature_R = W · feature_soft-region
The RC-Module thus designs a region context learning module following the idea of the attention mechanism. Finally, the region features with enhanced regional context are combined with the pixel-level features to form the integrated feature feature_region:
feature_region = R_T_1(R_T_2(P) · R_T(feature_R))
Finally, the most common skip-connection method links the integrated feature with the pixel features, yielding the enhanced feature F output by the RC-Module, so the final region context module is computed as:
F = cat(feature_region || P)
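For concreteness, the following is a minimal PyTorch sketch of an RC-Module built from these formulas; the class name RCModule and all variable names are illustrative assumptions, and the softmax-normalized dot product stands in for the self-attention correlation W_ij described above. The unsqueeze to N × C × K × 1 is folded into the batched matrix products.

```python
import torch
import torch.nn as nn

class RCModule(nn.Module):
    """Sketch of the region context relation module (names are illustrative)."""
    def __init__(self, channels: int, num_classes: int):
        super().__init__()
        self.region_conv = nn.Conv2d(channels, num_classes, kernel_size=1)

    def forward(self, P: torch.Tensor) -> torch.Tensor:
        N, C, H, W = P.shape
        # Step 1: coarse region map R_soft, one soft mask per class
        R = torch.softmax(self.region_conv(P), dim=1)          # N x K x H x W
        R_flat = R.view(N, -1, H * W)                          # N x K x HW
        P_flat = P.view(N, C, H * W)                           # N x C x HW
        # Step 2: per-region features feature_soft-region
        feat_region = torch.bmm(R_flat, P_flat.transpose(1, 2))  # N x K x C
        # Step 3: inter-region correlation W_ij via self-attention
        W = torch.softmax(
            torch.bmm(feat_region, feat_region.transpose(1, 2)), dim=-1)  # N x K x K
        # Step 4: region-context enhancement feature_R = W * feature_soft-region
        feat_R = torch.bmm(W, feat_region)                     # N x K x C
        # Step 5: redistribute region features back to pixel positions
        feat_pix = torch.bmm(feat_R.transpose(1, 2), R_flat)   # N x C x HW
        feat_pix = feat_pix.view(N, C, H, W)
        # Step 6: skip connection F = cat(feature_region || P)
        return torch.cat([feat_pix, P], dim=1)                 # N x 2C x H x W
```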
optionally, the S3 specifically includes:
DeeplabV3 is a multi-scale model that has been verified to work very well. Through its ASPP structure it preliminarily integrates the multi-scale features of an image using several different atrous (dilated) convolution rates; it also adopts the ParseNet approach and applies adaptive global pooling to obtain global information. The DeeplabV3 model is thus an effective model that considers both multiple scales and a certain amount of global context, so this method adopts DeeplabV3 as the feature extractor (backbone) of the model. The feature calculation formula of ASPP is:
Y_i = F_(d_i)(X), d_i ∈ D
where Y_i represents the output of the ASPP module, F represents the convolution operation performed with a given rate d_i, and D is the set of atrous rates; ASPP takes multi-scale information into account by gathering information at atrous rates of different sizes, and its commonly used D includes 1, 6, 12, and 18.
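As a sketch, an ASPP block along these lines can be written in PyTorch as follows; the channel sizes, the 3x3 kernel for the rate-1 branch, and the bilinear upsampling of the global pooling branch are illustrative assumptions rather than the patent's exact configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ASPP(nn.Module):
    """Sketch of ASPP with rates D = {1, 6, 12, 18} plus global pooling."""
    def __init__(self, c_in=2048, c_out=256, rates=(1, 6, 12, 18)):
        super().__init__()
        self.branches = nn.ModuleList([
            nn.Conv2d(c_in, c_out, 3, padding=r, dilation=r) for r in rates
        ])
        self.global_pool = nn.Sequential(          # ParseNet-style global branch
            nn.AdaptiveAvgPool2d(1), nn.Conv2d(c_in, c_out, 1))
        self.project = nn.Conv2d(c_out * (len(rates) + 1), c_out, 1)

    def forward(self, x):
        h, w = x.shape[-2:]
        ys = [b(x) for b in self.branches]         # Y_i = F_(d_i)(X)
        g = F.interpolate(self.global_pool(x), size=(h, w),
                          mode='bilinear', align_corners=False)
        return self.project(torch.cat(ys + [g], dim=1))
```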
After the image features are integrated by the feature extractor DeeplabV3, they are fed into the RC-Module for context integration of the features, and finally a Decoder produces the prediction result.
The Decoder is composed of two 3x3 depthwise separable convolutions and one ordinary 1x1 convolution; the properties of depthwise separable convolution reduce the computational complexity of an ordinary decoder. The parameter count of an ordinary convolutional layer is:
P = K^2 × C_in × C_out
where P is the total number of parameters, K is the convolution kernel size (a square kernel is used by default), and C_in and C_out are the input and output channel numbers of the feature map.
The parameter count of a depthwise separable convolution is:
P = K^2 × C_in + C_in × C_out
Clearly the parameter count of the depthwise separable convolution is greatly reduced, and the computational complexity also drops from the original O(C_in × C_out) to O(C_in + C_out); decoupling the image channels greatly reduces the computation of the model.
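A quick numeric check of the two formulas above (bias terms ignored), using the illustrative values K = 3 and C_in = C_out = 256:

```python
def conv_params(k, c_in, c_out):
    return k * k * c_in * c_out            # standard convolution: K^2 * C_in * C_out

def dsconv_params(k, c_in, c_out):
    return k * k * c_in + c_in * c_out     # depthwise 3x3 + pointwise 1x1

print(conv_params(3, 256, 256))    # 589824
print(dsconv_params(3, 256, 256))  # 2304 + 65536 = 67840, roughly 8.7x fewer
```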
Optionally, the S4 specifically includes:
The calculation formula of MIoU (Mean Intersection over Union) is as follows:
MIoU = (1 / (K + 1)) × Σ_(i=0)^(K) p_ii / (Σ_(j=0)^(K) p_ij + Σ_(j=0)^(K) p_ji - p_ii)
where p_ij represents the number of pixels whose true class is i but which are predicted as class j, and K + 1 is the number of classes (including the empty class); p_ii is the number of correctly predicted pixels; p_ij and p_ji correspond to false positives and false negatives, respectively.
MIoU extends the IoU, a measure of the similarity of two sets, to multiple categories. Because of the particularity of the semantic segmentation task, when pixel accuracy is used the FP and FN counts easily dominate the overall score, leading to a wrong estimate of model precision, whereas MIoU does not suffer from this; MIoU is the most widely applied evaluation metric in the semantic segmentation field, so in the MIOU evaluation step MIoU is used as the measure of precision.
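A minimal numpy sketch of this metric, assuming a confusion matrix p of shape (K+1) × (K+1) with p[i, j] counting the pixels of true class i predicted as class j; the example values are illustrative only.

```python
import numpy as np

def mean_iou(p: np.ndarray) -> float:
    tp = np.diag(p)                              # p_ii: correct pixels per class
    denom = p.sum(axis=1) + p.sum(axis=0) - tp   # row sum + column sum - p_ii
    return float(np.mean(tp / np.maximum(denom, 1)))

# Example: two real classes plus an empty class
conf = np.array([[50, 2, 0],
                 [3, 40, 1],
                 [0, 1, 3]])
print(mean_iou(conf))   # about 0.787
```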
The present method enhances the segmentation precision of the model by taking the context relations among regions into account. Using the attention mechanism, it combines feature enhancement with attention-guided model learning and proposes a remote sensing image semantic segmentation method based on a region context relation module, which improves model precision to a certain extent and lets the model classify the boundaries between regions more accurately. The attention design trains quickly and occupies little extra memory. The RC-Module is plug-and-play: it can be combined with any semantic segmentation model as a regional context feature enhancement module, thereby improving model accuracy. From the viewpoint of information flow, Hengshuang Zhao designed a point-wise spatial attention module that considers the context relations between individual points, guiding the model to learn the influence relations among all pixels in an image; however, it does not consider the integrity of regions, treats each pixel independently, and easily produces salt-and-pepper prediction results. OCR-Net enhances object features so that the model learns object characteristics in a guided way, taking the features within an object's range as the object feature; this alleviates the salt-and-pepper effect but does not consider the context relations among regions. The remote sensing image semantic segmentation method based on the RC-Module performs semantic segmentation on remote sensing images and effectively learns the regional context characteristics of the image.
Additional advantages, objects, and features of the invention will be set forth in part in the description which follows and in part will become apparent to those having ordinary skill in the art upon examination of the following or may be learned from practice of the invention. The objectives and other advantages of the invention will be realized and attained by the structure particularly pointed out in the written description.
Drawings
Fig. 1 is a schematic diagram of the present invention.
Detailed Description
The embodiments of the present invention are described below with reference to specific embodiments, and other advantages and effects of the present invention will be easily understood by those skilled in the art from the disclosure of the present specification. The invention is capable of other and different embodiments and of being practiced or of being carried out in various ways, and its several details are capable of modification in various respects, all without departing from the spirit and scope of the present invention. It should be noted that the drawings provided in the following embodiments are only for illustrating the basic idea of the present invention in a schematic way, and the features in the following embodiments and examples may be combined with each other without conflict.
Wherein the showings are for the purpose of illustration only and not for the purpose of limiting the invention, shown in the drawings are schematic representations and not in the form of actual drawings; to better illustrate the embodiments of the present invention, some parts of the drawings may be omitted, enlarged or reduced, and do not represent the size of an actual product; it will be understood by those skilled in the art that certain well-known structures in the drawings and descriptions thereof may be omitted.
The same or similar reference numerals in the drawings of the embodiments of the present invention correspond to the same or similar components; in the description of the present invention, it should be understood that if there is an orientation or positional relationship indicated by terms such as "upper", "lower", "left", "right", "front", "rear", etc., based on the orientation or positional relationship shown in the drawings, it is only for convenience of description and simplification of description, but it is not an indication or suggestion that the referred device or element must have a specific orientation, be constructed in a specific orientation, and be operated, and therefore, the terms describing the positional relationship in the drawings are only used for illustrative purposes, and are not to be construed as limiting the present invention, and the specific meaning of the terms may be understood by those skilled in the art according to specific situations.
Fig. 1 is a schematic diagram of the present invention.
1 technical process and method
1.1 remote sensing image enhancement technical process
The remote sensing image enhancement technique used by the invention consists of five steps; the specific execution flow is as follows.
The process of remote sensing image enhancement is: (1) randomly crop the pictures to generate an additional data set equal in size to the original data set, add it to the original data set, and train the model on the combined set; (2) select an image enhancement mode according to the characteristics of each category in the data set; during color jitter, if the image contains grassland, bare land, or other objects highly sensitive to color, reduce the color jitter range to 0.01, set the saturation, hue, and contrast jitter ranges to 0.2 each, and generate the same number of images as in step S11 to replace the original data set; (3) randomly flip the data set horizontally and vertically to generate the same number of images as in S12; (4) randomly rotate the data set within a limited rotation range of 30 degrees to generate the same number of images as in S13; (5) add Gaussian noise and salt-and-pepper noise to each image from S14.
1.2 technical process of RC-Module
The RC-Module provided by the invention is constructed on the basis of the self-attention mechanism and a correlation attention mechanism.
A semantic segmentation framework contains a feature extractor, i.e., a backbone composed of a series of convolution and pooling operations. The image passes through the backbone and its features are aggregated into a feature map P. The first step of the region context module is to generate a coarse region map R_soft on the basis of the feature P, calculated as:
r_i = f_i(x), i ∈ (0, K); R_soft = {r_0, r_1, …, r_(K-1)}
where x represents the original image, K represents the number of classes, f represents a convolution operation, and r represents the coarse region feature of the corresponding class.
On the basis of R_soft, the RC-Module uses the theory of the self-attention mechanism to design an autocorrelation module that computes the inter-region correlation W_ij:
W_ij = exp(r_i · r_j) / Σ_(k=0)^(K-1) exp(r_i · r_k)
where w_ij represents the influence factor of the j-th region on the i-th region.
At the same time, the pixel features P and the coarse region map R_soft are integrated to obtain the feature of each region, feature_soft-region:
(feature_soft-region)_i = unsqueeze(-1)(R_T(R_soft) · R_T'(P))_i, i ∈ (0, K)
where unsqueeze denotes adding a new dimension at a specified position, and R_T is an abbreviation for reshape and transpose; feature_soft-region is an N × C × K × 1 feature map, where N is the number of pictures, C the number of feature channels, and K the number of regions.
Taking the inter-region correlation W_ij as weights, the original coarse region feature_soft-region undergoes feature enhancement of regional relevance to obtain the region feature feature_R with enhanced regional context:
feature_R = W · feature_soft-region
The RC-Module thus designs a region context learning module following the idea of the attention mechanism. Finally, the region features with enhanced regional context are combined with the pixel-level features to form the integrated feature feature_region:
feature_region = R_T_1(R_T_2(P) · R_T(feature_R))
Finally, the most common skip-connection method links the integrated feature with the pixel features, yielding the enhanced feature F output by the RC-Module, so the final region context module is computed as:
F = cat(feature_region || P)
1.3 technical Process of RC-Net
DeeplabV3 is a multi-scale model that has been verified to work very well. Through its ASPP structure it preliminarily integrates the multi-scale features of an image using several different atrous (dilated) convolution rates; it also adopts the ParseNet approach and applies adaptive global pooling to obtain global information. The DeeplabV3 model is thus an effective model that considers both multiple scales and a certain amount of global context, so this method adopts DeeplabV3 as the feature extractor (backbone) of the model. The feature calculation formula of ASPP is:
Y_i = F_(d_i)(X), d_i ∈ D
where Y_i represents the output of the ASPP module, F represents the convolution operation performed with a given rate d_i, and D is the set of atrous rates; ASPP takes multi-scale information into account by gathering information at atrous rates of different sizes, and its commonly used D includes 1, 6, 12, and 18.
After the image features are integrated by the feature extractor DeeplabV3, they are fed into the RC-Module for context integration of the features, and finally a Decoder produces the prediction result.
The Decoder is composed of two 3x3 depthwise separable convolutions and one ordinary 1x1 convolution; the properties of depthwise separable convolution reduce the computational complexity of an ordinary decoder. The parameter count of an ordinary convolutional layer is:
P = K^2 × C_in × C_out
where P is the total number of parameters, K is the convolution kernel size (a square kernel is used by default), and C_in and C_out are the input and output channel numbers of the feature map.
The parameter count of a depthwise separable convolution is:
P = K^2 × C_in + C_in × C_out
Clearly the parameter count of the depthwise separable convolution is greatly reduced, and the computational complexity also drops from the original O(C_in × C_out) to O(C_in + C_out); decoupling the image channels greatly reduces the computation of the model.
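A minimal PyTorch sketch of such a Decoder, with two 3x3 depthwise separable convolutions followed by a 1x1 prediction convolution; the channel widths, the normalization and activation layers, and the class count are illustrative assumptions.

```python
import torch.nn as nn

def depthwise_separable(c_in, c_out):
    """3x3 depthwise convolution followed by a 1x1 pointwise convolution."""
    return nn.Sequential(
        nn.Conv2d(c_in, c_in, 3, padding=1, groups=c_in),  # depthwise 3x3
        nn.Conv2d(c_in, c_out, 1),                          # pointwise 1x1
        nn.BatchNorm2d(c_out),
        nn.ReLU(inplace=True))

def make_decoder(c_in=512, c_mid=256, num_classes=6):
    return nn.Sequential(
        depthwise_separable(c_in, c_mid),
        depthwise_separable(c_mid, c_mid),
        nn.Conv2d(c_mid, num_classes, 1))                   # 1x1 prediction conv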
1.4 MIOU test and evaluation
The calculation formula of MIoU (Mean Intersection over Union) is as follows:
MIoU = (1 / (K + 1)) × Σ_(i=0)^(K) p_ii / (Σ_(j=0)^(K) p_ij + Σ_(j=0)^(K) p_ji - p_ii)
where p_ij represents the number of pixels whose true class is i but which are predicted as class j, and K + 1 is the number of classes (including the empty class); p_ii is the number of correctly predicted pixels; p_ij and p_ji correspond to false positives and false negatives, respectively.
MIoU extends the IoU, a measure of the similarity of two sets, to multiple categories. Because of the particularity of the semantic segmentation task, when pixel accuracy is used the FP and FN counts easily dominate the overall score, leading to a wrong estimate of model precision, whereas MIoU does not suffer from this; MIoU is the most widely applied evaluation metric in the semantic segmentation field, so in the MIOU evaluation step MIoU is used as the measure of precision.
2 summary of the invention
The invention provides a plug-and-play RC-Module and, on the basis of DeeplabV3, designs RC-Net for learning the context relations of regions. RC-Net is a derivative of the attention mechanism: by separately designing region features and enhancing region correlation features, it combines the enhanced features for the final semantic segmentation. RC-Net is another extension and development of the attention mechanism in the semantic segmentation field, and it is believed that it can be widely used not only for remote sensing images but also in other fields in future research.

Claims (1)

1. A semantic segmentation method based on a region context relation module, characterized in that the method comprises the following steps:
s1: enhancing the remote sensing image;
s2: constructing an RC-Module;
s3: establishing a remote sensing image semantic segmentation model RC-Net based on RC-Module;
s4: MIOU test and evaluation;
the S1 specifically includes:
s11: randomly cropping the pictures to generate an additional data set equal in size to the original data set, adding it to the original data set, and training the model on the combined set;
s12: selecting an image enhancement mode according to the characteristics of each category in the data set; during color jitter, if the image contains grassland, bare land, or other objects highly sensitive to color, reducing the color jitter range to 0.01; setting the saturation, hue, and contrast jitter ranges of the image to 0.2 each; and generating the same number of images as in S11 to replace the original data set;
s13: randomly flipping the data set horizontally and vertically to generate the same number of images as in S12;
s14: randomly rotating the data set within a limited rotation range of 30 degrees to generate the same number of images as in S13;
s15: adding Gaussian noise and salt-and-pepper noise to each image from S14;
the S2 specifically includes:
a semantic segmentation framework contains a feature extractor, i.e., a backbone composed of a series of convolution and pooling operations; the image passes through the backbone and its features are aggregated into a feature map P; the first step of the region context module is to generate a coarse region map R_soft on the basis of the feature P, calculated as:
r_i = f_i(x), i ∈ (0, K); R_soft = {r_0, r_1, …, r_(K-1)}
wherein x represents the original image, K represents the number of categories, f represents a convolution operation, and r represents the coarse region feature of the corresponding category;
on the basis of R_soft, the RC-Module uses the theory of the self-attention mechanism to design an autocorrelation module for calculating the inter-region correlation W_ij:
W_ij = exp(r_i · r_j) / Σ_(k=0)^(K-1) exp(r_i · r_k)
wherein w_ij represents the influence factor of the j-th region on the i-th region;
at the same time, the pixel features P and the coarse region map R_soft are integrated to obtain the feature of each region, feature_soft-region:
(feature_soft-region)_i = unsqueeze(-1)(R_T(R_soft) · R_T(P))_i, i ∈ (0, K)
wherein unsqueeze represents adding a new dimension at a specified position, and R_T is an abbreviation for reshape and transpose; feature_soft-region is an N × C × K × 1 feature map, wherein N represents the number of pictures, C represents the number of feature channels, and K represents the number of regions;
taking the inter-region correlation W_ij as weights, the original coarse region feature_soft-region is subjected to feature enhancement of regional relevance to obtain the region feature feature_R with enhanced regional context features:
feature_R = W · feature_soft-region
the RC-Module designs a region context learning module using the idea of the attention mechanism, and the region features with enhanced regional context features are combined with the pixel-level features to form the integrated feature feature_region:
feature_region = R_T_1(R_T_2(P) · R_T(feature_R))
the integrated feature is linked with the pixel features by a skip-connection method to obtain the enhanced feature F output by the RC-Module, so the final region context module is calculated as:
F = cat(feature_region || P)
the S3 specifically includes:
the DeeplabV3 model is a multi-scale model; through the ASPP structure, multi-scale features of the image are preliminarily integrated using several different atrous convolution rates; the ParseNet method is adopted, and adaptive global pooling is used globally to obtain global information; the DeeplabV3 model is an effective model that considers both multiple scales and a certain global context relation, so DeeplabV3 is adopted as the feature extractor (backbone) of the model, wherein the feature calculation formula of ASPP is:
Y_i = F_(d_i)(X), d_i ∈ D
wherein Y_i represents the output of the ASPP module, F represents the convolution operation performed with a given rate d_i, D is the set of atrous rates, ASPP takes multi-scale information into account by gathering information at atrous rates of different sizes, and D is 1, 6, 12, and 18;
after the image features are integrated by the feature extractor DeeplabV3, they are fed into the RC-Module for context integration of the features, and finally a Decoder produces the prediction result;
the Decoder is composed of two 3x3 depthwise separable convolutions and one ordinary 1x1 convolution, and the properties of depthwise separable convolution reduce the computational complexity of an ordinary decoder; the parameters of the ordinary convolutional layer are calculated as follows:
P = K^2 × C_in × C_out
wherein P represents the total number of parameters, K represents the convolution kernel size, a square convolution kernel is used, and C_in and C_out represent the input and output channel numbers of the feature map;
the parameter calculation formula of the depthwise separable convolution is as follows:
P = K^2 × C_in + C_in × C_out
the S4 specifically includes:
the formula for calculating the mean intersection over union MIoU is as follows:
MIoU = (1 / (K + 1)) × Σ_(i=0)^(K) p_ii / (Σ_(j=0)^(K) p_ij + Σ_(j=0)^(K) p_ji - p_ii)
wherein p_ij represents the number of pixels whose true class is i but which are predicted as class j, and K + 1 is the number of categories, including the empty category; p_ii is the number of correctly predicted pixels; p_ij and p_ji correspond to false positives and false negatives, respectively.
CN202011478891.3A 2020-12-15 2020-12-15 Semantic segmentation method based on regional context relation module Active CN112580649B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011478891.3A CN112580649B (en) 2020-12-15 2020-12-15 Semantic segmentation method based on regional context relation module

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011478891.3A CN112580649B (en) 2020-12-15 2020-12-15 Semantic segmentation method based on regional context relation module

Publications (2)

Publication Number Publication Date
CN112580649A CN112580649A (en) 2021-03-30
CN112580649B true CN112580649B (en) 2022-08-02

Family

ID=75135153

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011478891.3A Active CN112580649B (en) 2020-12-15 2020-12-15 Semantic segmentation method based on regional context relation module

Country Status (1)

Country Link
CN (1) CN112580649B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113380237A (en) * 2021-06-09 2021-09-10 中国科学技术大学 Unsupervised pre-training speech recognition model for enhancing local dependency relationship and training method

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108062756A (en) * 2018-01-29 2018-05-22 重庆理工大学 Image, semantic dividing method based on the full convolutional network of depth and condition random field
CN110197182A (en) * 2019-06-11 2019-09-03 中国电子科技集团公司第五十四研究所 Remote sensing image semantic segmentation method based on contextual information and attention mechanism

Family Cites Families (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108257099B (en) * 2018-01-11 2021-09-10 重庆邮电大学 Self-adaptive infrared image enhancement method based on visual contrast resolution
CN109447994B (en) * 2018-11-05 2019-12-17 陕西师范大学 Remote sensing image segmentation method combining complete residual error and feature fusion
US11179064B2 (en) * 2018-12-30 2021-11-23 Altum View Systems Inc. Method and system for privacy-preserving fall detection
CN110097544A (en) * 2019-04-25 2019-08-06 武汉精立电子技术有限公司 A kind of display panel open defect detection method
CN111563508B (en) * 2020-04-20 2023-05-23 华南理工大学 Semantic segmentation method based on spatial information fusion
CN111797779A (en) * 2020-07-08 2020-10-20 兰州交通大学 Remote sensing image semantic segmentation method based on regional attention multi-scale feature fusion
CN111932553B (en) * 2020-07-27 2022-09-06 北京航空航天大学 Remote sensing image semantic segmentation method based on area description self-attention mechanism

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108062756A (en) * 2018-01-29 2018-05-22 重庆理工大学 Image, semantic dividing method based on the full convolutional network of depth and condition random field
CN110197182A (en) * 2019-06-11 2019-09-03 中国电子科技集团公司第五十四研究所 Remote sensing image semantic segmentation method based on contextual information and attention mechanism

Also Published As

Publication number Publication date
CN112580649A (en) 2021-03-30

Similar Documents

Publication Publication Date Title
CN111310862B (en) Image enhancement-based deep neural network license plate positioning method in complex environment
CN108416250B (en) People counting method and device
CN111476292A (en) Small sample element learning training method for medical image classification processing artificial intelligence
WO2020114378A1 (en) Video watermark identification method and apparatus, device, and storage medium
CN106611420B (en) The SAR image segmentation method constrained based on deconvolution network and sketch map direction
CN112288011B (en) Image matching method based on self-attention deep neural network
CN109035300B (en) Target tracking method based on depth feature and average peak correlation energy
CN111652317B (en) Super-parameter image segmentation method based on Bayes deep learning
CN110569782A (en) Target detection method based on deep learning
CN109902584A (en) A kind of recognition methods, device, equipment and the storage medium of mask defect
CN109165658B (en) Strong negative sample underwater target detection method based on fast-RCNN
CN112116593A (en) Domain self-adaptive semantic segmentation method based on Gini index
CN109635653A (en) A kind of plants identification method
CN110517270B (en) Indoor scene semantic segmentation method based on super-pixel depth network
CN111881731A (en) Behavior recognition method, system, device and medium based on human skeleton
CN109344851A (en) Image classification display methods and device, analysis instrument and storage medium
CN112580649B (en) Semantic segmentation method based on regional context relation module
Rout et al. Walsh–Hadamard-kernel-based features in particle filter framework for underwater object tracking
Zhang et al. Visual saliency based object tracking
CN113436251B (en) Pose estimation system and method based on improved YOLO6D algorithm
CN113421268B (en) Semantic segmentation method based on deplapv 3+ network of multi-level channel attention mechanism
CN111738237B (en) Heterogeneous convolution-based target detection method for multi-core iteration RPN
CN107423771B (en) Two-time-phase remote sensing image change detection method
CN111967399A (en) Improved fast RCNN behavior identification method
CN112241758A (en) Apparatus and method for evaluating a saliency map determiner

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant