CN112598675A - Indoor scene semantic segmentation method based on an improved fully convolutional neural network


Info

Publication number: CN112598675A
Application number: CN202011559942.5A
Authority: CN (China)
Prior art keywords: layer, convolution, block, attention, neural network
Legal status: Pending
Other languages: Chinese (zh)
Inventors: 周武杰, 岳雨纯, 雷景生, 强芳芳, 周扬, 邱薇薇, 何成, 王海江, 马骁, 郭翔
Assignee (current and original): Zhejiang Lover Health Science and Technology Development Co Ltd
Application filed by Zhejiang Lover Health Science and Technology Development Co Ltd; priority to CN202011559942.5A; publication of CN112598675A

Classifications

    • G06T 7/10 Image analysis; Segmentation; Edge detection
    • G06F 18/25 Pattern recognition; Fusion techniques
    • G06N 3/045 Neural networks; Combinations of networks
    • G06N 3/048 Neural networks; Activation functions
    • G06N 3/08 Neural networks; Learning methods
    • G06T 5/30 Image enhancement or restoration using local operators; Erosion or dilatation, e.g. thinning
    • G06T 2207/10024 Image acquisition modality; Color image
    • G06T 2207/20081 Training; Learning
    • G06T 2207/20084 Artificial neural networks [ANN]
    • G06T 2207/20221 Image fusion; Image merging

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Engineering & Computer Science (AREA)
  • General Health & Medical Sciences (AREA)
  • Software Systems (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Biophysics (AREA)
  • Biomedical Technology (AREA)
  • Mathematical Physics (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Evolutionary Biology (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses an indoor scene semantic segmentation method based on an improved fully convolutional neural network. First, a convolutional neural network is constructed whose hidden layer comprises 5 neural network blocks, 5 feature re-extraction convolution blocks, 5 block attention convolution blocks, 12 fusion layers and 4 upsampling layers. The original indoor scene images are input into the convolutional neural network for training to obtain the corresponding semantic segmentation prediction maps. The loss function value between the set formed by the semantic segmentation prediction maps corresponding to the original indoor scene images and the set formed by the one-hot encoded images obtained from the corresponding real semantic segmentation images is calculated to obtain the optimal weight vector and bias term of the convolutional neural network classification training model. Finally, the indoor scene image to be semantically segmented is input into the trained convolutional neural network classification training model to obtain the predicted semantic segmentation image. The method has the advantage of improving the efficiency and accuracy of semantic segmentation of indoor scene images.

Description

Indoor scene semantic segmentation method based on an improved fully convolutional neural network
Technical Field
The invention relates to a deep-learning-based semantic segmentation method, and in particular to an indoor scene semantic segmentation method based on an improved fully convolutional neural network.
Background
Image semantic segmentation is one of the most challenging tasks in computer vision and plays a key role in applications such as automatic driving, medical image analysis, virtual reality and human-computer interaction. The core purpose of semantic segmentation is to assign a category label to each pixel in an image, determining which category that pixel belongs to.
From the perspective of supervised learning, image semantic segmentation methods can be divided into fully supervised, semi-supervised and unsupervised types. Considering operability, theoretical applicability and similar factors, most current mainstream models are fully supervised and a small number are semi-supervised, which makes the models easier to implement and train.
In terms of model application, since the appearance and development of the fully convolutional neural network, its use has achieved excellent performance and segmentation results in image semantic segmentation tasks, but many defects and shortcomings remain, such as a large number of parameters, a large amount of redundant information, and insufficient feature extraction and expression. Therefore, image semantic segmentation models based on fully convolutional neural networks still have considerable room for improvement, and proposing and training models with superior performance, guided by the characteristics of the image, the structure of the model and the operating principle of the human visual system, is a current and future development goal of the image semantic segmentation field.
Disclosure of Invention
In order to solve the problems in the background art, the invention provides an indoor scene semantic segmentation method based on an improved fully convolutional neural network, which derives its model design direction and ideas from semantic feature expression, the operating principle of the human visual system and other aspects, improves the traditional fully convolutional neural network, and effectively improves image segmentation performance.
The technical scheme adopted by the invention comprises the following steps:
step 1: select Q pairs of original indoor scene images and corresponding real semantic segmentation images, and form a training set from all the original indoor scene images and the corresponding real semantic segmentation images; each pair of original indoor scene images comprises an original indoor scene color image and an original indoor scene depth image, and the real semantic segmentation images in the training set are processed into 41 one-hot encoded images using the one-hot encoding technique;
step 2: construct a convolutional neural network classification training model: the convolutional neural network classification training model comprises an input layer, a hidden layer and an output layer; the input layer comprises a color image input layer and a depth image input layer; the hidden layer comprises a color image processing module and a depth image processing module; the color image processing module and the depth image processing module are symmetrical in structure and each comprises five neural network blocks, five feature re-extraction convolution blocks and ten fusion layers; the hidden layer also comprises five block attention convolution blocks, four upsampling layers and two fusion layers;
step 3: input the training set into the convolutional neural network classification training model of step 2 for training; in each training iteration, 41 semantic segmentation prediction images corresponding to each pair of original indoor scene images are obtained, and the loss function value between the set formed by the 41 semantic segmentation prediction images and the set formed by the 41 one-hot encoded images of the corresponding real semantic segmentation image is calculated;
the loss function value is obtained using the categorical cross entropy (a training-loop sketch covering steps 3 to 5 is given after step 5);
step 4: repeat step 3 a total of V times to obtain Q × V loss function values; then find the minimum loss function value among the Q × V loss function values, and take the weight vector and bias term corresponding to the minimum loss function value as the optimal weight vector and optimal bias term of the convolutional neural network classification training model, thereby completing the training of the convolutional neural network classification training model;
step 5: use the trained convolutional neural network classification training model to perform prediction on an indoor scene image to be predicted, and output the corresponding predicted semantic segmentation image, thereby realizing indoor scene image semantic segmentation.
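For illustration, the following PyTorch-style sketch shows one possible realization of the training procedure in steps 3 to 5; the model interface, data loader, optimizer and learning rate are assumptions and are not prescribed by the invention.

    import torch
    import torch.nn as nn

    def train(model, train_loader, epochs, device="cuda"):
        """Hedged sketch of steps 3-4: categorical cross-entropy training that keeps
        the weights corresponding to the minimum loss function value."""
        model.to(device)
        optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)  # assumed optimizer and learning rate
        # nn.CrossEntropyLoss over class indices 0..40 is equivalent to the categorical
        # cross entropy over the 41 one-hot encoded label maps.
        criterion = nn.CrossEntropyLoss()
        best_loss, best_state = float("inf"), None
        for _ in range(epochs):                          # repeat step 3 a total of V times
            for rgb, depth, label in train_loader:       # label: (N, H, W) map of class ids
                rgb, depth, label = rgb.to(device), depth.to(device), label.to(device)
                pred = model(rgb, depth)                 # (N, 41, H, W) semantic segmentation prediction
                loss = criterion(pred, label)
                optimizer.zero_grad()
                loss.backward()
                optimizer.step()
                if loss.item() < best_loss:              # keep the weights with the minimum loss value
                    best_loss = loss.item()
                    best_state = {k: v.detach().clone() for k, v in model.state_dict().items()}
        model.load_state_dict(best_state)                # optimal weight vector and bias terms (step 4)
        return model

At test time (step 5), the trained model is applied to a new RGB-Depth pair and the class with the maximum score at each pixel yields the predicted semantic segmentation image.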
Step 2 is specifically as follows:
the color image input layer and the depth image input layer are respectively input into a first neural network block in the color image processing module and the depth image processing module;
the color image processing module and the depth image processing module have the same structure, and specifically comprise:
one output of the first neural network block is input into a first fusion layer through a first feature re-extraction convolution block, and the other output of the first neural network block is input into a first fusion layer; one output of the second neural network block is input into a third fusion layer through a second feature re-extraction convolution block, and the other output of the second neural network block is input into the third fusion layer; one output of the third neural network block is input into a fifth fusion layer through a third feature re-extraction convolution block, and the other output of the third neural network block is input into the fifth fusion layer; one output of the fourth neural network block is input into the seventh fusion layer through the fourth feature re-extraction convolution block, and the other output of the fourth neural network block is input into the seventh fusion layer; one output of the fifth neural network block is input into a ninth fusion layer through a fifth feature re-extraction convolution block, and the other output of the fifth neural network block is input into the ninth fusion layer; the two inputs of each fusion layer are fused in an element-by-element addition mode;
the output of the first fusion layer is respectively input into the first block attention convolution block and the corresponding second fusion layer, the output of the third fusion layer is respectively input into the second block attention convolution block and the corresponding fourth fusion layer, the output of the fifth fusion layer is respectively input into the third block attention convolution block and the corresponding sixth fusion layer, the output of the seventh fusion layer is respectively input into the fourth block attention convolution block and the corresponding eighth fusion layer, and the output of the ninth fusion layer is respectively input into the fifth block attention convolution block and the corresponding tenth fusion layer;
two outputs of the first block attention convolution block are respectively input into the second fusion layer of the color image processing module and the second fusion layer of the depth image processing module, two outputs of the second block attention convolution block are respectively input into the fourth fusion layer of the color image processing module and the fourth fusion layer of the depth image processing module, two outputs of the third block attention convolution block are respectively input into the sixth fusion layer of the color image processing module and the sixth fusion layer of the depth image processing module, two outputs of the fourth block attention convolution block are respectively input into the eighth fusion layer of the color image processing module and the eighth fusion layer of the depth image processing module, and two outputs of the fifth block attention convolution block are respectively input into the tenth fusion layer of the color image processing module and the tenth fusion layer of the depth image processing module;
the two inputs of the second fusion layer are fused in an element-by-element addition mode and then respectively input into an eleventh fusion layer and a corresponding second neural network block, the two inputs of the fourth fusion layer are fused in an element-by-element addition mode and then respectively input into a first up-sampling layer and a corresponding third neural network block, the two inputs of the sixth fusion layer are fused in an element-by-element addition mode and then respectively input into a second up-sampling layer and a corresponding fourth neural network block, and the two inputs of the eighth fusion layer are fused in an element-by-element addition mode and then respectively input into a third up-sampling layer and a corresponding fifth neural network block; the output of the tenth fusion layer is input into the fourth upsampling layer;
the two inputs of the eleventh fusion layer are fused in an element-by-element addition mode; the output of the eleventh fusion layer and the outputs of the first, second, third and fourth upsampling layers are all input into the twelfth fusion layer;
all the inputs of the twelfth fusion layer are connected in a concatenate (channel concatenation) mode and then output through the output layer; the output layer mainly comprises a convolution layer and a fifth upsampling layer which are connected in sequence.
Here, "corresponding" indicates that the preceding input and the subsequent output are located in the same color image processing module or the same depth image processing module.
The five neural network blocks adopt the MobileNetV2 network structure: the first neural network block adopts layers 1-4 of MobileNetV2 (repetition numbers n = 1 and n = 2, four layers in total), the second neural network block adopts layers 5-7 of MobileNetV2 (repetition number n = 3, three layers in total), the third neural network block adopts layers 8-11 of MobileNetV2 (repetition number n = 4, four layers in total), the fourth neural network block adopts layers 12-14 of MobileNetV2 (repetition number n = 3, three layers in total), and the fifth neural network block adopts layers 15-17 of MobileNetV2 (repetition number n = 3, three layers in total).
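Assuming the torchvision implementation of MobileNetV2 is used (an assumption; the text only names which MobileNetV2 layers each block takes), the five neural network blocks could be sliced out as follows:

    import torch
    import torchvision

    # torchvision's MobileNetV2 exposes its layers as the `features` Sequential;
    # slicing it by index reproduces the layer ranges named above (0-based indices).
    features = torchvision.models.mobilenet_v2().features

    block1 = features[0:4]    # layers 1-4,   output: 24 channels
    block2 = features[4:7]    # layers 5-7,   output: 32 channels
    block3 = features[7:11]   # layers 8-11,  output: 64 channels
    block4 = features[11:14]  # layers 12-14, output: 96 channels
    block5 = features[14:17]  # layers 15-17, output: 160 channels

    x = torch.randn(1, 3, 640, 480)   # H = 640, W = 480 as in the embodiment
    for blk in (block1, block2, block3, block4, block5):
        x = blk(x)
        print(x.shape)
    # The depth branch is symmetrical; its single-channel input would need the first
    # convolution adapted (or the depth map replicated to three channels), an
    # implementation choice not fixed by the text.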
Each feature re-extraction convolution block consists of four re-extraction modules connected in sequence, and each re-extraction module comprises a convolution layer, a normalization layer and an activation layer which are connected in sequence; the activation mode of all activation layers is ReLU6; the stride of all convolution layers is 1; the number of convolution kernels of the convolution layer in each re-extraction module is the same as the normalization parameter of its normalization layer. The numbers of convolution kernels of the convolution layers in the first, third and fourth re-extraction modules are the same, and the number of convolution kernels of the convolution layer in the second re-extraction module is half of that in the first re-extraction module; the convolution kernel sizes of the convolution layers in the first and third re-extraction modules are both 3×3, and the convolution kernel sizes of the convolution layers in the second and fourth re-extraction modules are both 1×1; the dilation factor of the convolution layer in the first re-extraction module is 1, and the dilation factor of the convolution layer in the third re-extraction module is 2; the zero padding parameter of the convolution layer in the first re-extraction module is 1, the zero padding parameter of the convolution layer in the third re-extraction module is 2, and the zero padding parameters of the convolution layers in the second and fourth re-extraction modules are both 0.
The number of convolution kernels of the convolution layer of the first re-extraction module is 24 in the first feature re-extraction convolution block, 32 in the second feature re-extraction convolution block, 64 in the third feature re-extraction convolution block, 96 in the fourth feature re-extraction convolution block, and 160 in the fifth feature re-extraction convolution block;
the grouping parameters of the four sequentially arranged convolution layers are 24, 12, 24 and 24 respectively in the first feature re-extraction convolution block, 32, 16, 16 and 32 respectively in the second feature re-extraction convolution block, 64, 32, 32 and 64 respectively in the third feature re-extraction convolution block, 64, 48, 48 and 96 respectively in the fourth feature re-extraction convolution block, and 96, 80, 80 and 160 respectively in the fifth feature re-extraction convolution block.
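A minimal sketch of one feature re-extraction convolution block is given below, assuming the grouping pattern (c, c/2, c/2, c) that the second to fourth blocks follow; the class and helper names are illustrative and not taken from the text.

    import torch
    import torch.nn as nn

    def conv_bn_relu6(c_in, c_out, k, dilation=1, padding=0, groups=1):
        """convolution -> normalization -> ReLU6, as in each re-extraction module"""
        return nn.Sequential(
            nn.Conv2d(c_in, c_out, k, stride=1, dilation=dilation,
                      padding=padding, groups=groups, bias=False),
            nn.BatchNorm2d(c_out),
            nn.ReLU6(inplace=True),
        )

    class FeatureReExtractionBlock(nn.Module):
        """Hedged sketch of a feature re-extraction convolution block for c input
        channels, following the 3x3 / 1x1 / dilated 3x3 / 1x1 layout described above."""
        def __init__(self, c):
            super().__init__()
            h = c // 2
            self.body = nn.Sequential(
                conv_bn_relu6(c, c, 3, dilation=1, padding=1, groups=c),   # 3x3, keeps c channels
                conv_bn_relu6(c, h, 1, groups=h),                          # 1x1, halves the channels
                conv_bn_relu6(h, c, 3, dilation=2, padding=2, groups=h),   # dilated 3x3, back to c channels
                conv_bn_relu6(c, c, 1, groups=c),                          # final 1x1
            )

        def forward(self, x):
            return self.body(x)

    # e.g. the 2nd feature re-extraction convolution block works on 32-channel features
    x = torch.randn(1, 32, 80, 60)
    print(FeatureReExtractionBlock(32)(x).shape)  # same spatial size, 32 channels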
Each block attention convolution block comprises a blocking layer, a channel attention layer and a spatial attention layer; the input of the block attention convolution block first passes through the blocking layer (which divides it into two channel groups, one for each attention layer) and is then fed to the channel attention layer and the spatial attention layer; the input of the block attention convolution block is multiplied by the output of the channel attention layer and by the output of the spatial attention layer respectively, and the two products are added to form the output of the block attention convolution block.
The blocking layer uses the split function provided by PyTorch, with the parameter equal to half of the number of channels of the input feature maps of the block attention convolution block;
the channel attention layer comprises an adaptive maximum pooling layer, a channel attention first convolution layer, a channel attention second convolution layer and a channel attention activation layer which are connected in sequence; the maximum pooling parameter of the adaptive maximum pooling layer is 1; the convolution kernel sizes of the channel attention first convolution layer and the channel attention second convolution layer are both 1×1, their strides are 1, their bias terms are False, their grouping parameters are the same, and the number of convolution kernels of the channel attention first convolution layer is half of that of the channel attention second convolution layer;
the number of convolution kernels of the channel attention first convolution layer of the channel attention layer in the first block attention convolution block is 12, and the grouping parameter is 12; the number of convolution kernels of the channel attention first convolution layer of the channel attention layer in the second block attention convolution block is 16, and the grouping parameter is 16; the number of convolution kernels of the first convolution layer of the channel attention layer in the third block attention convolution block is 32, and the grouping parameter is 32; the number of convolution kernels of the first convolution layer of the channel attention layer in the fourth block attention convolution block is 48, and the grouping parameter is 48; the number of convolution kernels of the first convolution layer of the channel attention layer in the fifth block attention convolution block is 80, and the grouping parameter is 80;
the spatial attention layer comprises a per-channel maximization layer, a spatial attention convolution layer and a spatial attention activation layer which are connected in sequence; the per-channel maximization layer uses the max function provided by PyTorch; the convolution kernel size of the spatial attention convolution layer is 3×3, the number of convolution kernels is 1, the dilation factor is 2, the zero padding parameter is 2, the grouping parameter is 1, and the bias term is False;
the activation mode adopted by both the channel attention layer and the spatial attention layer is the Sigmoid function.
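The following sketch shows how such a block attention convolution block could be written in PyTorch for c input channels (c = 24 for the first block); the class name is illustrative, while the layer parameters follow the description above.

    import torch
    import torch.nn as nn

    class BlockAttentionConvBlock(nn.Module):
        """Hedged sketch of a block attention convolution block for c input channels."""
        def __init__(self, c):
            super().__init__()
            h = c // 2  # split parameter: half of the input channel count
            self.channel_att = nn.Sequential(
                nn.AdaptiveMaxPool2d(1),
                nn.Conv2d(h, h, 1, stride=1, groups=h, bias=False),  # channel attention first convolution
                nn.Conv2d(h, c, 1, stride=1, groups=h, bias=False),  # channel attention second convolution
                nn.Sigmoid(),
            )
            self.spatial_conv = nn.Conv2d(1, 1, 3, stride=1, dilation=2,
                                          padding=2, groups=1, bias=False)

        def forward(self, x):
            h = x.size(1) // 2
            x1, x2 = torch.split(x, h, dim=1)   # blocking layer: two channel groups
            ca = self.channel_att(x1)           # c x 1 x 1 channel attention weights
            sa = torch.sigmoid(self.spatial_conv(torch.max(x2, dim=1, keepdim=True)[0]))  # 1 x H x W map
            return x * ca + x * sa              # reweight the input and add the two products

    rt = BlockAttentionConvBlock(24)(torch.randn(1, 24, 160, 120))
    print(rt.shape)  # torch.Size([1, 24, 160, 120])

Because the channel attention branch outputs c per-channel weights and the spatial attention branch outputs a single-channel map, both can be broadcast-multiplied with the full c-channel input before the two products are added.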
The upsampling layers use the UpsamplingBilinear2d function provided by PyTorch; the function parameter (scale factor) of the first upsampling layer is 2, and the function parameters of the second, third, fourth and fifth upsampling layers are 4; the convolution kernel size of the convolution layer in the output layer is 1×1, the number of convolution kernels is 41, the stride is 1, and the bias term is False.
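A sketch of the output layer follows; the number of input channels depends on how many feature maps the twelfth fusion layer concatenates (24 + 32 + 64 + 96 + 160 = 376 under one reading of the decoder) and is therefore an assumption here.

    import torch
    import torch.nn as nn

    c_in = 376  # assumed channel count after the concatenation in the twelfth fusion layer
    output_layer = nn.Sequential(
        nn.Conv2d(c_in, 41, kernel_size=1, stride=1, bias=False),  # 41 semantic classes
        nn.UpsamplingBilinear2d(scale_factor=4),                   # fifth upsampling layer
    )
    print(output_layer(torch.randn(1, c_in, 160, 120)).shape)  # torch.Size([1, 41, 640, 480])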
The input end of the color image input layer receives an indoor scene color image, the input end of the depth image input layer receives an indoor scene depth image, and the output of the output layer is 41 semantic segmentation predicted images corresponding to the indoor scene image input by the input layer.
The invention has the beneficial effects that:
1) The method starts from effectively extracting image channel and spatial semantic information and reducing information loss during gradient propagation as much as possible, and designs a module called the feature re-extraction module without significantly increasing the number of model parameters. The module has a roughly uniform columnar structure and comprises a 1×1 convolution and two 3×3 convolutional neural networks, one of which uses dilated convolution. To promote efficient propagation of the gradient, the two 3×3 convolutional neural networks are located at the two ends of the module and the 1×1 convolution is located inside the module.
2) Based on the operating principle of the human visual system, the method combines the attention mechanism of human vision with block (grouped) convolution to design a module called the block attention convolution block. The module contains channel-based and spatial-based attention mechanisms: using the block convolution principle, the input convolution feature map is first divided into two parts along the channel dimension; one part undergoes feature screening based on the channel attention mechanism and the other part undergoes feature screening based on the spatial attention mechanism; finally the two screened attention results are multiplied with the input convolution feature map, which effectively reduces redundant features.
3) Combining the two modules above and taking MobileNetV2 as the backbone, the method designs the network model MSCNet. Experiments show that the model has fewer parameters and higher speed, requiring only about 43 seconds per training epoch, while maintaining high accuracy; it is a lightweight network suitable for mobile terminals.
Drawings
FIG. 1 is a block diagram of an overall implementation of the method of the present invention;
FIG. 2a is the 1 st original color image of an indoor scene of the same scene;
FIG. 2b is the 1 st original indoor scene depth image of the same scene;
FIG. 2c is a predicted semantic segmentation image obtained by predicting the original indoor scene image shown in FIGS. 2a and 2b by the method of the present invention;
FIG. 3a is the 2 nd original color image of the indoor scene of the same scene;
FIG. 3b is the 2 nd original indoor scene depth image of the same scene;
FIG. 3c is a predicted semantic segmentation image obtained by predicting the original indoor scene image shown in FIGS. 3a and 3b by the method of the present invention;
FIG. 4a is the 3 rd original color image of the indoor scene of the same scene;
FIG. 4b is the 3 rd original indoor scene depth image of the same scene;
FIG. 4c is a predicted semantic segmentation image obtained by predicting the original indoor scene image shown in FIGS. 4a and 4b by the method of the present invention;
FIG. 5a is the 4th original indoor scene color image of the same scene;
FIG. 5b is the 4 th original indoor scene depth image of the same scene;
fig. 5c is a predicted semantic segmentation image obtained by predicting the original indoor scene image shown in fig. 5a and 5b by using the method of the present invention.
FIG. 6 is a structural diagram of a block attention convolution block of the present invention.
Detailed Description
The invention is described in further detail below with reference to the figures and the embodiments.
The invention provides an indoor scene semantic segmentation method based on an improved fully convolutional neural network, the overall implementation block diagram of which is shown in Fig. 1; it comprises two processes, namely a training stage and a testing stage.
the specific steps of the training phase process are as follows:
step 1_ 1: respectively selecting Q pairs of original indoor scene RGB color images and Depth map images and real semantic segmentation images corresponding to each pair of original indoor scene images, forming a training set, and recording the Q-th pair of original indoor scene images in the training set as { RGB (red green blue) imagesq(i,j),Depthq(i, j) }, set of training sets { RGBq(i,j),Depthq(i, j) } and the corresponding real semantic segmentation image are recorded as
Figure BDA0002860067140000081
Then, the real semantic segmentation images corresponding to each pair of original indoor scene images in the training set are processed into 41 independent thermal coding images by adopting the existing independent thermal coding technology (one-hot), and the 41 independent thermal coding images are obtained
Figure BDA0002860067140000082
The processed set of 41 one-hot coded images is denoted as
Figure BDA0002860067140000083
The indoor scene image comprises an RGB color image and a Depth map, Q is a positive integer, Q is more than or equal to 200, if Q is 794, Q is a positive integer, Q is more than or equal to 1 and less than or equal to Q, i is more than or equal to 1 and less than or equal to W, j is more than or equal to 1 and less than or equal to H, and W represents { RGB (red, green and blue) }q(i,j),Depthq(i, j) } width of the color map RGB and the depth map Dept, H denotes { RGBq(i,j),Depthq(i j) } the height of the color map RGB and Depth map Depth, e.g. taking W480, H640, RGBq(i,j),Depthq(i, j) respectively represent { RGB }q(i,j),DepthqThe pixel value of the pixel point with the coordinate position (i, j) in (i, j),
Figure BDA0002860067140000084
to represent
Figure BDA0002860067140000085
The middle coordinate position is the pixel value of the pixel point of (i, j); in this case, 1448 images in the training set of the indoor scene image database NYUv2 are directly selected as the original indoor scene image, and the purpose is to further selectTraining is facilitated by reducing the size of each image to 480 a width and 640 a height. In addition, in order to effectively relieve the problem of model overfitting, three data enhancement methods of random clipping, random horizontal turning and random scaling are adopted to expand data in a training set.
Step 1_2: construct a convolutional neural network classification training model: the convolutional neural network comprises an input layer, a hidden layer and an output layer; the hidden layer comprises the 1st to 5th neural network blocks, the 1st to 5th feature re-extraction convolution blocks, the 1st to 5th block attention convolution blocks, the 1st to 12th fusion layers, and the 1st to 4th upsampling layers.
For the input layer, the invention has two inputs, namely a color image RGB input layer and a Depth image Depth input layer. The color image RGB input end of the input layer receives the R channel component, the G channel component and the B channel component of the original RGB input image, and the Depth image Depth input end of the input layer receives the single channel component of the original Depth input image; the color image RGB output end of the input layer outputs the R, G and B channel components of the original input image to the hidden layer, and the Depth image Depth output end of the input layer outputs the single channel component of the original input image to the hidden layer. The input end of the input layer is required to receive an original input image with a width of 480 and a height of 640. Furthermore, the convolutional neural network structure whose input is the color map RGB is symmetrical to the convolutional neural network structure whose input is the Depth map Depth.
For the five neural network blocks, the MobileNetV2 network structure is adopted: the 1st neural network block adopts layers 1-4 of MobileNetV2 (repetition numbers n = 1 and n = 2, four layers in total), the 2nd neural network block adopts layers 5-7 of MobileNetV2 (repetition number n = 3, three layers in total), the 3rd neural network block adopts layers 8-11 of MobileNetV2 (repetition number n = 4, four layers in total), the 4th neural network block adopts layers 12-14 of MobileNetV2 (repetition number n = 3, three layers in total), and the 5th neural network block adopts layers 15-17 of MobileNetV2 (repetition number n = 3, three layers in total).
The color image RGB input end of the 1st neural network block receives the R, G and B channel components of the original input image output by the input layer, and its Depth map Depth input end receives the single channel component of the original input image output by the input layer; the color image RGB output end of the 1st neural network block outputs 24 feature maps, whose set is denoted R1, and its Depth map Depth output end outputs 24 feature maps, whose set is denoted D1; each feature map in R1 and D1 has a width of W/4 and a height of H/4.
The color image RGB input end of the 2nd neural network block receives all the feature maps in RF1, and its color image RGB output end outputs 32 feature maps, whose set is denoted R2; the Depth map Depth input end of the 2nd neural network block receives all the feature maps in DF1, and its Depth map Depth output end outputs 32 feature maps, whose set is denoted D2; each feature map in R2 and D2 has a width of W/8 and a height of H/8.
The color image RGB input end of the 3rd neural network block receives all the feature maps in RF2, and its color image RGB output end outputs 64 feature maps, whose set is denoted R3; the Depth map Depth input end of the 3rd neural network block receives all the feature maps in DF2, and its Depth map Depth output end outputs 64 feature maps, whose set is denoted D3; each feature map in R3 and D3 has a width of W/16 and a height of H/16.
The color image RGB input end of the 4th neural network block receives all the feature maps in RF3, and its color image RGB output end outputs 96 feature maps, whose set is denoted R4; the Depth map Depth input end of the 4th neural network block receives all the feature maps in DF3, and its Depth map Depth output end outputs 96 feature maps, whose set is denoted D4; each feature map in R4 and D4 has a width of W/16 and a height of H/16.
The color image RGB input end of the 5th neural network block receives all the feature maps in RF4, and its color image RGB output end outputs 160 feature maps, whose set is denoted R5; the Depth map Depth input end of the 5th neural network block receives all the feature maps in DF4, and its Depth map Depth output end outputs 160 feature maps, whose set is denoted D5; each feature map in R5 and D5 has a width of W/32 and a height of H/32.
The 1st feature re-extraction convolution block is composed of a fifty-first convolution layer, a fifty-first normalization layer, a fifty-first activation layer, a fifty-second convolution layer, a fifty-second normalization layer, a fifty-second activation layer, a fifty-third convolution layer, a fifty-third normalization layer, a fifty-third activation layer, a fifty-fourth convolution layer, a fifty-fourth normalization layer and a fifty-fourth activation layer which are arranged in sequence. The input end of the 1st feature re-extraction convolution block receives all the feature maps in R1 and D1; its color map RGB output end outputs 24 feature maps, whose set is denoted RS1, and its Depth map Depth output end outputs 24 feature maps, whose set is denoted DS1. The convolution kernel size of the fifty-first convolution layer is 3×3, the number of convolution kernels is 24, the stride is 1, the dilation factor is 1, the zero padding parameter is 1, the grouping parameter is 24, and the fifty-first layer normalization parameter is 24; the convolution kernel size of the fifty-second convolution layer is 1×1, the number of convolution kernels is 12, the stride is 1, the zero padding parameter is 0, the grouping parameter is 12, and the fifty-second layer normalization parameter is 12; the convolution kernel size of the fifty-third convolution layer is 3×3, the number of convolution kernels is 24, the stride is 1, the dilation factor is 2, the zero padding parameter is 2, the grouping parameter is 24, and the fifty-third layer normalization parameter is 24; the convolution kernel size of the fifty-fourth convolution layer is 1×1, the number of convolution kernels is 24, the stride is 1, the zero padding parameter is 0, the grouping parameter is 24, and the fifty-fourth layer normalization parameter is 24. The activation mode of all activation layers is 'ReLU6', and each feature map in RS1 and DS1 has a width of W/4 and a height of H/4.
For the 1st fusion layer, its color image RGB input end receives all the feature maps in R1 and all the feature maps in RS1; the 1st fusion layer fuses R1 and RS1 in the existing add (element-by-element addition) mode to obtain a set RA1, and its color image RGB output end outputs RA1. The Depth map Depth input end of the 1st fusion layer receives all the feature maps in D1 and all the feature maps in DS1; the 1st fusion layer fuses D1 and DS1 in the existing add (element-by-element addition) mode to obtain a set DA1, and its Depth map Depth output end outputs DA1. The total number of feature maps contained in each of RA1 and DA1 is 24, and each feature map in RA1 and DA1 has a width of W/4 and a height of H/4.
For the 1st block attention convolution block, whose structure is shown in Fig. 6, it is composed of a first blocking layer, a first channel attention layer and a first spatial attention layer which are arranged in sequence. The color map RGB input end of the 1st block attention convolution block receives all the feature maps in RA1, and its Depth map Depth input end receives all the feature maps in DA1; all the feature maps in RA1 and all the feature maps in DA1 are respectively input into the first block attention convolution block, where each input is first divided into two parts along the channel dimension by the first blocking layer, one part serving as the input of the channel attention layer and the other part as the input of the spatial attention layer; the results of the two attention branches are then multiplied with the original input feature maps respectively and the two products are added. The color map output end of the 1st block attention convolution block outputs 24 feature maps, whose set is denoted RT1, and its Depth map Depth output end outputs 24 feature maps, whose set is denoted DT1. The first blocking layer uses the split function provided by PyTorch, with the parameter equal to half of the number of channels of the original input feature maps; the first channel attention layer comprises a first adaptive maximum pooling layer, a first channel attention first convolution layer, a first channel attention second convolution layer and an activation layer, where the first adaptive maximum pooling parameter is 1, the convolution kernel size of the first channel attention first convolution layer is 1×1, the number of convolution kernels is 12, the stride is 1, the grouping parameter is 12, and the bias term is False; the convolution kernel size of the first channel attention second convolution layer is 1×1, the number of convolution kernels is 24, the stride is 1, the grouping parameter is 12, and the bias term is False; the activation mode of the first channel attention activation layer is 'Sigmoid'. The first spatial attention layer comprises a first per-channel maximization layer, a first spatial attention convolution layer and an activation function, where the first per-channel maximization layer uses the max function provided by PyTorch; the convolution kernel size of the first spatial attention convolution layer is 3×3, the number of convolution kernels is 1, the dilation factor is 2, the zero padding parameter is 2, the grouping parameter is 1, and the bias term is False; the activation mode of the first spatial attention activation layer is 'Sigmoid'. Each feature map in RT1 and DT1 has a width of W/4 and a height of H/4.
For the 2nd fusion layer, its color image RGB input end receives all the feature maps in RA1 and all the feature maps in DT1; the 2nd fusion layer fuses RA1 and DT1 in the existing add (element-by-element addition) mode to obtain a set RF1, and its color image RGB output end outputs RF1. The Depth map Depth input end of the 2nd fusion layer receives all the feature maps in RT1 and all the feature maps in DA1; the 2nd fusion layer fuses RT1 and DA1 in the existing add (element-by-element addition) mode to obtain a set DF1, and its Depth map Depth output end outputs DF1. The total number of feature maps contained in each of RF1 and DF1 is 24, and each feature map in RF1 and DF1 has a width of W/4 and a height of H/4.
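The data flow of this first stage (and, analogously, of the later stages) can be summarized by the following sketch; the function and argument names are illustrative, while the tensor names mirror the sets R1/D1, RA1/DA1, RT1/DT1 and RF1/DF1 above.

    import torch
    import torch.nn as nn

    def cross_modal_stage(R, D, reextract_rgb, reextract_depth, attention):
        """One encoder stage of the symmetric two-branch network (illustrative sketch)."""
        RA = R + reextract_rgb(R)    # neural network block output + feature re-extraction, element-wise add
        DA = D + reextract_depth(D)
        RT = attention(RA)           # the shared block attention convolution block refines each modality
        DT = attention(DA)
        RF = RA + DT                 # cross-modal fusion: each branch adds the other branch's attention output
        DF = DA + RT
        return RF, DF                # fed to the next neural network blocks and to the decoder

    # minimal usage with identity placeholders standing in for the sub-blocks
    R1, D1 = torch.randn(1, 24, 160, 120), torch.randn(1, 24, 160, 120)
    RF1, DF1 = cross_modal_stage(R1, D1, nn.Identity(), nn.Identity(), nn.Identity())
    print(RF1.shape, DF1.shape)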
The 2nd feature re-extraction convolution block is composed of a fifty-fifth convolution layer, a fifty-fifth normalization layer, a fifty-fifth activation layer, a fifty-sixth convolution layer, a fifty-sixth normalization layer, a fifty-sixth activation layer, a fifty-seventh convolution layer, a fifty-seventh normalization layer, a fifty-seventh activation layer, a fifty-eighth convolution layer, a fifty-eighth normalization layer and a fifty-eighth activation layer which are arranged in sequence. The color image RGB input end of the 2nd feature re-extraction convolution block receives all the feature maps in R2, and its color image RGB output end outputs 32 feature maps, whose set is denoted RS2; the Depth map Depth input end of the 2nd feature re-extraction convolution block receives all the feature maps in D2, and its Depth map Depth output end outputs 32 feature maps, whose set is denoted DS2. The convolution kernel size of the fifty-fifth convolution layer is 3×3, the number of convolution kernels is 32, the stride is 1, the dilation factor is 1, the zero padding parameter is 1, the grouping parameter is 32, and the fifty-fifth layer normalization parameter is 32; the convolution kernel size of the fifty-sixth convolution layer is 1×1, the number of convolution kernels is 16, the stride is 1, the zero padding parameter is 0, the grouping parameter is 16, and the fifty-sixth layer normalization parameter is 16; the convolution kernel size of the fifty-seventh convolution layer is 3×3, the number of convolution kernels is 32, the stride is 1, the dilation factor is 2, the zero padding parameter is 2, the grouping parameter is 16, and the fifty-seventh layer normalization parameter is 32; the convolution kernel size of the fifty-eighth convolution layer is 1×1, the number of convolution kernels is 32, the stride is 1, the zero padding parameter is 0, the grouping parameter is 32, and the fifty-eighth layer normalization parameter is 32. The activation mode of all activation layers is 'ReLU6', and each feature map in RS2 and DS2 has a width of W/8 and a height of H/8.
For the 3rd fusion layer, its color image RGB input end receives all the feature maps in R2 and all the feature maps in RS2; the 3rd fusion layer fuses R2 and RS2 in the existing add (element-by-element addition) mode to obtain a set RA2, and its color image RGB output end outputs RA2. The Depth map Depth input end of the 3rd fusion layer receives all the feature maps in D2 and all the feature maps in DS2; the 3rd fusion layer fuses D2 and DS2 in the existing add (element-by-element addition) mode to obtain a set DA2, and its Depth map Depth output end outputs DA2. The total number of feature maps contained in each of RA2 and DA2 is 32, and each feature map in RA2 and DA2 has a width of W/8 and a height of H/8.
The 2nd block attention convolution block is composed of a second blocking layer, a second channel attention layer and a second spatial attention layer which are arranged in sequence. The color image RGB input end of the 2nd block attention convolution block receives all the feature maps in RA2; all the feature maps in RA2 are input into the second block attention convolution block, where the input feature maps are first divided into two parts along the channel dimension by the second blocking layer, one part serving as the input of the channel attention layer and the other part as the input of the spatial attention layer; the results of the two attention branches are then multiplied with the original input feature maps respectively and the two products are added; the color image RGB output end of the 2nd block attention convolution block outputs 32 feature maps, whose set is denoted RT2. The Depth map Depth input end of the 2nd block attention convolution block receives all the feature maps in DA2, which are processed in the same way, and its Depth map Depth output end outputs 32 feature maps, whose set is denoted DT2. The second blocking layer uses the split function provided by PyTorch, with the parameter equal to half of the number of channels of the original input feature maps; the second channel attention layer comprises a second adaptive maximum pooling layer, a second channel attention first convolution layer, a second channel attention second convolution layer and an activation layer, where the second adaptive maximum pooling parameter is 1, the convolution kernel size of the second channel attention first convolution layer is 1×1, the number of convolution kernels is 16, the stride is 1, the grouping parameter is 16, and the bias term is False; the convolution kernel size of the second channel attention second convolution layer is 1×1, the number of convolution kernels is 32, the stride is 1, the grouping parameter is 16, and the bias term is False; the activation mode of the second channel attention activation layer is 'Sigmoid'. The second spatial attention layer comprises a second per-channel maximization layer, a second spatial attention convolution layer and an activation function, where the second per-channel maximization layer uses the max function provided by PyTorch; the convolution kernel size of the second spatial attention convolution layer is 3×3, the number of convolution kernels is 1, the dilation factor is 2, the zero padding parameter is 2, the grouping parameter is 1, and the bias term is False; the activation mode of the second spatial attention activation layer is 'Sigmoid'. Each feature map in RT2 and DT2 has a width of W/8 and a height of H/8.
For the 4th fusion layer, its color image RGB input end receives all the feature maps in RA2 and DT2; the 4th fusion layer fuses RA2 and DT2 in the existing add (element-by-element addition) mode to obtain a set RF2, and its color image RGB output end outputs RF2. The Depth map Depth input end of the 4th fusion layer receives all the feature maps in RT2 and all the feature maps in DA2; the 4th fusion layer fuses RT2 and DA2 in the existing add (element-by-element addition) mode to obtain a set DF2, and its Depth map Depth output end outputs DF2. The total number of feature maps contained in each of RF2 and DF2 is 32, and each feature map in RF2 and DF2 has a width of W/8 and a height of H/8.
The 3rd feature re-extraction convolution block is composed of a fifty-ninth convolution layer, a fifty-ninth normalization layer, a fifty-ninth activation layer, a sixtieth convolution layer, a sixtieth normalization layer, a sixtieth activation layer, a sixty-first convolution layer, a sixty-first normalization layer, a sixty-first activation layer, a sixty-second convolution layer, a sixty-second normalization layer and a sixty-second activation layer which are arranged in sequence. The color image RGB input end of the 3rd feature re-extraction convolution block receives all the feature maps in R3, and its color image RGB output end outputs 64 feature maps, whose set is denoted RS3; the Depth map Depth input end of the 3rd feature re-extraction convolution block receives all the feature maps in D3, and its Depth map Depth output end outputs 64 feature maps, whose set is denoted DS3. The convolution kernel size of the fifty-ninth convolution layer is 3×3, the number of convolution kernels is 64, the stride is 1, the dilation factor is 1, the zero padding parameter is 1, the grouping parameter is 64, and the fifty-ninth layer normalization parameter is 64; the convolution kernel size of the sixtieth convolution layer is 1×1, the number of convolution kernels is 32, the stride is 1, the zero padding parameter is 0, the grouping parameter is 32, and the sixtieth layer normalization parameter is 32; the convolution kernel size of the sixty-first convolution layer is 3×3, the number of convolution kernels is 64, the stride is 1, the dilation factor is 2, the zero padding parameter is 2, the grouping parameter is 32, and the sixty-first layer normalization parameter is 64; the convolution kernel size of the sixty-second convolution layer is 1×1, the number of convolution kernels is 64, the stride is 1, the zero padding parameter is 0, the grouping parameter is 64, and the sixty-second layer normalization parameter is 64. The activation mode of all activation layers is 'ReLU6', and each feature map in RS3 and DS3 has a width of W/16 and a height of H/16.
For the 5th fusion layer, its color image RGB input end receives all the feature maps in R3 and all the feature maps in RS3; the 5th fusion layer fuses R3 and RS3 in the existing add (element-by-element addition) mode to obtain a set RA3, and its color image RGB output end outputs RA3. The Depth map Depth input end of the 5th fusion layer receives all the feature maps in D3 and all the feature maps in DS3; the 5th fusion layer fuses D3 and DS3 in the existing add (element-by-element addition) mode to obtain a set DA3, and its Depth map Depth output end outputs DA3. The total number of feature maps contained in each of RA3 and DA3 is 64, and each feature map in RA3 and DA3 has a width of W/16 and a height of H/16.
For the 3rd block attention convolution block, it consists of a third block layer, a third channel attention layer and a third spatial dimension attention layer which are sequentially arranged; the color image RGB input end of the 3rd block attention convolution block receives all the feature maps in RA3, all the feature maps in RA3 are input into the third block attention convolution block, all the input feature maps are first divided into two parts along the channel dimension by the third block layer, one part serves as the input of the channel attention layer and the other part serves as the input of the spatial dimension attention layer, finally the results of the attention processing on the two sides are respectively multiplied with the original input feature maps and the multiplication results are added, the color image RGB output end of the 3rd block attention convolution block outputs 64 feature maps, and the set formed by the 64 feature maps is denoted as RT3; the Depth map Depth input end of the 3rd block attention convolution block receives all the feature maps in DA3, all the feature maps in DA3 are input into the third block attention convolution block, all the input feature maps are first divided into two parts along the channel dimension by the third block layer, one part serves as the input of the channel attention layer and the other part serves as the input of the spatial dimension attention layer, finally the results of the attention processing on the two sides are respectively multiplied with the original input feature maps and the multiplication results are added, the Depth map Depth output end of the 3rd block attention convolution block outputs 64 feature maps, and the set formed by the 64 feature maps is denoted as DT3; the third block layer adopts the split function provided by PyTorch, with the parameter set to half the number of channels of the original input feature maps; the third channel attention layer comprises a third adaptive maximum pooling layer, a third channel attention first convolution layer, a third channel attention second convolution layer and an activation layer, wherein the third adaptive maximum pooling parameter is 1, the convolution kernel size of the third channel attention first convolution layer is 1x1, the number of convolution kernels is 32, the stride is 1, the grouping (groups) parameter is 32, and the bias term (bias) is False; the convolution kernel size of the third channel attention second convolution layer is 1x1, the number of convolution kernels is 64, the stride is 1, the grouping (groups) parameter is 32, and the bias term (bias) is False; the activation mode of the third channel attention activation layer is "Sigmoid"; the third spatial dimension attention layer comprises a third per-channel maximization layer, a third spatial dimension attention convolution layer and an activation function, wherein the third per-channel maximization layer adopts the max function provided by PyTorch; the convolution kernel size of the third spatial dimension attention convolution layer is 3x3, the number of convolution kernels is 1, the dilation factor (dilation) is 2, the zero padding (padding) parameter is 2, the grouping (groups) parameter is 1, the bias term (bias) is False, and the activation mode of the third spatial dimension attention activation layer is "Sigmoid". Each feature map in RT3 and DT3 has the same width and height as the feature maps in RA3.
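As a minimal illustrative sketch (not part of the original description), the block attention convolution block described above can be written in PyTorch roughly as follows; the class name BlockAttention and the argument in_channels are assumed names, and the parameter values correspond to the 3rd block (64 input channels):

import torch
import torch.nn as nn

class BlockAttention(nn.Module):
    # Sketch of the block attention convolution block (illustrative names).
    def __init__(self, in_channels=64):          # 64 channels for the 3rd block
        super().__init__()
        half = in_channels // 2
        # channel attention: adaptive max pooling -> two grouped 1x1 convolutions -> Sigmoid
        self.channel_att = nn.Sequential(
            nn.AdaptiveMaxPool2d(1),
            nn.Conv2d(half, half, kernel_size=1, stride=1, groups=half, bias=False),
            nn.Conv2d(half, in_channels, kernel_size=1, stride=1, groups=half, bias=False),
            nn.Sigmoid(),
        )
        # spatial dimension attention: per-channel max -> dilated 3x3 convolution -> Sigmoid
        self.spatial_conv = nn.Conv2d(1, 1, kernel_size=3, stride=1,
                                      dilation=2, padding=2, groups=1, bias=False)
        self.sigmoid = nn.Sigmoid()

    def forward(self, x):
        half = x.size(1) // 2
        # block layer: split the input feature maps into two parts along the channel dimension
        x_c, x_s = torch.split(x, half, dim=1)
        ca = self.channel_att(x_c)                    # B x C x 1 x 1 channel weights
        sa, _ = torch.max(x_s, dim=1, keepdim=True)   # per-channel maximization, B x 1 x H x W
        sa = self.sigmoid(self.spatial_conv(sa))
        # multiply both attention results with the original input and add the products
        return x * ca + x * sa

A forward pass such as BlockAttention(64)(torch.randn(1, 64, 30, 40)) returns a tensor of the same shape, which is consistent with the statement that RT3 and DT3 have the same size as RA3 and DA3.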
For the sixth fusion layer, the color image RGB input end of the 6th fusion layer receives all the feature maps in RA3 and all the feature maps in DT3, the 6th fusion layer fuses RA3 and DT3 in the existing add (element-by-element addition) mode to obtain a set RF3, and the color image RGB output end of the 6th fusion layer outputs RF3; the Depth map Depth input end of the 6th fusion layer receives all the feature maps in RT3 and all the feature maps in DA3, the 6th fusion layer fuses RT3 and DA3 in the existing add (element-by-element addition) mode to obtain a set DF3, and the Depth map Depth output end of the 6th fusion layer outputs DF3; wherein RF3 and DF3 each contain 64 feature maps, and each feature map in RF3 and DF3 has the same width and height as the feature maps in RA3.
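The data flow of this stage (feature re-extraction, element-by-element addition, block attention, and cross-modal addition) can be summarized by the following sketch; the function and module names are illustrative placeholders for the layers defined above, not identifiers from the original description:

# Sketch of one stage of the two-stream fusion pattern, using stage 3 as an example.
# R3 and D3 are the RGB and Depth backbone outputs (tensors of equal shape).
def fuse_stage(R3, D3, reextract_rgb, reextract_depth, attention_rgb, attention_depth):
    RS3 = reextract_rgb(R3)      # 3rd feature re-extraction convolution block, RGB stream
    DS3 = reextract_depth(D3)    # 3rd feature re-extraction convolution block, Depth stream
    RA3 = R3 + RS3               # 5th fusion layer (element-by-element addition)
    DA3 = D3 + DS3
    RT3 = attention_rgb(RA3)     # 3rd block attention convolution block, RGB side
    DT3 = attention_depth(DA3)   # 3rd block attention convolution block, Depth side
    RF3 = RA3 + DT3              # 6th fusion layer: cross-modal addition
    DF3 = RT3 + DA3
    return RF3, DF3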
For the 4th feature re-extraction convolution block, it consists of a sixty-third convolution layer, a sixty-third normalization layer, a sixty-third activation layer, a sixty-fourth convolution layer, a sixty-fourth normalization layer, a sixty-fourth activation layer, a sixty-fifth convolution layer, a sixty-fifth normalization layer, a sixty-fifth activation layer, a sixty-sixth convolution layer, a sixty-sixth normalization layer and a sixty-sixth activation layer which are sequentially arranged; the color image RGB input end of the 4th feature re-extraction convolution block receives all the feature maps in R4, the output end of the 4th feature re-extraction convolution block outputs 96 feature maps, and the set formed by the 96 feature maps is denoted as RS4; the Depth map Depth input end of the 4th feature re-extraction convolution block receives all the feature maps in D4, the output end of the 4th feature re-extraction convolution block outputs 96 feature maps, and the set formed by the 96 feature maps is denoted as DS4; wherein the convolution kernel size of the sixty-third convolution layer is 3x3, the number of convolution kernels is 96, the stride is 1, the dilation factor (dilation) is 1, the zero padding (padding) parameter is 1, the grouping (groups) parameter is 64, and the sixty-third layer normalization parameter is 96; the convolution kernel size of the sixty-fourth convolution layer is 1x1, the number of convolution kernels is 48, the stride is 1, the zero padding (padding) parameter is 0, the grouping (groups) parameter is 48, and the sixty-fourth layer normalization parameter is 48; the convolution kernel size of the sixty-fifth convolution layer is 3x3, the number of convolution kernels is 96, the stride is 1, the dilation factor (dilation) is 2, the zero padding (padding) parameter is 2, the grouping (groups) parameter is 48, and the sixty-fifth layer normalization parameter is 96; the convolution kernel size of the sixty-sixth convolution layer is 1x1, the number of convolution kernels is 96, the stride is 1, the zero padding (padding) parameter is 0, the grouping (groups) parameter is 96, and the sixty-sixth layer normalization parameter is 96. The activation mode of all the activation layers is "ReLU6", and each feature map in RS4 and DS4 has the same width and height as the feature maps in R4.
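A hedged PyTorch sketch of this feature re-extraction convolution block is given below; the helper name conv_bn_relu6 and the variable reextract4 are illustrative. Note that the description above lists a grouping parameter of 64 for the sixty-third convolution layer, which would not divide the 96 channels evenly in PyTorch, so this sketch assumes a grouping of 96 for that layer:

import torch.nn as nn

def conv_bn_relu6(in_ch, out_ch, k, dilation, padding, groups):
    # one re-extraction module: convolution -> batch normalization -> ReLU6 activation
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, kernel_size=k, stride=1,
                  dilation=dilation, padding=padding, groups=groups),
        nn.BatchNorm2d(out_ch),
        nn.ReLU6(inplace=True),
    )

# Sketch of the 4th feature re-extraction convolution block (96-channel input from R4/D4).
reextract4 = nn.Sequential(
    conv_bn_relu6(96, 96, k=3, dilation=1, padding=1, groups=96),  # 63rd conv layer (groups assumed 96)
    conv_bn_relu6(96, 48, k=1, dilation=1, padding=0, groups=48),  # 64th conv layer
    conv_bn_relu6(48, 96, k=3, dilation=2, padding=2, groups=48),  # 65th conv layer
    conv_bn_relu6(96, 96, k=1, dilation=1, padding=0, groups=96),  # 66th conv layer
)

Because every convolution uses stride 1 with a padding matched to its kernel size and dilation, the spatial size of the input is preserved, which is consistent with RS4 and DS4 having the same width and height as R4 and D4.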
For the seventh fusion layer, the color image RGB input end of the 7th fusion layer receives all the feature maps in R4 and all the feature maps in RS4, the 7th fusion layer fuses R4 and RS4 in the existing add (element-by-element addition) mode to obtain a set RA4, and the color image RGB output end of the 7th fusion layer outputs RA4; the Depth map Depth input end of the 7th fusion layer receives all the feature maps in D4 and all the feature maps in DS4, the 7th fusion layer fuses D4 and DS4 in the existing add (element-by-element addition) mode to obtain a set DA4, and the Depth map Depth output end of the 7th fusion layer outputs DA4; wherein RA4 and DA4 each contain 96 feature maps, and each feature map in RA4 and DA4 has the same width and height as the feature maps in R4.
For the 4th block attention convolution block, it consists of a fourth block layer, a fourth channel attention layer and a fourth spatial dimension attention layer which are sequentially arranged; the color image RGB input end of the 4th block attention convolution block receives all the feature maps in RA4, all the feature maps in RA4 are input into the fourth block attention convolution block, all the input feature maps are first divided into two parts along the channel dimension by the fourth block layer, one part serves as the input of the channel attention layer and the other part serves as the input of the spatial dimension attention layer, finally the results of the attention processing on the two sides are respectively multiplied with the original input feature maps and the multiplication results are added, the color image RGB output end of the 4th block attention convolution block outputs 96 feature maps, and the set formed by the 96 feature maps is denoted as RT4; the Depth map Depth input end of the 4th block attention convolution block receives all the feature maps in DA4, all the feature maps in DA4 are input into the fourth block attention convolution block, all the input feature maps are first divided into two parts along the channel dimension by the fourth block layer, one part serves as the input of the channel attention layer and the other part serves as the input of the spatial dimension attention layer, finally the results of the attention processing on the two sides are respectively multiplied with the original input feature maps and the multiplication results are added, the Depth map Depth output end of the 4th block attention convolution block outputs 96 feature maps, and the set formed by the 96 feature maps is denoted as DT4; the fourth block layer adopts the split function provided by PyTorch, with the parameter set to half the number of channels of the original input feature maps; the fourth channel attention layer comprises a fourth adaptive maximum pooling layer, a fourth channel attention first convolution layer, a fourth channel attention second convolution layer and an activation layer, wherein the fourth adaptive maximum pooling parameter is 1, the convolution kernel size of the fourth channel attention first convolution layer is 1x1, the number of convolution kernels is 48, the stride is 1, the grouping (groups) parameter is 48, and the bias term (bias) is False; the convolution kernel size of the fourth channel attention second convolution layer is 1x1, the number of convolution kernels is 96, the stride is 1, the grouping (groups) parameter is 48, and the bias term (bias) is False; the activation mode of the fourth channel attention activation layer is "Sigmoid"; the fourth spatial dimension attention layer comprises a fourth per-channel maximization layer, a fourth spatial dimension attention convolution layer and an activation function, wherein the fourth per-channel maximization layer adopts the max function provided by PyTorch; the convolution kernel size of the fourth spatial dimension attention convolution layer is 3x3, the number of convolution kernels is 1, the dilation factor (dilation) is 2, the zero padding (padding) parameter is 2, the grouping (groups) parameter is 1, the bias term (bias) is False, and the activation mode of the fourth spatial dimension attention activation layer is "Sigmoid". Each feature map in RT4 and DT4 has the same width and height as the feature maps in RA4.
For the eighth fusion layer, the color image RGB input end of the 8th fusion layer receives all the feature maps in RA4 and all the feature maps in DT4, the 8th fusion layer fuses RA4 and DT4 in the existing add (element-by-element addition) mode to obtain a set RF4, and the color image RGB output end of the 8th fusion layer outputs RF4; the Depth map Depth input end of the 8th fusion layer receives all the feature maps in RT4 and all the feature maps in DA4, the 8th fusion layer fuses RT4 and DA4 in the existing add (element-by-element addition) mode to obtain a set DF4, and the Depth map Depth output end of the 8th fusion layer outputs DF4; wherein RF4 and DF4 each contain 96 feature maps, and each feature map in RF4 and DF4 has the same width and height as the feature maps in RA4.
For the 5th feature re-extraction convolution block, it consists of a sixty-seventh convolution layer, a sixty-seventh normalization layer, a sixty-seventh activation layer, a sixty-eighth convolution layer, a sixty-eighth normalization layer, a sixty-eighth activation layer, a sixty-ninth convolution layer, a sixty-ninth normalization layer, a sixty-ninth activation layer, a seventieth convolution layer, a seventieth normalization layer and a seventieth activation layer which are sequentially arranged; the color image RGB input end of the 5th feature re-extraction convolution block receives all the feature maps in R5, the color image RGB output end of the 5th feature re-extraction convolution block outputs 160 feature maps, and the set formed by the 160 feature maps is denoted as RS5; the Depth map Depth input end of the 5th feature re-extraction convolution block receives all the feature maps in D5, the Depth map Depth output end of the 5th feature re-extraction convolution block outputs 160 feature maps, and the set formed by the 160 feature maps is denoted as DS5; wherein the convolution kernel size of the sixty-seventh convolution layer is 3x3, the number of convolution kernels is 160, the stride is 1, the dilation factor (dilation) is 1, the zero padding (padding) parameter is 1, the grouping (groups) parameter is 96, and the sixty-seventh layer normalization parameter is 160; the convolution kernel size of the sixty-eighth convolution layer is 1x1, the number of convolution kernels is 80, the stride is 1, the zero padding (padding) parameter is 0, the grouping (groups) parameter is 80, and the sixty-eighth layer normalization parameter is 80; the convolution kernel size of the sixty-ninth convolution layer is 3x3, the number of convolution kernels is 160, the stride is 1, the dilation factor (dilation) is 2, the zero padding (padding) parameter is 2, the grouping (groups) parameter is 80, and the sixty-ninth layer normalization parameter is 160; the convolution kernel size of the seventieth convolution layer is 1x1, the number of convolution kernels is 160, the stride is 1, the zero padding (padding) parameter is 0, the grouping (groups) parameter is 160, and the seventieth layer normalization parameter is 160. The activation mode of all the activation layers is "ReLU6", and each feature map in RS5 and DS5 has the same width and height as the feature maps in R5.
For the ninth fusion layer, the color image RGB input end of the 9th fusion layer receives all the feature maps in R5 and all the feature maps in RS5, the 9th fusion layer fuses R5 and RS5 in the existing add (element-by-element addition) mode to obtain a set RA5, and the color image RGB output end of the 9th fusion layer outputs RA5; the Depth map Depth input end of the 9th fusion layer receives all the feature maps in D5 and all the feature maps in DS5, the 9th fusion layer fuses D5 and DS5 in the existing add (element-by-element addition) mode to obtain a set DA5, and the Depth map Depth output end of the 9th fusion layer outputs DA5; wherein RA5 and DA5 each contain 160 feature maps, and each feature map in RA5 and DA5 has the same width and height as the feature maps in R5.
For the 5th block attention convolution block, it consists of a fifth block layer, a fifth channel attention layer and a fifth spatial dimension attention layer which are sequentially arranged; the color image RGB input end of the 5th block attention convolution block receives all the feature maps in RA5, all the feature maps in RA5 are input into the fifth block attention convolution block, all the input feature maps are first divided into two parts along the channel dimension by the fifth block layer, one part serves as the input of the channel attention layer and the other part serves as the input of the spatial dimension attention layer, finally the results of the attention processing on the two sides are respectively multiplied with the original input feature maps and the multiplication results are added, the color image RGB output end of the 5th block attention convolution block outputs 160 feature maps, and the set formed by the 160 feature maps is denoted as RT5; the Depth map Depth input end of the 5th block attention convolution block receives all the feature maps in DA5, all the feature maps in DA5 are input into the fifth block attention convolution block, all the input feature maps are first divided into two parts along the channel dimension by the fifth block layer, one part serves as the input of the channel attention layer and the other part serves as the input of the spatial dimension attention layer, finally the results of the attention processing on the two sides are respectively multiplied with the original input feature maps and the multiplication results are added, the Depth map Depth output end of the 5th block attention convolution block outputs 160 feature maps, and the set formed by the 160 feature maps is denoted as DT5; the fifth block layer adopts the split function provided by PyTorch, with the parameter set to half the number of channels of the original input feature maps; the fifth channel attention layer comprises a fifth adaptive maximum pooling layer, a fifth channel attention first convolution layer, a fifth channel attention second convolution layer and an activation layer, wherein the fifth adaptive maximum pooling parameter is 1, the convolution kernel size of the fifth channel attention first convolution layer is 1x1, the number of convolution kernels is 80, the stride is 1, the grouping (groups) parameter is 80, and the bias term (bias) is False; the convolution kernel size of the fifth channel attention second convolution layer is 1x1, the number of convolution kernels is 160, the stride is 1, the grouping (groups) parameter is 80, and the bias term (bias) is False; the activation mode of the fifth channel attention activation layer is "Sigmoid"; the fifth spatial dimension attention layer comprises a fifth per-channel maximization layer, a fifth spatial dimension attention convolution layer and an activation function, wherein the fifth per-channel maximization layer adopts the max function provided by PyTorch; the convolution kernel size of the fifth spatial dimension attention convolution layer is 3x3, the number of convolution kernels is 1, the dilation factor (dilation) is 2, the zero padding (padding) parameter is 2, the grouping (groups) parameter is 1, the bias term (bias) is False, and the activation mode of the fifth spatial dimension attention activation layer is "Sigmoid". Each feature map in RT5 and DT5 has the same width and height as the feature maps in RA5.
For the tenth fusion layer, the color image RGB input end of the 10th fusion layer receives all the feature maps in RA5 and all the feature maps in DT5, the 10th fusion layer fuses RA5 and DT5 in the existing add (element-by-element addition) mode to obtain a set RF5, and the color image RGB output end of the 10th fusion layer outputs RF5; the Depth map Depth input end of the 10th fusion layer receives all the feature maps in RT5 and all the feature maps in DA5, the 10th fusion layer fuses RT5 and DA5 in the existing add (element-by-element addition) mode to obtain a set DF5, and the Depth map Depth output end of the 10th fusion layer outputs DF5; wherein RF5 and DF5 each contain 160 feature maps, and each feature map in RF5 and DF5 has the same width and height as the feature maps in RA5.
For the first upsampling layer, the input end of the first upsampling layer receives all the feature maps in RF2 and all the feature maps in DF2, RF2 and DF2 are fused in the existing add (element-by-element addition) mode to obtain a set RD2, the first upsampling layer enlarges the feature maps in RD2 to twice their height and width through the UpsamplingBilinear2d function provided by PyTorch with the function parameter set to 2, the set formed by the enlarged feature maps is denoted as RD2x2, and the output end of the first upsampling layer outputs RD2x2; wherein RD2x2 contains 32 feature maps (channels) in total, and each feature map in RD2x2 has twice the width and height of the feature maps in RF2.
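A minimal sketch of this upsampling layer follows; the function name first_upsampling_layer is illustrative, and nn.UpsamplingBilinear2d is the PyTorch module referred to above:

import torch.nn as nn

# Sketch of the first upsampling layer: element-by-element addition of the two
# streams followed by 2x bilinear upsampling of height and width.
upsample_x2 = nn.UpsamplingBilinear2d(scale_factor=2)

def first_upsampling_layer(RF2, DF2):
    RD2 = RF2 + DF2          # fuse RF2 and DF2 by element-by-element addition
    return upsample_x2(RD2)  # RD2x2: same 32 channels, twice the height and width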
For the second upsampling layer, the input end of the second upsampling layer receives all the feature maps in RF3 and all the feature maps in DF3, RF3 and DF3 are fused in the existing add (element-by-element addition) mode to obtain a set RD3, the second upsampling layer enlarges the feature maps in RD3 to four times their height and width through the UpsamplingBilinear2d function provided by PyTorch with the function parameter set to 4, the set formed by the enlarged feature maps is denoted as RD3x4, and the output end of the second upsampling layer outputs RD3x4; wherein RD3x4 contains 64 feature maps in total, and each feature map in RD3x4 has four times the width and height of the feature maps in RF3.
For the third upsampling layer, the input end of the third upsampling layer receives all the feature maps in RF4 and all the feature maps in DF4, RF4 and DF4 are fused in the existing add (element-by-element addition) mode to obtain a set RD4, the third upsampling layer enlarges the feature maps in RD4 to four times their height and width through the UpsamplingBilinear2d function provided by PyTorch with the function parameter set to 4, the set formed by the enlarged feature maps is denoted as RD4x4, and the output end of the third upsampling layer outputs RD4x4; wherein RD4x4 contains 96 feature maps in total, and each feature map in RD4x4 has four times the width and height of the feature maps in RF4.
For the fourth upsampling layer, the input end of the fourth upsampling layer receives all the feature maps in RF5 and all the feature maps in DF5, RF5 and DF5 are fused in the existing add (element-by-element addition) mode to obtain a set RD5, the fourth upsampling layer enlarges the feature maps in RD5 to four times their height and width through the UpsamplingBilinear2d function provided by PyTorch with the function parameter set to 4, the set formed by the enlarged feature maps is denoted as RD5x4, and the output end of the fourth upsampling layer outputs RD5x4; wherein RD5x4 contains 160 feature maps in total, and each feature map in RD5x4 has four times the width and height of the feature maps in RF5.
For the eleventh fusion layer, the input end of the eleventh fusion layer receives all the feature maps in RF1 and all the feature maps in DF1, and RF1 and DF1 are fused in the existing add (element-by-element addition) mode to obtain a set RD1; wherein RD1 contains 24 feature maps in total, and each feature map in RD1 has the same width and height as the feature maps in RF1.
For the twelfth fusion layer, the input end of the twelfth fusion layer receives all the feature maps in RD1, RD2x2, RD3x4, RD4x4 and RD5x4, the twelfth fusion layer connects RD1, RD2x2, RD3x4, RD4x4 and RD5x4 in the existing concatenate mode to obtain a set RDM, and the output end of the twelfth fusion layer outputs the RDM; the RDM contains 376 (24+32+64+96+160) feature maps in total, and each feature map in the RDM has the same width and height as the feature maps in RD1.
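The twelfth fusion layer corresponds to a channel-wise concatenation, sketched below with illustrative names:

import torch

# Sketch of the twelfth fusion layer: concatenate the five sets of feature maps
# along the channel dimension, giving 24 + 32 + 64 + 96 + 160 = 376 channels.
def twelfth_fusion_layer(RD1, RD2x2, RD3x4, RD4x4, RD5x4):
    return torch.cat([RD1, RD2x2, RD3x4, RD4x4, RD5x4], dim=1)  # RDM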
For the output layer, the output layer consists of a seventy-first convolution layer and a fifth upsampling layer which are sequentially connected, wherein the convolution kernel size of the seventy-first convolution layer is 1x1, the number of convolution kernels is 41, the stride is 1, and the bias term (bias) is False; the fifth upsampling layer enlarges the feature maps to four times their height and width through the UpsamplingBilinear2d function provided by PyTorch with the function parameter set to 4; the input end of the output layer receives all the feature maps in the RDM, and the output end of the output layer outputs 41 semantic segmentation prediction maps corresponding to the original input image.
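A sketch of the output layer follows; the input channel count of 376 is taken from the RDM described above, and the variable name output_layer is illustrative:

import torch.nn as nn

# Sketch of the output layer: a 1x1 convolution producing 41 class score maps,
# followed by 4x bilinear upsampling back to the original input resolution.
output_layer = nn.Sequential(
    nn.Conv2d(376, 41, kernel_size=1, stride=1, bias=False),
    nn.UpsamplingBilinear2d(scale_factor=4),
)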
Step 1_3: Each pair of original indoor scene RGB image and Depth image in the training set is input into the convolutional neural network as the original input images for training, obtaining 41 semantic segmentation prediction maps corresponding to each pair of original indoor scene images in the training set; the set formed by the 41 semantic segmentation prediction maps corresponding to {RGB_q(i,j), Depth_q(i,j)} is denoted as the prediction set of {RGB_q(i,j), Depth_q(i,j)}.
Step 1_4: Calculate the loss function value between the set formed by the 41 semantic segmentation prediction maps corresponding to each pair of original indoor scene images in the training set and the set formed by the 41 one-hot encoded images obtained from the corresponding real semantic segmentation image; this loss function value is obtained using categorical cross entropy.
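In PyTorch, categorical cross entropy over the 41 classes can be computed as sketched below; converting the one-hot encoded label maps to class indices with argmax is an assumption of this sketch, since nn.CrossEntropyLoss expects index targets:

import torch
import torch.nn as nn

criterion = nn.CrossEntropyLoss()

def segmentation_loss(pred, one_hot_target):
    # pred:           B x 41 x H x W raw scores output by the network
    # one_hot_target: B x 41 x H x W one-hot encoded real semantic segmentation maps
    class_indices = one_hot_target.argmax(dim=1)   # B x H x W class index map
    return criterion(pred, class_indices)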
Step 1_5: Repeat step 1_3 and step 1_4 a total of V times to obtain a trained convolutional neural network classification model and Q multiplied by V loss function values; then find the smallest loss function value among the Q multiplied by V loss function values; then take the weight vector and bias term corresponding to the smallest loss function value as the optimal weight vector and optimal bias term of the convolutional neural network classification training model, correspondingly denoted as W_best and b_best; where V > 1, and in this embodiment V is 500.
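The training procedure of steps 1_3 to 1_5 can be sketched as below; model, train_loader and optimizer are assumed to be defined elsewhere, and segmentation_loss is the loss sketch given above. This is an illustrative loop, not the exact training script of the invention:

import torch

def train(model, train_loader, optimizer, V=500, device="cuda"):
    # Repeat the forward/backward pass V times over the training set and keep the
    # parameters (weight vectors and bias terms) that produced the smallest loss.
    best_loss, best_state = float("inf"), None
    model.to(device)
    for epoch in range(V):
        for rgb, depth, one_hot_target in train_loader:
            rgb, depth = rgb.to(device), depth.to(device)
            one_hot_target = one_hot_target.to(device)
            pred = model(rgb, depth)                      # 41 semantic segmentation prediction maps
            loss = segmentation_loss(pred, one_hot_target)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
            if loss.item() < best_loss:                   # track W_best and b_best
                best_loss = loss.item()
                best_state = {k: v.detach().cpu().clone()
                              for k, v in model.state_dict().items()}
    return best_state, best_loss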
The test stage process comprises the following specific steps:
step 2_ 1: separating P color map and depth map from original data set as test set, where P is positive integer, and taking 699 as P in the invention to make { RGBp(i',j'),Depthp(i ', j') } represents an indoor scene image to be subjected to semantic segmentation in the test set, and P is more than or equal to 1 and less than or equal to P; wherein i ' is not less than 1 and not more than W ', j ' is not less than 1 and not more than H ', and W ' represents { RGB ≦p(i',j'),Depthp(i ', j ') } widths of color and depth maps, H ' denotes { RGB }p(i',j'),Depthp(i ', j)' } height of color and depth maps, RGBp(i',j'),Depthp(i ', j') denote { RGB } respectivelyp(i',j'),Depthp(i ', j') } the pixel value of the pixel point with the coordinate position of the color map RGB and the Depth map Depth being (i, j).
Step 2_2: Input the color image RGB and the Depth image Depth in {RGB_p(i',j'), Depth_p(i',j')} into the trained improved fully convolutional neural network semantic segmentation model, and use the optimal convolution kernel weights W_best and the optimal bias term b_best obtained in training to predict the predicted semantic segmentation image corresponding to {RGB_p(i',j'), Depth_p(i',j')}; the pixel value of the pixel point whose coordinate position is (i',j') in the predicted semantic segmentation image is the prediction result for that pixel point.
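The test-stage prediction of step 2_2 can be sketched as follows; model and best_state are assumed to come from the training sketch above, and taking the argmax over the 41 channels is one common way to turn the prediction maps into a per-pixel label image:

import torch

@torch.no_grad()
def predict(model, best_state, rgb, depth, device="cuda"):
    model.load_state_dict(best_state)                # load W_best and b_best
    model.to(device).eval()
    pred = model(rgb.to(device), depth.to(device))   # B x 41 x H' x W'
    return pred.argmax(dim=1)                        # predicted class label of every pixel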
To further verify the effects of the present invention, the following experiments were conducted.
The architecture of the improved fully convolutional neural network semantic segmentation model of the invention is built with the deep learning library PyTorch 1.1.0 based on Python 3.6. The indoor scene image database NYUv2 test set (699 pairs of indoor scene images) is used to analyze the segmentation effect of the indoor scene images predicted by the method of the invention. Three common objective parameters for evaluating a semantic segmentation method are used as evaluation indexes, namely Class Accuracy (CA), Mean Pixel Accuracy (MPA) and the ratio of the intersection to the union of the segmented image and the label image (Mean Intersection over Union, MIoU), to evaluate the segmentation performance of the predicted semantic segmentation images. The higher the three index values, the better the model performance.
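These indexes can be computed from a confusion matrix accumulated over the test set, as sketched below; the exact definitions used in the experiments are not spelled out here, so common definitions (mean per-class accuracy, overall pixel accuracy and mean IoU) are assumed, and preds and labels are assumed to be NumPy integer label maps with values in [0, 41):

import numpy as np

def evaluate(preds, labels, num_classes=41):
    # Accumulate a confusion matrix over all test images.
    cm = np.zeros((num_classes, num_classes), dtype=np.int64)
    for p, l in zip(preds, labels):                   # p, l: H x W integer label maps
        idx = num_classes * l.reshape(-1) + p.reshape(-1)
        cm += np.bincount(idx, minlength=num_classes ** 2).reshape(num_classes, num_classes)
    tp = np.diag(cm).astype(np.float64)
    class_acc = tp / np.maximum(cm.sum(axis=1), 1)    # per-class accuracy
    pixel_acc = tp.sum() / max(cm.sum(), 1)           # overall pixel accuracy
    iou = tp / np.maximum(cm.sum(axis=1) + cm.sum(axis=0) - tp, 1)
    return class_acc.mean(), pixel_acc, iou.mean()    # CA, MPA, MIoU (assumed definitions)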
Each pair of indoor scene images in the indoor scene image database NYUv2 test set is predicted by the method of the invention to obtain the corresponding predicted semantic segmentation image; the class accuracy CA, the mean pixel accuracy MPA and the ratio MIoU of the intersection to the union of the segmented image and the label image, which reflect the semantic segmentation effect of the method of the invention, are listed in Table 1. As can be seen from the data in Table 1, the method of the invention achieves good results on the three indexes CA, MPA and MIoU, which indicates the effectiveness of the method of the invention.
TABLE 1 Evaluation results of the method of the invention on the NYUv2 test set
CA 61.01%
MPA 74.34%
MIoU 47.92%
Fig. 2a shows the 1st original indoor scene color RGB image, Fig. 2b shows the original Depth image of the same indoor scene, and Fig. 2c shows the predicted semantic segmentation image obtained by predicting the original indoor scene images shown in Fig. 2a and Fig. 2b with the method of the present invention; Fig. 3a shows the 2nd original indoor scene color RGB image, Fig. 3b shows the original Depth image of the same indoor scene, and Fig. 3c shows the predicted semantic segmentation image obtained by predicting the original indoor scene images shown in Fig. 3a and Fig. 3b with the method of the present invention; Fig. 4a shows the 3rd original indoor scene color RGB image, Fig. 4b shows the original Depth image of the same indoor scene, and Fig. 4c shows the predicted semantic segmentation image obtained by predicting the original indoor scene images shown in Fig. 4a and Fig. 4b with the method of the present invention; Fig. 5a shows the 4th original indoor scene color RGB image, Fig. 5b shows the original Depth image of the same indoor scene, and Fig. 5c shows the predicted semantic segmentation image obtained by predicting the original indoor scene images shown in Fig. 5a and Fig. 5b with the method of the present invention. Comparing Fig. 2a and Fig. 2b with Fig. 2c, Fig. 3a and Fig. 3b with Fig. 3c, Fig. 4a and Fig. 4b with Fig. 4c, and Fig. 5a and Fig. 5b with Fig. 5c, it can be seen that the predicted semantic segmentation images obtained by the method of the present invention have high segmentation accuracy.

Claims (9)

1. The indoor scene semantic segmentation method based on the improved full convolution neural network is characterized by comprising the following steps of:
step 1: selecting Q pairs of original indoor scene images and corresponding real semantic segmentation images, and forming a training set by all the original indoor scene images and the corresponding real semantic segmentation images; each pair of original indoor scene images comprises an original indoor scene color image and an original indoor scene depth image, and the real semantic segmentation images in the training set are processed into 41 independent thermal coding images by adopting an independent thermal coding technology;
step 2: constructing a convolutional neural network classification training model: the convolutional neural network classification training model comprises an input layer, a hidden layer and an output layer; the input layer comprises a color image input layer and a depth image input layer; the hidden layer comprises a color image processing module and a depth image processing module; the color image processing module and the depth image processing module are symmetrical in structure and respectively comprise five neural network blocks, five feature re-extraction volume blocks and ten fusion layers; the hidden layer also comprises five block attention volume blocks, four upper sampling layers and two fusion layers;
step 3: inputting the training set into the convolutional neural network classification training model of step 2 for training, performing iterative training, each iteration obtaining 41 semantic segmentation predicted images corresponding to each pair of original indoor scene images, and calculating a loss function value between the set formed by the 41 semantic segmentation predicted images and the set formed by the 41 one-hot coded images corresponding to the real semantic segmentation images;
step 4: repeating step 3 for a total of V times to obtain Q multiplied by V loss function values; then finding out the minimum loss function value from the Q multiplied by V loss function values, and taking the weight vector and the bias item corresponding to the minimum loss function value as the optimal weight vector and the optimal bias item of the convolutional neural network classification training model, thereby completing the training of the convolutional neural network classification training model;
and step 5: performing prediction processing on an indoor scene image to be predicted by using the convolutional neural network classification training model obtained after training, and outputting a corresponding predicted semantic segmentation image, thereby realizing indoor scene image semantic segmentation.
2. The indoor scene semantic segmentation method based on the improved full convolution neural network as claimed in claim 1, wherein: the step 2) is specifically as follows:
the color image input layer and the depth image input layer are respectively input into a first neural network block in the color image processing module and the depth image processing module;
the color image processing module and the depth image processing module have the same structure, and specifically comprise:
one output of the first neural network block is input into a first fusion layer through a first feature re-extraction convolution block, and the other output of the first neural network block is input into a first fusion layer; one output of the second neural network block is input into a third fusion layer through a second feature re-extraction convolution block, and the other output of the second neural network block is input into the third fusion layer; one output of the third neural network block is input into a fifth fusion layer through a third feature re-extraction convolution block, and the other output of the third neural network block is input into the fifth fusion layer; one output of the fourth neural network block is input into the seventh fusion layer through the fourth feature re-extraction convolution block, and the other output of the fourth neural network block is input into the seventh fusion layer; one output of the fifth neural network block is input into a ninth fusion layer through a fifth feature re-extraction convolution block, and the other output of the fifth neural network block is input into the ninth fusion layer; the two inputs of each fusion layer are fused in an element-by-element addition mode;
the output of the first fusion layer is respectively input into a first block attention volume block and a corresponding second fusion layer, the output of the third fusion layer is respectively input into a second block attention volume block and a corresponding fourth fusion layer, the output of the fifth fusion layer is respectively input into a third block attention volume block and a corresponding sixth fusion layer, the output of the seventh fusion layer is respectively input into a fourth block attention volume block and a corresponding eighth fusion layer, and the output of the ninth fusion layer is respectively input into a fifth block attention volume block and a corresponding tenth fusion layer;
two outputs of the first block attention volume block are respectively input into a second fusion layer of the color image processing module and a second fusion layer of the depth image processing module, two outputs of the second block attention volume block are respectively input into a fourth fusion layer of the color image processing module and a fourth fusion layer of the depth image processing module, two outputs of the third block attention volume block are respectively input into a sixth fusion layer of the color image processing module and a sixth fusion layer of the depth image processing module, two outputs of the fourth block attention volume block are respectively input into an eighth fusion layer of the color image processing module and a eighth fusion layer of the depth image processing module, and two outputs of the fifth block attention volume block are respectively input into a tenth fusion layer of the color image processing module and the tenth fusion layer of the depth image processing module;
the two inputs of the second fusion layer are fused in an element-by-element addition mode and then respectively input into an eleventh fusion layer and a corresponding second neural network block, the two inputs of the fourth fusion layer are fused in an element-by-element addition mode and then respectively input into a first up-sampling layer and a corresponding third neural network block, the two inputs of the sixth fusion layer are fused in an element-by-element addition mode and then respectively input into a second up-sampling layer and a corresponding fourth neural network block, and the two inputs of the eighth fusion layer are fused in an element-by-element addition mode and then respectively input into a third up-sampling layer and a corresponding fifth neural network block; the output of the tenth fusion layer is input into the fourth upsampling layer;
the two inputs of the eleventh fusion layer, the first up-sampling layer, the second up-sampling layer, the third up-sampling layer and the fourth up-sampling layer are fused in an element-by-element addition mode and then are all input into the twelfth fusion layer;
and all the inputs of the twelfth fusion layer are connected in a concatenate mode and then output through the output layer, and the output layer mainly comprises a convolution layer and a fifth upper sampling layer which are sequentially connected.
3. The indoor scene semantic segmentation method based on the improved full convolution neural network as claimed in claim 1, wherein: the five neural network blocks adopt a MobileNet V2 network structure, the first neural network block adopts 1-4 layers in MobileNet V2, the second neural network block adopts 5-7 layers in MobileNet V2, the third neural network block adopts 8-11 layers in MobileNet V2, the fourth neural network block adopts 12-14 layers in MobileNet V2, and the fifth neural network block adopts 15-17 layers in MobileNet V2.
4. The indoor scene semantic segmentation method based on the improved full convolution neural network as claimed in claim 2, wherein: each feature re-extraction convolution block consists of four re-extraction modules which are connected in sequence, and each re-extraction module comprises a convolution layer, a normalization layer and an activation layer which are connected in sequence;
the activation mode of all the activation layers is ReLU6; the stride of all the convolution layers is 1; the number of convolution kernels of the convolution layer in each re-extraction module is the same as the normalization parameter of its normalization layer;
the numbers of convolution kernels of the convolution layers in the first re-extraction module, the third re-extraction module and the fourth re-extraction module are the same, and the number of convolution kernels of the convolution layer in the second re-extraction module is half of that of the convolution layer in the first re-extraction module;
the convolution kernel sizes of the convolution layers in the first re-extraction module and the third re-extraction module are both 3x3, and the convolution kernel sizes of the convolution layers in the second re-extraction module and the fourth re-extraction module are both 1x1; the dilation factor of the convolution layer in the first re-extraction module is 1, and the dilation factor of the convolution layer in the third re-extraction module is 2; the zero padding parameter of the convolution layer in the first re-extraction module is 1, the zero padding parameter of the convolution layer in the third re-extraction module is 2, and the zero padding parameters of the convolution layers in the second re-extraction module and the fourth re-extraction module are both 0.
5. The indoor scene semantic segmentation method based on the improved full convolution neural network as claimed in claim 4, wherein: the number of convolution kernels of the convolution layer of the first re-extraction module in the first feature re-extraction convolution block is 24, the number of convolution kernels of the convolution layer of the first re-extraction module in the second feature re-extraction convolution block is 32, the number of convolution kernels of the convolution layer of the first re-extraction module in the third feature re-extraction convolution block is 64, the number of convolution kernels of the convolution layer of the first re-extraction module in the fourth feature re-extraction convolution block is 96, and the number of convolution kernels of the convolution layer of the first re-extraction module in the fifth feature re-extraction convolution block is 160;
the grouping parameters of the four sequentially arranged convolution layers in the first feature re-extraction convolution block are respectively 24, 12, 24 and 24, the grouping parameters of the four sequentially arranged convolution layers in the second feature re-extraction convolution block are respectively 32, 16, 16 and 32, the grouping parameters of the four sequentially arranged convolution layers in the third feature re-extraction convolution block are respectively 64, 32, 32 and 64, the grouping parameters of the four sequentially arranged convolution layers in the fourth feature re-extraction convolution block are respectively 64, 48, 48 and 96, and the grouping parameters of the four sequentially arranged convolution layers in the fifth feature re-extraction convolution block are respectively 96, 80, 80 and 160.
6. The indoor scene semantic segmentation method based on the improved full convolution neural network as claimed in claim 2, wherein: each of the block attention convolution blocks comprises a block layer, a channel attention layer and a space size attention layer, wherein the input of the block attention convolution block is input into the channel attention layer and the space size attention layer after passing through the block layer, the input of the block attention convolution block is multiplied by the output of the channel attention layer and the output of the space size attention layer, and the multiplied results are added to be used as the output of the block attention convolution block.
7. The method for indoor scene semantic segmentation based on the improved full convolution neural network as claimed in claim 6, wherein: the block layer adopts the split function provided by PyTorch, and the parameter is half of the number of channels of the input feature map of the block attention convolution block;
the channel attention layer comprises an adaptive maximum pooling layer, a channel attention first convolution layer, a channel attention second convolution layer and a channel attention activation layer which are sequentially connected; the maximum pooling parameter of the adaptive maximum pooling layer is 1; the convolution kernels of the channel attention first convolution layer and the channel attention second convolution layer are both 1x1 in size, the step length is 1, the bias terms are False, the grouping parameters are the same, and the number of convolution kernels in the channel attention first convolution layer is half of that of the channel attention second convolution layer;
the number of convolution kernels of the channel attention first convolution layer of the channel attention layer in the first block attention convolution block is 12, and the grouping parameter is 12; the number of convolution kernels of the channel attention first convolution layer of the channel attention layer in the second block attention convolution block is 16, and the grouping parameter is 16; the number of convolution kernels of the first convolution layer of the channel attention layer in the third block attention convolution block is 32, and the grouping parameter is 32; the number of convolution kernels of the first convolution layer of the channel attention layer in the fourth block attention convolution block is 48, and the grouping parameter is 48; the number of convolution kernels of the first convolution layer of the channel attention layer in the fifth block attention convolution block is 80, and the grouping parameter is 80;
the spatial dimension attention layer comprises a per-channel maximization layer, a spatial dimension attention convolution layer and a spatial dimension attention activation layer which are sequentially connected; adopting a max function carried by the pytorch according to a channel maximization layer; the convolution kernel size of the spatial dimension attention convolution layer is 3 multiplied by 3, the number of the convolution kernels is 1, the expansion factor is 2, the zero padding parameter is 2, the grouping parameter is 1, and the bias term is False;
the channel attention layer and the space size attention layer adopt an activation mode which is a Sigmoid function.
8. The indoor scene semantic segmentation method based on the improved full convolution neural network as claimed in claim 2, wherein: the upsampling layers adopt the UpsamplingBilinear2d function provided by PyTorch, the function parameter of the first upsampling layer is 2, and the function parameters of the second upsampling layer, the third upsampling layer, the fourth upsampling layer and the fifth upsampling layer are 4; the convolution kernel size of the convolution layer in the output layer is 1x1, the number of convolution kernels is 41, the stride is 1, and the bias term is False.
9. The indoor scene semantic segmentation method based on the improved full convolution neural network as claimed in claim 4, wherein: the input end of the color image input layer receives an indoor scene color image, the input end of the depth image input layer receives an indoor scene depth image, and the output of the output layer is 41 semantic segmentation predicted images corresponding to the indoor scene image input by the input layer.
CN202011559942.5A 2020-12-25 2020-12-25 Indoor scene semantic segmentation method based on improved full convolution neural network Pending CN112598675A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011559942.5A CN112598675A (en) 2020-12-25 2020-12-25 Indoor scene semantic segmentation method based on improved full convolution neural network

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011559942.5A CN112598675A (en) 2020-12-25 2020-12-25 Indoor scene semantic segmentation method based on improved full convolution neural network

Publications (1)

Publication Number Publication Date
CN112598675A true CN112598675A (en) 2021-04-02

Family

ID=75202440

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011559942.5A Pending CN112598675A (en) 2020-12-25 2020-12-25 Indoor scene semantic segmentation method based on improved full convolution neural network

Country Status (1)

Country Link
CN (1) CN112598675A (en)


Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107403430A (en) * 2017-06-15 2017-11-28 中山大学 A kind of RGBD image, semantics dividing method
CN107563388A (en) * 2017-09-18 2018-01-09 东北大学 A kind of convolutional neural networks object identification method based on depth information pre-segmentation
KR101970488B1 (en) * 2017-12-28 2019-04-19 포항공과대학교 산학협력단 RGB-D Multi-layer Residual Feature Fusion Network for Indoor Semantic Segmentation
CN110619638A (en) * 2019-08-22 2019-12-27 浙江科技学院 Multi-mode fusion significance detection method based on convolution block attention module
CN110782462A (en) * 2019-10-30 2020-02-11 浙江科技学院 Semantic segmentation method based on double-flow feature fusion
CN111563507A (en) * 2020-04-14 2020-08-21 浙江科技学院 Indoor scene semantic segmentation method based on convolutional neural network

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
王孙平 (Wang Sunping); 陈世峰 (Chen Shifeng): "Semantic segmentation method of a convolutional neural network fused with depth images", 集成技术 (Journal of Integration Technology), no. 05, 7 June 2018 (2018-06-07) *
章程 (Zhang Cheng): "Research on an RGB-D semantic segmentation model combining deep learning and transfer learning", 中国硕士论文数据库 信息科技辑 (China Master's Theses Database, Information Science and Technology), 15 August 2019 (2019-08-15) *

Cited By (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113192073A (en) * 2021-04-06 2021-07-30 浙江科技学院 Clothing semantic segmentation method based on cross fusion network
CN113191213A (en) * 2021-04-12 2021-07-30 桂林电子科技大学 High-resolution remote sensing image newly-added building detection method
CN113469064A (en) * 2021-07-05 2021-10-01 安徽大学 Method and system for identifying corn leaf disease images in complex environment
CN113469064B (en) * 2021-07-05 2024-03-29 安徽大学 Identification method and system for corn leaf disease image in complex environment
CN114723951A (en) * 2022-06-08 2022-07-08 成都信息工程大学 Method for RGB-D image segmentation
CN115601550A (en) * 2022-12-13 2023-01-13 深圳思谋信息科技有限公司(Cn) Model determination method, model determination device, computer equipment and computer-readable storage medium
CN116757988A (en) * 2023-08-17 2023-09-15 齐鲁工业大学(山东省科学院) Infrared and visible light image fusion method based on semantic enrichment and segmentation tasks
CN116757988B (en) * 2023-08-17 2023-12-22 齐鲁工业大学(山东省科学院) Infrared and visible light image fusion method based on semantic enrichment and segmentation tasks
CN118172559A (en) * 2024-05-15 2024-06-11 齐鲁工业大学(山东省科学院) Image fusion method based on semantic segmentation and extraction of edge features and gradient features
CN118172559B (en) * 2024-05-15 2024-07-23 齐鲁工业大学(山东省科学院) Image fusion method based on semantic segmentation and extraction of edge features and gradient features

Similar Documents

Publication Publication Date Title
CN112598675A (en) Indoor scene semantic segmentation method based on improved full convolution neural network
CN110210551B (en) Visual target tracking method based on adaptive subject sensitivity
CN112766087A (en) Optical remote sensing image ship detection method based on knowledge distillation
CN107679462A (en) A kind of depth multiple features fusion sorting technique based on small echo
CN107832835A (en) The light weight method and device of a kind of convolutional neural networks
CN113011329A (en) Pyramid network based on multi-scale features and dense crowd counting method
CN113034506B (en) Remote sensing image semantic segmentation method and device, computer equipment and storage medium
CN110705566B (en) Multi-mode fusion significance detection method based on spatial pyramid pool
CN117237559B (en) Digital twin city-oriented three-dimensional model data intelligent analysis method and system
CN110782458B (en) Object image 3D semantic prediction segmentation method of asymmetric coding network
CN115035508B (en) Theme-guided transducer-based remote sensing image subtitle generation method
CN109508639B (en) Road scene semantic segmentation method based on multi-scale porous convolutional neural network
CN112686276A (en) Flame detection method based on improved RetinaNet network
CN113192073A (en) Clothing semantic segmentation method based on cross fusion network
CN109862350A (en) No-reference video quality evaluating method based on time-space domain feature extraction
CN113781410B (en) Medical image segmentation method and system based on MEDU-Net+network
CN113313721B (en) Real-time semantic segmentation method based on multi-scale structure
CN113255574B (en) Urban street semantic segmentation method and automatic driving method
Lin et al. Full-scale selective transformer for semantic segmentation
CN110738645A (en) 3D image quality detection method based on convolutional neural network
Wang et al. Dynamic-boosting attention for self-supervised video representation learning
CN110287763A (en) A kind of candidate frame ratio optimization method towards ship seakeeping application
CN113743188B (en) Feature fusion-based internet video low-custom behavior detection method
Hasan et al. E-DARTS: Enhanced differentiable architecture search for acoustic scene classification
Guo et al. Lae-net: Light and efficient network for compressed video action recognition

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination