CN116468889B - Panorama segmentation method and system based on multi-branch feature extraction - Google Patents

Panorama segmentation method and system based on multi-branch feature extraction

Info

Publication number
CN116468889B
CN116468889B (application CN202310356730.4A)
Authority
CN
China
Prior art keywords
convolution
feature map
module
layer
channel
Prior art date
Legal status
Active
Application number
CN202310356730.4A
Other languages
Chinese (zh)
Other versions
CN116468889A (en)
Inventor
孙庆伟
晁建刚
林万洪
陈炜
何宁
许振瑛
Current Assignee
Peoples Liberation Army Strategic Support Force Aerospace Engineering University
China Astronaut Research and Training Center
Original Assignee
Peoples Liberation Army Strategic Support Force Aerospace Engineering University
China Astronaut Research and Training Center
Priority date
Filing date
Publication date
Application filed by Peoples Liberation Army Strategic Support Force Aerospace Engineering University and China Astronaut Research and Training Center
Priority to CN202310356730.4A
Publication of CN116468889A
Application granted
Publication of CN116468889B
Status: Active
Anticipated expiration

Classifications

    • G06V 10/267: Segmentation of patterns in the image field by performing operations on regions, e.g. growing, shrinking or watersheds
    • G06N 3/0464: Convolutional networks [CNN, ConvNet]
    • G06V 10/454: Local feature extraction with biologically inspired filters integrated into a hierarchical structure, e.g. convolutional neural networks [CNN]
    • G06V 10/82: Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • G06V 20/70: Labelling scene content, e.g. deriving syntactic or semantic representations

Abstract

The application belongs to the technical field of image processing, and particularly provides a panoramic segmentation method and a panoramic segmentation system based on multi-branch feature extraction, wherein the method comprises the following steps: preprocessing an RGB image to obtain an initial image; inputting the initial image into a pre-constructed panoramic segmentation network to obtain a panoramic segmentation result of the initial image; the pre-constructed panoramic segmentation network comprises: a backbone network, a detail extraction branch network, an instance positioning branch network, a channel attention branch network, a feature aggregation branch network, a feature encoding network, and a post-processing module. According to the technical scheme provided by the application, different foreground objects and backgrounds can be accurately identified, and the edges of the objects can be accurately segmented; and rich receptive fields and spatial information are extracted by fusing high-dimensional features and low-dimensional features of the image, so that the overall accuracy of panoramic segmentation is improved.

Description

Panorama segmentation method and system based on multi-branch feature extraction
Technical Field
The application belongs to the technical field of image processing, and particularly provides a panoramic segmentation method and a panoramic segmentation system based on multi-branch feature extraction.
Background
Panoramic segmentation is a complete analysis of an image: the image is segmented into foreground and background, different numbers are assigned to different instances of foreground objects, every pixel in the image receives an independent semantic label and an independent instance number, and pixel categories do not overlap, which facilitates downstream tasks.
Existing panoramic segmentation frameworks fall into two types: two-branch structures that process the foreground and the background separately, and end-to-end structures that process the foreground and the background simultaneously. In the two-branch structure, one branch is responsible for predicting the instances of the foreground part, which is equivalent to instance segmentation; the other branch is responsible for distinguishing the background from the foreground, which is equivalent to semantic segmentation, and the processing results of the two branches are fused to obtain the panoramic segmentation result. However, a panoramic segmentation network with a two-branch structure needs to perform semantic segmentation and instance segmentation separately, and a dedicated fusion scheme must be designed for the processing results of the two branches, so the network structure is redundant and the panoramic segmentation accuracy is not high. The end-to-end structure processes the background and the foreground uniformly, eliminating the step of fusing the instance segmentation and semantic segmentation results. However, because the end-to-end panoramic segmentation network processes the foreground and the background simultaneously, it cannot integrate low-dimensional features with high-dimensional features, and its segmentation of object details and edges is unclear.
Disclosure of Invention
In order to overcome the problems existing in the related art to at least a certain extent, the invention provides a panorama segmentation method and a panorama segmentation system based on multi-branch feature extraction.
In a first aspect, there is provided a panorama segmentation method based on multi-branch feature extraction, the method comprising:
preprocessing an RGB image to obtain an initial image;
inputting the initial image into a pre-constructed panoramic segmentation network to obtain a panoramic segmentation result of the initial image;
the pre-constructed panoramic segmentation network comprises: a backbone network, a detail extraction branch network, an instance positioning branch network, a channel attention branch network, a feature aggregation branch network, a feature encoding network, and a post-processing module.
Preferably, the preprocessing the RGB image to obtain an initial image includes:
and normalizing the resolution of the RGB image to 512 x 1024 to obtain the initial image.
Preferably, the inputting the initial image into a pre-constructed panoramic segmentation network to obtain a panoramic segmentation result of the initial image includes:
inputting the initial image into the backbone network for high-dimensional feature extraction, and generating a first feature map;
Inputting the initial image into the detail extraction branch network to perform low-dimensional feature extraction, and generating a second feature map;
inputting the first feature map to the instance positioning branch network to position a foreground instance and a background region, and generating a third feature map;
inputting the first feature map to the channel attention branch network, distributing weights for all channels of the first feature map, and generating a fourth feature map;
inputting the second feature map and the fourth feature map into the feature aggregation branch network for fusion to generate a fifth feature map;
inputting the fifth feature map to the feature coding network for coding to generate a sixth feature map;
adding the third feature map and the sixth feature map pixel by pixel to generate a seventh feature map;
and inputting the seventh feature map to the post-processing module so as to fuse all channels in the seventh feature map and obtain a panoramic segmentation result of the initial image.
Preferably, the backbone network includes: a FPN network based on ResNet, a convolution module C1, a convolution module C2, an up-sampling module U1, two up-sampling modules U2 and three up-sampling modules U3;
The convolution module C1 and the convolution module C2 each include: a convolution layer, a batch normalization layer, and an activation function;
the up-sampling module U1, the up-sampling module U2 and the up-sampling module U3 all include: a convolution layer, a batch normalization layer, an activation function, and a 2× upsampling layer;
the convolution kernels of these convolution layers are 3×3, and their input and output channels are both 256.
Preferably, the detail extraction branch network includes: convolution module C3, convolution module C4, convolution module C5, convolution module C6, convolution module C7, convolution module C8, convolution module C9 and convolution module C10;
the convolution modules C3, C4, C5, C6, C7, C8, C9 and C10 include: a convolution layer, a batch normalization layer, and an activation function;
the convolution kernel, input channel, output channel, step size and output resolution of the convolution layer of the convolution module C3 are respectively 3×3, 3, 64, 2 and 1/2 of the resolution of the initial image;
the convolution kernel, input channel, output channel and step size of the convolution layer of the convolution module C4 are respectively 3×3, 64, 64 and 1;
the convolution kernel, input channel, output channel, step size and output resolution of the convolution layer of the convolution module C5 are respectively 3×3, 64, 64, 2 and 1/4 of the resolution of the initial image;
the convolution kernel, input channel, output channel and step size of the convolution layer of the convolution module C6 are respectively 3×3, 64, 64 and 1;
the convolution kernel, input channel, output channel and step size of the convolution layer of the convolution module C7 are respectively 3×3, 64, 64 and 1;
the convolution kernel, input channel, output channel, step size and output resolution of the convolution layer of the convolution module C8 are respectively 3×3, 64, 128, 2 and 1/8 of the resolution of the initial image;
the convolution kernel, input channel, output channel and step size of the convolution layer of the convolution module C9 are respectively 3×3, 128, 256 and 1;
the convolution kernel, input channel, output channel and step size of the convolution layer of the convolution module C10 are respectively 3×3, 256, 256 and 1.
Preferably, the example positioning branch network includes: convolution module C11, convolution module C12, convolution module C13, coordConv layer, convolution module C14, convolution module C15, and convolution module C16;
the convolution modules C11, C12, C13, C14, C15 and C16 include: a convolution layer, a batch normalization layer, and an activation function;
The convolution kernel, input channel, output channel and step size of the convolution layer of the convolution module C11 are respectively 3×3, 256, 256 and 1;
the convolution kernel, input channel, output channel and step size of the convolution layer of the convolution module C12 are respectively 3×3, 256, 256 and 1;
the convolution kernel, input channel, output channel and step size of the convolution layer of the convolution module C13 are respectively 3×3, 256, M+N and 1;
the convolution kernel, input channel, output channel and step size of the convolution layer of the convolution module C14 are respectively 3×3, 256, 256 and 1;
the convolution kernel, input channel, output channel and step size of the convolution layer of the convolution module C15 are respectively 3×3, 256, M+N and 1;
the convolution kernel, input channel, output channel and step size of the convolution layer of the convolution module C16 are respectively 3×3, M+N, M+N and 1.
Preferably, the channel attention branching network includes: global average pooling layer, full connection layer FC1, full connection layer FC2, and Sigmoid layer;
the input channel and the output channel of the full connection layer FC1 are 256 and 16 respectively;
the input channel and the output channel of the fully connected layer FC2 are 16 and 256, respectively.
Preferably, the feature aggregation branch network includes: channel-by-channel convolution DWC1, convolution module C17, convolution module C18, UP-sampling layer UP1, convolution module C19, sigmoid layer, channel-by-channel convolution DWC2, convolution module C20, sigmoid layer, UP-sampling layer UP2, and convolution module C21;
The convolution modules C17, C18, C19, C20 and C21 include: a convolution layer, a batch normalization layer, and an activation function;
the convolution kernel, step length, number of groups, input channel and output channel of the convolution layer of the channel-by-channel convolution DWC1 are respectively: 3×3, 1, 256, 256 and 256;
the convolution kernel, step length, number of groups, input channel and output channel of the convolution layer of the channel-by-channel convolution DWC2 are respectively: 3×3, 1, 256, 256 and 256;
the convolution kernel, step length, input channel and output channel of the convolution layer of the convolution module C17 are respectively: 1×1, 1, 256, and 256;
the convolution kernel, step length, input channel and output channel of the convolution layer of the convolution module C18 are respectively: 3×3, 1, 256, and 256;
the convolution kernel, step length, input channel and output channel of the convolution layer of the convolution module C19 are respectively: 3×3, 2, 256, and 256;
the convolution kernel, step length, input channel and output channel of the convolution layer of the convolution module C20 are respectively: 1×1, 1, 256, and 256;
the convolution kernel, step length, input channel and output channel of the convolution layer of the convolution module C21 are respectively: 3×3, 1, 256, and 256.
Preferably, the feature encoding network includes: a CoordConv layer, a convolution module C22, a convolution module C23 and a convolution module C24;
the convolution module C22, the convolution module C23, and the convolution module C24 each include: a convolution layer, a batch normalization layer, and an activation function;
the convolution kernel, step length, input channel and output channel of the convolution layer of the convolution module C22 are respectively: 3×3, 1, 256, and 128;
the convolution kernel, step length, input channel and output channel of the convolution layer of the convolution module C23 are respectively: 3×3, 1, 128, and 128;
the convolution kernel, step length, input channel and output channel of the convolution layer of the convolution module C24 are respectively: 3×3, 1, 128, and m+n.
Preferably, the inputting the initial image into the backbone network for high-dimensional feature extraction, generating a first feature map includes:
inputting the initial image to a FPN network based on ResNet for convolution processing to generate a feature map 11, a feature map 12, a feature map 13, a feature map 14 and a feature map 15; the resolutions of the feature images 11 to 15 are respectively 1/4, 1/8, 1/16, 1/32 and 1/64 of the resolution of the initial image, and the channel numbers of the feature images 11 to 15 are 256;
The feature map 11, the feature map 12, the feature map 13 and the feature map 14 are respectively input into a convolution module C1, an up-sampling module U1, two up-sampling modules U2 and three up-sampling modules U3 to unify the channel numbers and the resolutions of the feature map 11 to the feature map 14, so as to generate a feature map 11a, a feature map 12a, a feature map 13a and a feature map 14a; the resolution of the feature images 11 a-14 a is 1/4 of the resolution of the initial image, and the channel numbers of the feature images 11 a-14 a are 256;
adding the feature maps 11a through 14a pixel by pixel to generate a feature map 16;
inputting the feature map 16 to a convolution module C2 for feature coding to generate the first feature map; the resolution of the first feature map is 1/4 of the resolution of the initial image, and the channel number of the first feature map is 256.
Preferably, the inputting the initial image to the detail extraction branch network to perform low-dimensional feature extraction, and generating a second feature map includes:
inputting the initial image into a convolution module C3, a convolution module C4, a convolution module C5, a convolution module C6, a convolution module C7, a convolution module C8, a convolution module C9 and a convolution module C10 which are connected in sequence to perform low-dimensional feature extraction, and generating the second feature map; the resolution of the second feature map is 1/8 of the resolution of the initial image, and the channel number of the second feature map is 256.
Preferably, the inputting the first feature map into the instance positioning branch network to perform positioning of a foreground instance and a background region, and generating a third feature map includes:
inputting the first feature map to a convolution module C11, a convolution module C12 and a convolution module C13 which are connected in sequence to carry out convolution operation, and generating a position feature map; the channel number of the position feature map is M+N, M is the category number of the foreground of the initial image, and different examples in the same foreground category correspond to different example numbers; n is the number of categories of the background of the initial image;
inputting the first feature map to a CoordConv layer, a convolution module C14 and a convolution module C15 which are connected in sequence to carry out convolution operation, and generating a position weight map; the number of channels of the position weight map is M+N;
multiplying the position feature map and the position weight map pixel by pixel to generate a feature map 17;
inputting the feature map 17 to a convolution module C16 for encoding to generate the third feature map; and the channel number of the third characteristic diagram is M+N.
Preferably, the inputting the first feature map to the channel attention branching network assigns weights to the channels of the first feature map, and generates a fourth feature map, including:
Inputting the first feature map to a global average pooling layer, a full connection layer FC1, a full connection layer FC2 and a Sigmoid layer which are sequentially connected so as to compress and expand the number of channels of the first feature map and generate a feature map 18;
multiplying the first feature map and the feature map 18 pixel by pixel to generate the fourth feature map; the number of channels of the fourth feature map is 256, and the resolution of the fourth feature map is 1/4 of the resolution of the initial image.
Preferably, the inputting the second feature map and the fourth feature map into the feature aggregation branch network to perform fusion, to generate a fifth feature map, includes:
inputting the second feature map to a channel-by-channel convolution DWC1 and a convolution module C17 which are connected in sequence to carry out convolution operation, and generating a feature map 19; the resolution of the feature map 19 is 1/8 of the resolution of the initial image;
inputting the second feature map to a convolution module C18 and an UP-sampling layer UP1 which are connected in sequence to UP-sample the resolution, and generating a feature map 20; the resolution of the feature map 20 is 1/4 of the resolution of the initial image;
inputting the fourth feature map to a convolution module C19 and a Sigmoid layer which are connected in sequence to normalize the pixel value of the fourth feature map to be between 0 and 1, and generating a feature map 21; the resolution of the feature map 21 is 1/8 of the resolution of the initial image;
Inputting the fourth feature map to a channel-by-channel convolution DWC2, a convolution module C20 and a Sigmoid layer which are connected in sequence so as to normalize pixel values of the fourth feature map to be between [0,1] and generate a feature map 22; the resolution of the feature map 22 is 1/4 of the resolution of the initial image;
multiplying the feature map 19 and the feature map 21 pixel by pixel to generate a feature map 23; the resolution of the feature map 23 is 1/8 of the resolution of the initial image;
inputting the feature map 23 into an UP-sampling layer UP2 to UP-sample the resolution, and generating a feature map 25; the resolution of the feature map 25 is 1/4 of the resolution of the initial image;
multiplying the feature map 20 and the feature map 22 pixel by pixel to generate a feature map 24; the resolution of the feature map 24 is 1/4 of the resolution of the initial image;
adding the feature map 24 and the feature map 25 pixel by pixel to generate a feature map 26; the resolution of the feature map 26 is 1/4 of the resolution of the initial image;
inputting the feature map 26 to a convolution module C21 for encoding to generate the fifth feature map; the number of channels and the resolution of the fifth feature map are 256 and 1/4 of the resolution of the initial image, respectively.
Preferably, the inputting the fifth feature map to the feature encoding network to encode, to generate a sixth feature map includes:
inputting the fifth feature map into a CoordConv layer, a convolution module C22, a convolution module C23 and a convolution module C24 which are sequentially connected to perform convolution calculation, so as to generate the sixth feature map; the number of channels of the sixth feature map is M+N, the resolution of the sixth feature map is 1/4 of the resolution of the initial image, M is the number of categories of the foreground of the initial image, and N is the number of categories of the background of the initial image.
Preferably, the inputting the seventh feature map to the post-processing module, so as to fuse each channel in the seventh feature map, to obtain a panoramic segmentation result of the initial image, includes:
performing a 4× upsampling operation on the seventh feature map to obtain a feature map 27;
normalizing the pixel values corresponding to the pixel points in the feature map 27 to be between 0 and 1 by using a sigmoid function to obtain a feature map 28;
the pixel point category corresponding to the maximum pixel value of the same pixel point in each channel is taken as the final category of the pixel point, a characteristic diagram 29 is obtained, and the characteristic diagram 29 is the panoramic segmentation result of the initial image; the number of channels of the feature map 29 is 1;
Wherein the pixel point categories are M+N; m is the number of foreground classes of the initial image, and different examples in the same foreground class correspond to different example numbers; and N is the category number of the background of the initial image.
In a second aspect, there is provided a panorama segmentation system based on multi-branch feature extraction, the system comprising:
the preprocessing module is used for preprocessing the RGB image to obtain an initial image;
the panoramic segmentation module is used for inputting the initial image into a pre-constructed panoramic segmentation network to obtain a panoramic segmentation result of the initial image;
the pre-constructed panoramic segmentation network comprises: a backbone network, a detail extraction branch network, an instance positioning branch network, a channel attention branch network, a feature aggregation branch network, a feature encoding network, and a post-processing module.
The technical scheme provided by the invention has at least one or more of the following beneficial effects:
the invention provides a panorama segmentation method and a system based on multi-branch feature extraction, comprising the following steps: preprocessing an RGB image to obtain an initial image; inputting the initial image into a pre-constructed panoramic segmentation network to obtain a panoramic segmentation result of the initial image; the pre-constructed panoramic segmentation network comprises: a backbone network, a detail extraction branch network, an instance positioning branch network, a channel attention branch network, a feature aggregation branch network, a feature encoding network, and a post-processing module. The panoramic segmentation network provided by the invention not only can accurately identify different foreground objects and backgrounds, but also can accurately segment the edges of the objects; and rich receptive fields and spatial information are extracted by fusing high-dimensional features and low-dimensional features of the image, so that the overall accuracy of panoramic segmentation is improved.
Drawings
In order to more clearly illustrate the embodiments of the invention or the technical solutions in the prior art, the drawings that are required in the embodiments or the description of the prior art will be briefly described, it being obvious that the drawings in the following description are only some embodiments of the invention, and that other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
FIG. 1 is a flowchart of a panorama segmentation method based on multi-branch feature extraction provided by an embodiment of the present invention;
fig. 2 is a schematic structural diagram of a panorama segmentation network according to an embodiment of the present invention;
fig. 3 is a schematic structural diagram of a backbone network according to an embodiment of the present invention;
fig. 4 is a schematic structural diagram of a detail extraction branch network according to an embodiment of the present invention;
FIG. 5 is a schematic diagram of an example positioning branch network provided by an embodiment of the present invention;
FIG. 6 is a schematic diagram of a channel attention branching network according to an embodiment of the present invention;
FIG. 7 is a schematic structural diagram of a feature aggregation branch network according to an embodiment of the present invention;
FIG. 8 is a schematic diagram of a feature encoding network according to an embodiment of the present invention;
Fig. 9 is a schematic diagram of feature map 27 provided by an embodiment of the present invention.
Fig. 10 is a schematic structural diagram of a panoramic segmentation system based on multi-branch feature extraction according to an embodiment of the present invention.
Detailed Description
The following describes the embodiments of the present invention in further detail with reference to the drawings.
For the purpose of making the objects, technical solutions and advantages of the embodiments of the present invention more apparent, the technical solutions of the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings in the embodiments of the present invention, and it is apparent that the described embodiments are some embodiments of the present invention, but not all embodiments of the present invention. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
As disclosed in the background art, panoramic segmentation is a complete analysis of an image: the image is segmented into foreground and background, and different numbers are assigned to different instances of foreground objects, so that every pixel in the image receives an independent semantic label and an independent instance number, pixel categories do not overlap, and downstream tasks are facilitated.
Existing panoramic segmentation frameworks fall into two types: two-branch structures that process the foreground and the background separately, and end-to-end structures that process the foreground and the background simultaneously. In the two-branch structure, one branch is responsible for predicting the instances of the foreground part, which is equivalent to instance segmentation; the other branch is responsible for distinguishing the background from the foreground, which is equivalent to semantic segmentation, and the processing results of the two branches are fused to obtain the panoramic segmentation result. However, a panoramic segmentation network with a two-branch structure needs to perform semantic segmentation and instance segmentation separately, and a dedicated fusion scheme must be designed for the processing results of the two branches, so the network structure is redundant and the panoramic segmentation accuracy is not high. The end-to-end structure processes the background and the foreground uniformly, eliminating the step of fusing the instance segmentation and semantic segmentation results. However, because the end-to-end panoramic segmentation network processes the foreground and the background simultaneously, it cannot integrate low-dimensional features with high-dimensional features, and its segmentation of object details and edges is unclear.
To improve on this situation, the present invention aims to solve the problems of low image segmentation accuracy and unclear segmentation at detail edges in the prior art.
The above-described scheme is explained in detail below.
Embodiment 1
The invention provides a panorama segmentation method based on multi-branch feature extraction, as shown in fig. 1, the method comprises the following steps:
step 101: preprocessing an RGB image to obtain an initial image;
step 102: inputting the initial image into a pre-constructed panoramic segmentation network to obtain a panoramic segmentation result of the initial image;
the pre-constructed panoramic segmentation network comprises: a backbone network, a detail extraction branch network, an instance positioning branch network, a channel attention branch network, a feature aggregation branch network, a feature encoding network, and a post-processing module.
Further, preprocessing the RGB image to obtain an initial image, including:
the resolution of the RGB image is normalized to 512 x 1024, resulting in the initial image.
It will be appreciated that, by unifying the resolution of the RGB image to 512 x 1024 before panoramic segmentation, the method can be applied to input images of various sizes.
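As an illustration, this preprocessing step can be sketched as follows; the sketch assumes a PyTorch-style tensor pipeline and is not prescribed by the method itself.

```python
import torch
import torch.nn.functional as F

def preprocess(rgb: torch.Tensor) -> torch.Tensor:
    """rgb: float tensor of shape (3, H, W) holding an RGB image.
    Returns the initial image with resolution normalized to 512 x 1024."""
    x = rgb.unsqueeze(0)                            # add a batch dimension
    x = F.interpolate(x, size=(512, 1024),          # normalize resolution to 512 x 1024
                      mode="bilinear", align_corners=False)
    return x                                        # shape (1, 3, 512, 1024)
```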
Further, referring to fig. 2, step 102 may be implemented by, but is not limited to, the following:
step 1021: inputting the initial image into a pre-constructed panoramic segmentation network to obtain a panoramic segmentation result of the initial image, wherein the panoramic segmentation result comprises the following steps:
step 1022: inputting the initial image into a backbone network for high-dimensional feature extraction, and generating a first feature map;
Step 1023: inputting the initial image into a detail extraction branch network to perform low-dimensional feature extraction, and generating a second feature map;
step 1024: inputting the first feature map into an example positioning branch network to position a foreground example and a background area, and generating a third feature map;
step 1025: inputting the first feature map to a channel attention branch network, distributing weights for all channels of the first feature map, and generating a fourth feature map;
step 1026: inputting the second feature map and the fourth feature map into a feature aggregation branch network for fusion to generate a fifth feature map;
step 1027: inputting the fifth feature map to a feature coding network for coding to generate a sixth feature map;
step 1028: adding the third feature map and the sixth feature map pixel by pixel to generate a seventh feature map;
step 1029: and inputting the seventh feature map to a post-processing module so as to fuse all channels in the seventh feature map and obtain a panoramic segmentation result of the initial image.
It should be noted that, compared with the prior art, the panoramic segmentation method based on multi-branch feature extraction provided by the invention has the advantages that the edge contour of the panoramic segmentation result is clearer, and the overall segmentation precision is higher; the panoramic segmentation network is composed of a trunk network, a detail extraction branch network, an example positioning branch network, a channel attention branch network, a feature aggregation branch network, a feature coding network and a post-processing module, and the foreground example and the background area are directly segmented in an end-to-end mode; the invention adopts multi-layer shallow channel convolution and few-layer deep channel convolution to extract high-dimensional features and low-dimensional features, thereby improving the positioning of a foreground instance and a background and the respective segmentation precision thereof, and further improving the overall segmentation precision of panoramic segmentation.
Further, the backbone network includes: a FPN network based on ResNet, a convolution module C1, a convolution module C2, an up-sampling module U1, two up-sampling modules U2 and three up-sampling modules U3;
the convolution modules C1 and C2 each include: a convolution layer, a batch normalization layer, and an activation function;
the upsampling module U1, the upsampling module U2, and the upsampling module U3 each include: a convolution layer, a batch normalization layer, an activation function, and a 2× upsampling layer;
the convolution kernel of the convolution layer is 3×3, and both the input channel and the output channel of the convolution layer are 256.
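For illustration, the convolution modules (convolution + batch normalization + activation) and the up-sampling modules (convolution + batch normalization + activation + 2× upsampling) can be sketched as below. This is an assumed PyTorch formulation; ReLU and bilinear upsampling are chosen here only because the patent does not name a specific activation function or upsampling mode.

```python
import torch.nn as nn

class ConvModule(nn.Module):
    """Convolution layer + batch normalization layer + activation function."""
    def __init__(self, in_ch, out_ch, k=3, stride=1):
        super().__init__()
        self.block = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, k, stride=stride, padding=k // 2, bias=False),
            nn.BatchNorm2d(out_ch),
            nn.ReLU(inplace=True),   # activation function (assumed to be ReLU)
        )

    def forward(self, x):
        return self.block(x)

class UpsampleModule(nn.Module):
    """Convolution + batch normalization + activation + 2x upsampling (U1/U2/U3)."""
    def __init__(self, channels=256):
        super().__init__()
        self.conv = ConvModule(channels, channels, k=3)
        self.up = nn.Upsample(scale_factor=2, mode="bilinear", align_corners=False)

    def forward(self, x):
        return self.up(self.conv(x))
```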
Further, the detail extraction branch network includes: convolution module C3, convolution module C4, convolution module C5, convolution module C6, convolution module C7, convolution module C8, convolution module C9 and convolution module C10;
convolution module C3, convolution module C4, convolution module C5, convolution module C6, convolution module C7, convolution module C8, convolution module C9, and convolution module C10 each include: a convolution layer, a batch normalization layer, and an activation function;
the convolution kernel, input channel, output channel, step size and output resolution of the convolution layer of convolution module C3 are respectively 3×3, 3, 64, 2 and 1/2 of the resolution of the initial image;
the convolution kernel, input channel, output channel and step size of the convolution layer of convolution module C4 are respectively 3×3, 64, 64 and 1;
the convolution kernel, input channel, output channel, step size and output resolution of the convolution layer of convolution module C5 are respectively 3×3, 64, 64, 2 and 1/4 of the resolution of the initial image;
the convolution kernel, input channel, output channel and step size of the convolution layer of convolution module C6 are respectively 3×3, 64, 64 and 1;
the convolution kernel, input channel, output channel and step size of the convolution layer of convolution module C7 are respectively 3×3, 64, 64 and 1;
the convolution kernel, input channel, output channel, step size and output resolution of the convolution layer of convolution module C8 are respectively 3×3, 64, 128, 2 and 1/8 of the resolution of the initial image;
the convolution kernel, input channel, output channel and step size of the convolution layer of convolution module C9 are respectively 3×3, 128, 256 and 1;
the convolution kernel, input channel, output channel and step size of the convolution layer of convolution module C10 are respectively 3×3, 256, 256 and 1.
Further, the example positioning branch network includes: convolution module C11, convolution module C12, convolution module C13, coordConv layer, convolution module C14, convolution module C15, and convolution module C16;
Convolution modules C11, C12, C13, C14, C15, and C16 each include: a convolution layer, a batch normalization layer, and an activation function;
the convolution kernel, input channel, output channel and step size of the convolution layer of the convolution module C11 are respectively 3×3, 256, 256 and 1;
the convolution kernel, input channel, output channel and step size of the convolution layer of the convolution module C12 are respectively 3×3, 256, 256 and 1;
the convolution kernel, input channel, output channel and step size of the convolution layer of the convolution module C13 are respectively 3×3, 256, M+N and 1;
the convolution kernel, input channel, output channel and step size of the convolution layer of the convolution module C14 are respectively 3×3, 256, 256 and 1;
the convolution kernel, input channel, output channel and step size of the convolution layer of the convolution module C15 are respectively 3×3, 256, M+N and 1;
the convolution kernel, input channel, output channel and step size of the convolution layer of the convolution module C16 are respectively 3×3, M+N, M+N and 1.
Further, the channel attention branching network includes: global average pooling layer, full connection layer FC1, full connection layer FC2, and Sigmoid layer;
the input channel and the output channel of the full connection layer FC1 are 256 and 16 respectively;
The input channels and output channels of the fully connected layer FC2 are 16 and 256, respectively.
Further, the feature aggregation branch network includes: channel-by-channel convolution DWC1, convolution module C17, convolution module C18, UP-sampling layer UP1, convolution module C19, sigmoid layer, channel-by-channel convolution DWC2, convolution module C20, sigmoid layer, UP-sampling layer UP2, and convolution module C21;
convolution modules C17, C18, C19, C20, and C21 each include: a convolution layer, a batch normalization layer, and an activation function;
the convolution kernel, step size, number of groups, input channel and output channel of the convolution layer of the channel-by-channel convolution DWC1 are respectively: 3×3, 1, 256, 256 and 256;
the convolution kernel, step size, number of groups, input channel and output channel of the convolution layer of the channel-by-channel convolution DWC2 are respectively: 3×3, 1, 256, 256 and 256;
the convolution kernel, step length, input channel and output channel of the convolution layer of the convolution module C17 are respectively: 1×1, 1, 256, and 256;
the convolution kernel, step length, input channel and output channel of the convolution layer of the convolution module C18 are respectively: 3×3, 1, 256, and 256;
the convolution kernel, step length, input channel and output channel of the convolution layer of the convolution module C19 are respectively: 3×3, 2, 256, and 256;
The convolution kernel, step length, input channel and output channel of the convolution layer of the convolution module C20 are respectively: 1×1, 1, 256, and 256;
the convolution kernel, step length, input channel and output channel of the convolution layer of the convolution module C21 are respectively: 3×3, 1, 256, and 256.
Further, the feature encoding network includes: a CoordConv layer, a convolution module C22, a convolution module C23 and a convolution module C24;
the convolution modules C22, C23, and C24 each include: a convolution layer, a batch normalization layer, and an activation function;
the convolution kernel, step length, input channel and output channel of the convolution layer of the convolution module C22 are respectively: 3×3, 1, 256, and 128;
the convolution kernel, step length, input channel and output channel of the convolution layer of the convolution module C23 are respectively: 3×3, 1, 128, and 128;
the convolution kernel, step length, input channel and output channel of the convolution layer of the convolution module C24 are respectively: 3×3, 1, 128, and m+n.
Further, as shown in fig. 3, inputting the initial image into the backbone network for high-dimensional feature extraction, generating a first feature map includes:
inputting the initial image into a FPN network based on ResNet for convolution processing to generate a feature map 11, a feature map 12, a feature map 13, a feature map 14 and a feature map 15; the resolutions of the feature images 11 to 15 are 1/4, 1/8, 1/16, 1/32, and 1/64 of the resolution of the initial image, and the channel numbers of the feature images 11 to 15 are 256;
Inputting the feature map 11, the feature map 12, the feature map 13 and the feature map 14 to a convolution module C1, an up-sampling module U1, two up-sampling modules U2 and three up-sampling modules U3 respectively to unify the channel numbers and the resolutions of the feature map 11 to the feature map 14 and generate a feature map 11a, a feature map 12a, a feature map 13a and a feature map 14a; the resolution of each of the feature images 11a to 14a is 1/4 of the resolution of the initial image, and the number of channels of each of the feature images 11a to 14a is 256;
adding the feature maps 11a through 14a pixel by pixel to generate a feature map 16;
inputting the feature map 16 to a convolution module C2 for feature coding to generate a first feature map; the resolution of the first feature map is 1/4 of the resolution of the initial image, and the number of channels of the first feature map is 256.
In some embodiments, the ResNet referred to herein can be, but is not limited to, ResNet50 or ResNet101, i.e., a ResNet50-based FPN network or a ResNet101-based FPN network.
It should be noted that the backbone network adopts multi-layer shallow channel convolution (namely, the ResNet layers (50 or 101 layers, etc.), the FPN layers (8 layers), the convolution module C1, the convolution module C2, the up-sampling module U1, the two up-sampling modules U2 and the three up-sampling modules U3) to extract the high-dimensional features of the initial image, so that the number of channels grows slowly, and the feature maps of the different stages and different sampling rates in the network are fused; in this way the spatial features of the image can be fully extracted and the accuracy of pixel classification is improved.
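A sketch of this fusion step, reusing the ConvModule and UpsampleModule classes above and assuming the ResNet-based FPN is supplied externally (for example from torchvision), could look as follows; it is an illustrative reading of the data flow, not the original implementation:

```python
import torch.nn as nn

class Backbone(nn.Module):
    """Fuse FPN levels (feature maps 11-14) into the first feature map (256 ch, 1/4 resolution)."""
    def __init__(self, fpn):
        super().__init__()
        self.fpn = fpn                                   # assumed to return feature maps 11..15
        self.c1 = ConvModule(256, 256)                   # feature map 11: already 1/4 resolution
        self.u1 = UpsampleModule()                       # feature map 12: 1/8 -> 1/4
        self.u2 = nn.Sequential(UpsampleModule(), UpsampleModule())                    # 1/16 -> 1/4
        self.u3 = nn.Sequential(UpsampleModule(), UpsampleModule(), UpsampleModule())  # 1/32 -> 1/4
        self.c2 = ConvModule(256, 256)                   # feature encoding of feature map 16

    def forward(self, x):
        f11, f12, f13, f14, _ = self.fpn(x)              # the 1/64 level (feature map 15) is not fused
        f16 = self.c1(f11) + self.u1(f12) + self.u2(f13) + self.u3(f14)  # pixel-wise addition
        return self.c2(f16)                              # first feature map
```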
Further, as shown in fig. 4, inputting the initial image into a detail extraction branch network to perform low-dimensional feature extraction, and generating a second feature map includes:
inputting the initial image into a convolution module C3, a convolution module C4, a convolution module C5, a convolution module C6, a convolution module C7, a convolution module C8, a convolution module C9 and a convolution module C10 which are connected in sequence to perform low-dimensional feature extraction, and generating a second feature map; the resolution of the second feature map is 1/8 of the resolution of the initial image, and the number of channels of the second feature map is 256.
It should be noted that the detail extraction branch only includes 8 convolution layers (namely, convolution modules C3-C10), the number of channels is rapidly extended from 3 dimensions of the input image to 256 dimensions, and edge details of objects in the image can be fully extracted through the network structure of the deep channels with fewer layers, so that the panoramic segmentation effect is improved.
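An illustrative sketch of the detail extraction branch, again using the assumed ConvModule above, is:

```python
import torch.nn as nn

class DetailBranch(nn.Module):
    """Eight convolution modules (C3-C10): channels expand quickly from 3 to 256,
    while three stride-2 convolutions bring the resolution to 1/8 of the input."""
    def __init__(self):
        super().__init__()
        self.layers = nn.Sequential(
            ConvModule(3,   64, stride=2),   # C3: 1/2 resolution
            ConvModule(64,  64),             # C4
            ConvModule(64,  64, stride=2),   # C5: 1/4 resolution
            ConvModule(64,  64),             # C6
            ConvModule(64,  64),             # C7
            ConvModule(64, 128, stride=2),   # C8: 1/8 resolution
            ConvModule(128, 256),            # C9
            ConvModule(256, 256),            # C10
        )

    def forward(self, x):
        return self.layers(x)                # second feature map: 256 channels, 1/8 resolution
```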
Further, as shown in fig. 5, inputting the first feature map into an instance positioning branch network to perform positioning of a foreground instance and a background region, and generating a third feature map includes:
inputting the first feature map into a convolution module C11, a convolution module C12 and a convolution module C13 which are connected in sequence to carry out convolution operation, and generating a position feature map; the channel number of the position feature map is M+N, M is the category number of the foreground of the initial image, and different examples in the same foreground category correspond to different example numbers; n is the number of categories of the background of the initial image;
Inputting the first feature map to a CoordConv layer, a convolution module C14 and a convolution module C15 which are connected in sequence to carry out convolution operation, and generating a position weight map; the number of channels of the position weight graph is M+N;
multiplying the position feature map and the position weight map pixel by pixel to generate a feature map 17;
inputting the feature map 17 to a convolution module C16 for encoding to generate a third feature map; the number of channels of the third feature map is m+n.
It should be noted that the position feature map includes M+N channels, each of which predicts the position of the geometric center of the corresponding foreground instance object or background structure on the image; for the background structure, the N channels correspond to the N categories respectively; for the foreground structure, the M channels correspond to the M categories, each channel may further contain different individuals of the same category, and different individuals of the same category are assigned different instance numbers. For example, if the category corresponding to a channel is "table" and the channel contains two different tables (a white table and a red table), the two tables are assigned different instance numbers: table 1 and table 2.
The position weight map corresponds channel-by-channel to the position feature map; for a given channel, the position weight map predicts a weight coefficient for each pixel of the position feature map, representing the reliability of the position prediction in the position feature map.
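The two-path structure of the instance positioning branch can be sketched as below. The CoordConv layer is modelled here as appending two normalized coordinate channels before convolution, so the module after it is given 258 input channels in this sketch; that detail, like the rest of the code, is an assumption made for illustration only.

```python
import torch
import torch.nn as nn

class AddCoords(nn.Module):
    """CoordConv-style layer: append normalized x/y coordinate channels to the input."""
    def forward(self, x):
        b, _, h, w = x.shape
        ys = torch.linspace(-1, 1, h, device=x.device).view(1, 1, h, 1).expand(b, 1, h, w)
        xs = torch.linspace(-1, 1, w, device=x.device).view(1, 1, 1, w).expand(b, 1, h, w)
        return torch.cat([x, xs, ys], dim=1)

class InstanceLocalizationBranch(nn.Module):
    def __init__(self, m, n):                           # m foreground / n background categories
        super().__init__()
        self.position_features = nn.Sequential(         # C11 -> C12 -> C13
            ConvModule(256, 256), ConvModule(256, 256), ConvModule(256, m + n))
        self.position_weights = nn.Sequential(          # CoordConv -> C14 -> C15
            AddCoords(), ConvModule(258, 256), ConvModule(256, m + n))
        self.c16 = ConvModule(m + n, m + n)

    def forward(self, first_feature_map):
        feat = self.position_features(first_feature_map)    # position feature map, M+N channels
        weight = self.position_weights(first_feature_map)   # position weight map, M+N channels
        return self.c16(feat * weight)                      # third feature map
```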
Further, as shown in fig. 6, inputting the first feature map to the channel attention branching network, and assigning weights to the channels of the first feature map to generate a fourth feature map, including:
inputting the first feature map to a global average pooling layer (GPooling), a full connection layer FC1, a full connection layer FC2 and a Sigmoid layer which are sequentially connected so as to compress and expand the number of channels of the first feature map, and generating a feature map 18;
performing pixel-by-pixel multiplication on the first feature map and the feature map 18 to generate a fourth feature map; the number of channels of the fourth feature map is 256, and the resolution of the fourth feature map is 1/4 of the resolution of the initial image.
It should be noted that the Sigmoid is a normalization operation, i.e. it yields the weight of each channel. Compressing and expanding the number of channels of the first feature map works as follows: because the input channel and the output channel of the fully connected layer FC1 are 256 and 16 respectively, the number of channels is compressed after the output of the global average pooling layer enters the fully connected layer FC1; because the input channel and the output channel of the fully connected layer FC2 are 16 and 256 respectively, the number of channels is expanded after the output of the fully connected layer FC1 enters the fully connected layer FC2.
It can be understood that the channel attention branches do not change the structure of the input data, the weight (i.e. the importance degree) of each channel of the input data is obtained through an automatic learning mode of the neural network, and the useful characteristics are promoted and invalid characteristics are restrained by means of the weight, so that the characterization capability of the network is improved.
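This branch follows the familiar squeeze-and-excitation pattern; a minimal sketch under the same PyTorch assumptions is:

```python
import torch
import torch.nn as nn

class ChannelAttentionBranch(nn.Module):
    """Global average pooling -> FC1 (256->16) -> FC2 (16->256) -> Sigmoid,
    then per-channel rescaling of the first feature map."""
    def __init__(self, channels=256, reduced=16):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)
        self.fc1 = nn.Linear(channels, reduced)   # compress the number of channels
        self.fc2 = nn.Linear(reduced, channels)   # expand the number of channels

    def forward(self, x):
        b, c, _, _ = x.shape
        w = self.pool(x).view(b, c)
        w = torch.sigmoid(self.fc2(self.fc1(w)))  # feature map 18: one weight per channel
        return x * w.view(b, c, 1, 1)             # fourth feature map
```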
Further, as shown in fig. 7, inputting the second feature map and the fourth feature map into the feature aggregation branch network for fusion, and generating a fifth feature map includes:
inputting the second feature map to a channel-by-channel convolution (Depthwise Convolution) DWC1 and a convolution module C17 which are connected in sequence to carry out convolution operation, and generating a feature map 19; the resolution of the feature map 19 is 1/8 of the resolution of the original image;
inputting the second feature map to a convolution module C18 and an UP-sampling layer (UP 1) which are connected in sequence to perform UP-sampling on the resolution, and generating a feature map 20; the resolution of the feature map 20 is 1/4 of the resolution of the original image;
inputting the fourth feature map to a convolution module C19 and a Sigmoid layer which are connected in sequence to normalize the pixel value of the fourth feature map to be between [0,1] to generate a feature map 21; the resolution of the feature map 21 is 1/8 of the resolution of the original image;
inputting the fourth feature map to the channel-by-channel convolution DWC2, the convolution module C20 and the Sigmoid layer which are connected in sequence so as to normalize the pixel value of the fourth feature map to be between [0,1] and generate a feature map 22; the resolution of the feature map 22 is 1/4 of the resolution of the original image;
Performing pixel-by-pixel multiplication on the feature map 19 and the feature map 21 to generate a feature map 23; the resolution of the feature map 23 is 1/8 of the resolution of the original image;
inputting the feature map 23 into an UP-sampling layer UP2 to UP-sample the resolution, and generating a feature map 25; the resolution of the feature map 25 is 1/4 of the resolution of the original image;
multiplying the feature map 20 and the feature map 22 pixel by pixel to generate a feature map 24; the resolution of the feature map 24 is 1/4 of the resolution of the original image;
adding the feature map 24 and the feature map 25 pixel by pixel to generate a feature map 26; the resolution of the feature map 26 is 1/4 of the resolution of the original image;
inputting the feature map 26 to a convolution module C21 for encoding to generate a fifth feature map; the number of channels and the resolution of the fifth feature map are 256 and 1/4 of the resolution of the initial image, respectively.
It can be understood that the feature aggregation network aggregates the feature maps of the different branches together, so that their resolutions and channel numbers are unified; the fourth feature map is derived from the multi-layer shallow channel network, and the second feature map is derived from the few-layer deep channel network, so integrating the features of these two parts of the network improves the ability of the whole network to segment spatial features and edge details.
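A sketch of this aggregation, following the data flow described above (the channel-by-channel convolutions are written as depthwise convolutions with the number of groups equal to the channel count), reusing the assumed ConvModule and intended as illustration only:

```python
import torch
import torch.nn as nn

class FeatureAggregationBranch(nn.Module):
    """Fuse the second feature map (1/8 resolution) with the fourth feature map (1/4 resolution)."""
    def __init__(self, ch=256):
        super().__init__()
        self.dwc1 = nn.Conv2d(ch, ch, 3, padding=1, groups=ch, bias=False)   # DWC1
        self.c17 = ConvModule(ch, ch, k=1)
        self.c18 = ConvModule(ch, ch, k=3)
        self.up1 = nn.Upsample(scale_factor=2, mode="bilinear", align_corners=False)
        self.c19 = ConvModule(ch, ch, k=3, stride=2)
        self.dwc2 = nn.Conv2d(ch, ch, 3, padding=1, groups=ch, bias=False)   # DWC2
        self.c20 = ConvModule(ch, ch, k=1)
        self.up2 = nn.Upsample(scale_factor=2, mode="bilinear", align_corners=False)
        self.c21 = ConvModule(ch, ch, k=3)

    def forward(self, second, fourth):
        f19 = self.c17(self.dwc1(second))                 # 1/8 resolution
        f20 = self.up1(self.c18(second))                  # 1/4 resolution
        f21 = torch.sigmoid(self.c19(fourth))             # 1/8 resolution, values in [0, 1]
        f22 = torch.sigmoid(self.c20(self.dwc2(fourth)))  # 1/4 resolution, values in [0, 1]
        f25 = self.up2(f19 * f21)                         # feature map 23 upsampled to 1/4
        f24 = f20 * f22                                   # 1/4 resolution
        return self.c21(f24 + f25)                        # fifth feature map
```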
Further, as shown in fig. 8, inputting the fifth feature map into the feature encoding network to encode, and generating a sixth feature map includes:
inputting the fifth feature map into a CoordConv layer, a convolution module C22, a convolution module C23 and a convolution module C24 which are sequentially connected to perform convolution calculation to generate a sixth feature map; the number of channels of the sixth feature map is M+N, the resolution of the sixth feature map is 1/4 of the resolution of the initial image, M is the number of categories of the foreground of the initial image, and N is the number of categories of the background of the initial image.
The feature encoding network continues to calculate the input features through the convolution layer, where the CoordConv layer helps to improve positioning of each instance in the image, and the convolution layer further learns the features of the image and reduces the number of channels, so as to facilitate calculation with the third feature map.
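Under the same assumptions as before (the CoordConv layer modelled as two appended coordinate channels, hence the 258 input channels for C22 in this sketch), the feature encoding network can be written as:

```python
import torch.nn as nn

class FeatureEncodingNetwork(nn.Module):
    """CoordConv -> C22 -> C23 -> C24, reducing 256 channels to M+N at 1/4 resolution."""
    def __init__(self, m, n):
        super().__init__()
        self.encode = nn.Sequential(
            AddCoords(),               # CoordConv-style coordinate channels (assumed)
            ConvModule(258, 128),      # C22
            ConvModule(128, 128),      # C23
            ConvModule(128, m + n),    # C24
        )

    def forward(self, fifth_feature_map):
        return self.encode(fifth_feature_map)   # sixth feature map: M+N channels
```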
Further, inputting the seventh feature map to a post-processing module to fuse each channel in the seventh feature map to obtain a panoramic segmentation result of the initial image, where the method includes:
performing a 4× upsampling operation on the seventh feature map to obtain a feature map 27;
normalizing the pixel values corresponding to the pixel points in the feature map 27 to be between 0 and 1 by using a sigmoid function to obtain a feature map 28;
The pixel point category corresponding to the maximum pixel value of the same pixel point in each channel is taken as the final category of the pixel point, a characteristic diagram 29 is obtained, and the characteristic diagram 29 is the panoramic segmentation result of the initial image; the number of channels of feature map 29 is 1;
wherein the pixel point types are M+N types; m is the number of foreground classes of the initial image, and different examples in the same foreground class correspond to different example numbers; n is the number of categories of the background of the initial image.
Note that, as described above, the number of channels of the third feature map and the sixth feature map is M+N, and the seventh feature map is the result of pixel-by-pixel addition of the third feature map and the sixth feature map, so the number of channels of the seventh feature map is also M+N, and the numbers of channels of the feature map 27 and the feature map 28 are also M+N;
the process of obtaining the feature map 29 is to fuse all channels in the feature map 28, that is, the pixel point category corresponding to the maximum pixel value of the same pixel point in each channel is the final category of the pixel point, finally obtain the feature map 29 of a single channel, and determine the category corresponding to each pixel point in the single channel.
It can be understood that the pixel point category and different examples in the same foreground category correspond to different example numbers, which are obtained when the example positioning branch network performs positioning on the foreground examples and the background areas of the first feature map, that is, the position feature map includes m+n channels, each channel predicts the position of the pixel point of the corresponding foreground example object or background structure on the image, and for the foreground structure, M channels correspond to M categories, each channel includes different individuals in the same category, and different example numbers are allocated to different individuals in the same category (such as the table 1 and the table 2 mentioned above).
The 4-fold upsampling operation is: the seventh feature map has a resolution of 1/4 and the segmentation result needs to be transformed to the same resolution as the input image by an upsampling operation.
For example, as shown in fig. 9, assume that the feature map 27 has three channels; the pixel values of the first channel are shown in the figure, and the pixel values of the second and third channels are omitted.
The pixel values of the feature map 27 are normalized to between 0 and 1 with a sigmoid function to obtain a feature map 28, so that the value of every pixel point in every channel lies in [0,1].
Each pixel point has one value in each of the three channels, and the category corresponding to the largest of these values is taken as the final category of that pixel point, giving the single-channel feature map 29. For example, if the values of pixel point a in the three channels are 0.2, 0.3 and 0.5, and the categories of the three channels are table, chair and cup respectively, then the category cup, which corresponds to the value 0.5, is the final category of pixel point a.
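To make this post-processing concrete, the following is a minimal PyTorch-style sketch of the channel fusion; the function name post_process, the bilinear interpolation mode and the batched tensor layout are illustrative assumptions rather than details fixed by the patent.

```python
import torch
import torch.nn.functional as F

def post_process(seventh_feature_map: torch.Tensor) -> torch.Tensor:
    """Fuse the M+N channels of the seventh feature map into a single-channel label map.

    seventh_feature_map: tensor of shape (B, M+N, H/4, W/4).
    Returns a (B, H, W) integer map whose values index foreground instances or background classes.
    """
    x = F.interpolate(seventh_feature_map, scale_factor=4,
                      mode="bilinear", align_corners=False)  # feature map 27 (4x up-sampling)
    x = torch.sigmoid(x)                                     # feature map 28, values in [0, 1]
    return x.argmax(dim=1)                                   # feature map 29, single channel
```

Under these assumptions, a pixel whose three channel scores are 0.2, 0.3 and 0.5 is assigned the category of the third channel, which matches the table/chair/cup example above.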
The invention provides a panoramic segmentation method based on multi-branch feature extraction: an RGB image is preprocessed to obtain an initial image, and the initial image is input into a pre-constructed panoramic segmentation network to obtain its panoramic segmentation result, the pre-constructed panoramic segmentation network comprising a backbone network, a detail extraction branch network, an instance positioning branch network, a channel attention branch network, a feature aggregation branch network, a feature encoding network and a post-processing module. The method can accurately identify different foreground objects and the background and accurately segment object edges, effectively improving the segmentation of object edges; by fusing the high-dimensional and low-dimensional features of the image, rich receptive fields and spatial information are extracted, the overall accuracy of panoramic segmentation is improved, and technical support is provided for downstream tasks.
Example two
In order to cooperate with the panorama segmentation method based on multi-branch feature extraction provided by the above embodiment, the present invention further provides a panorama segmentation system based on multi-branch feature extraction, as shown in fig. 10, the system includes:
the preprocessing module is used for preprocessing the RGB image to obtain an initial image;
the panoramic segmentation module is used for inputting the initial image into a pre-constructed panoramic segmentation network to obtain a panoramic segmentation result of the initial image;
the pre-constructed panoramic segmentation network comprises: a backbone network, a detail extraction branch network, an instance positioning branch network, a channel attention branch network, a feature aggregation branch network, a feature encoding network, and a post-processing module.
Further, the preprocessing module is specifically configured to:
the resolution of the RGB image is normalized to 512 x 1024 to obtain the initial image.
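As a simple illustration of this preprocessing step, a possible sketch is shown below; the bilinear interpolation and the reading of 512 x 1024 as (height, width) are assumptions, since the specification only fixes the target resolution.

```python
import torch
import torch.nn.functional as F

def preprocess(rgb_image: torch.Tensor) -> torch.Tensor:
    """Normalize a batched float RGB image (B, 3, H, W) to a fixed 512 x 1024 resolution."""
    return F.interpolate(rgb_image, size=(512, 1024), mode="bilinear", align_corners=False)
```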
Further, the panorama segmentation module includes:
a first generating unit, used for inputting the initial image into a backbone network for high-dimensional feature extraction and generating a first feature map;
a second generating unit, used for inputting the initial image into the detail extraction branch network to perform low-dimensional feature extraction and generating a second feature map;
a third generating unit, used for inputting the first feature map into an instance positioning branch network to position foreground instances and background regions and generating a third feature map;
a fourth generating unit, used for inputting the first feature map to the channel attention branch network, distributing weights to all channels of the first feature map and generating a fourth feature map;
a fifth generating unit, used for inputting the second feature map and the fourth feature map into the feature aggregation branch network for fusion to generate a fifth feature map;
a sixth generating unit, used for inputting the fifth feature map into the feature encoding network for encoding and generating a sixth feature map;
a seventh generating unit, used for adding the third feature map and the sixth feature map pixel by pixel to generate a seventh feature map;
and an eighth generating unit, used for inputting the seventh feature map to the post-processing module so as to fuse all channels in the seventh feature map and obtain a panoramic segmentation result of the initial image.
Further, the backbone network includes: an FPN network based on ResNet, a convolution module C1, a convolution module C2, an up-sampling module U1, two up-sampling modules U2 and three up-sampling modules U3;
the convolution modules C1 and C2 each include: a convolution layer, a batch normalization layer, and an activation function;
The upsampling module U1, the upsampling module U2, and the upsampling module U3 each include: a convolution layer, a batch normalization layer, an activation function, and a 2× upsampling layer;
the convolution kernel of the convolution layer is 3×3, and both the input channel and the output channel of the convolution layer are 256.
Further, the detail extraction branch network includes: convolution module C3, convolution module C4, convolution module C5, convolution module C6, convolution module C7, convolution module C8, convolution module C9 and convolution module C10;
convolution module C3, convolution module C4, convolution module C5, convolution module C6, convolution module C7, convolution module C8, convolution module C9, and convolution module C10 each include: a convolution layer, a batch normalization layer, and an activation function;
the convolution kernel, input channel, output channel, step size and resolution of the convolution layer of convolution module C3 are 3 x 3, 64, 2 and 1/2 of the resolution of the initial image, respectively;
the convolution kernel, input channel, output channel and step size of the convolution layer of convolution module C4 are 3×3, 64, 64 and 1, respectively;
the convolution kernel, input channel, output channel, step size and resolution of the convolution layer of convolution module C5 are 3×3, 64, 64, 2 and 1/2 of the resolution of the initial image, respectively;
the convolution kernel, input channel, output channel and step size of the convolution layer of convolution module C6 are 3×3, 64, 64 and 1, respectively;
the convolution kernel, input channel, output channel and step size of the convolution layer of convolution module C7 are 3×3, 64, 64 and 1, respectively;
the convolution kernel, input channel, output channel, step size and resolution of the convolution layer of convolution module C8 are 3×3, 64, 128, 2 and 1/2 of the resolution of the initial image, respectively;
the convolution kernel, input channel, output channel and step size of the convolution layer of convolution module C9 are 3×3, 128, 256 and 1, respectively;
the convolution kernel, input channel, output channel and step size of the convolution layer of convolution module C10 are 3×3, 256, 256 and 1, respectively.
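As an illustration, the detail extraction branch can be assembled directly from the hyperparameters listed above; the PyTorch-style sketch below assumes a ReLU activation and is not the patent's reference implementation.

```python
import torch.nn as nn

def conv_bn_act(in_ch, out_ch, stride=1):
    """3x3 convolution + batch normalization + activation (ReLU assumed)."""
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, 3, stride, padding=1, bias=False),
        nn.BatchNorm2d(out_ch),
        nn.ReLU(inplace=True),
    )

# C3..C10 with (input channels, output channels, stride) taken from the specification above.
detail_branch = nn.Sequential(
    conv_bn_act(3, 64, stride=2),    # C3: halves resolution (1/2 of the initial image)
    conv_bn_act(64, 64),             # C4
    conv_bn_act(64, 64, stride=2),   # C5: halves resolution again (1/4)
    conv_bn_act(64, 64),             # C6
    conv_bn_act(64, 64),             # C7
    conv_bn_act(64, 128, stride=2),  # C8: halves resolution again (1/8)
    conv_bn_act(128, 256),           # C9
    conv_bn_act(256, 256),           # C10 -> second feature map: 256 channels, 1/8 resolution
)
```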
Further, the example positioning branch network includes: convolution module C11, convolution module C12, convolution module C13, coordConv layer, convolution module C14, convolution module C15, and convolution module C16;
convolution modules C11, C12, C13, C14, C15, and C16 each include: a convolution layer, a batch normalization layer, and an activation function;
the convolution kernel, input channel, output channel and step size of the convolution layer of the convolution module C11 are 3×3, 256, 256 and 1, respectively;
the convolution kernel, input channel, output channel and step size of the convolution layer of the convolution module C12 are 3×3, 256, 256 and 1, respectively;
the convolution kernel, input channel, output channel and step size of the convolution layer of the convolution module C13 are 3×3, 256, M+N and 1, respectively;
the convolution kernel, input channel, output channel and step size of the convolution layer of the convolution module C14 are 3×3, 256, 256 and 1, respectively;
the convolution kernel, input channel, output channel and step size of the convolution layer of the convolution module C15 are 3×3, 256, M+N and 1, respectively;
the convolution kernel, input channel, output channel and step size of the convolution layer of the convolution module C16 are 3×3, M+N, M+N and 1, respectively.
Further, the channel attention branching network includes: global average pooling layer, full connection layer FC1, full connection layer FC2, and Sigmoid layer;
the input channel and the output channel of the full connection layer FC1 are 256 and 16 respectively;
the input channels and output channels of the fully connected layer FC2 are 16 and 256, respectively.
Further, the feature aggregation branch network includes: channel-by-channel convolution DWC1, convolution module C17, convolution module C18, UP-sampling layer UP1, convolution module C19, sigmoid layer, channel-by-channel convolution DWC2, convolution module C20, sigmoid layer, UP-sampling layer UP2, and convolution module C21;
Convolution modules C17, C18, C19, C20, and C21 each include: a convolution layer, a batch normalization layer, and an activation function;
the convolution kernel, step size, grouping number, input channel and output channel of the convolution layer of the channel-by-channel convolution DWC1 are respectively: 3×3, 1, 256, 256 and 256;
the convolution kernel, step size, grouping number, input channel and output channel of the convolution layer of the channel-by-channel convolution DWC2 are respectively: 3×3, 1, 256, 256 and 256;
the convolution kernel, step length, input channel and output channel of the convolution layer of the convolution module C17 are respectively: 1×1, 1, 256, and 256;
the convolution kernel, step length, input channel and output channel of the convolution layer of the convolution module C18 are respectively: 3×3, 1, 256, and 256;
the convolution kernel, step length, input channel and output channel of the convolution layer of the convolution module C19 are respectively: 3×3, 2, 256, and 256;
the convolution kernel, step length, input channel and output channel of the convolution layer of the convolution module C20 are respectively: 1×1, 1, 256, and 256;
the convolution kernel, step length, input channel and output channel of the convolution layer of the convolution module C21 are respectively: 3×3, 1, 256, and 256.
Further, the feature encoding network includes: a CoordConv layer, a convolution module C22, a convolution module C23 and a convolution module C24;
The convolution modules C22, C23, and C24 each include: a convolution layer, a batch normalization layer, and an activation function;
the convolution kernel, step length, input channel and output channel of the convolution layer of the convolution module C22 are respectively: 3×3, 1, 256, and 128;
the convolution kernel, step length, input channel and output channel of the convolution layer of the convolution module C23 are respectively: 3×3, 1, 128, and 128;
the convolution kernel, step length, input channel and output channel of the convolution layer of the convolution module C24 are respectively: 3×3, 1, 128, and M+N.
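To make the feature encoding network concrete, a minimal PyTorch-style sketch assembled from the hyperparameters above follows; the ReLU activation, the CoordConv implementation (two coordinate channels appended before a convolution that restores 256 channels) and the class names are illustrative assumptions.

```python
import torch
import torch.nn as nn

def conv_bn_act(in_ch, out_ch):
    """3x3 convolution + batch normalization + activation (ReLU assumed)."""
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, 3, padding=1, bias=False),
        nn.BatchNorm2d(out_ch),
        nn.ReLU(inplace=True),
    )

class CoordConv(nn.Module):
    """Appends normalized x/y coordinate channels, then convolves back to out_ch channels."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.conv = conv_bn_act(in_ch + 2, out_ch)

    def forward(self, x):
        b, _, h, w = x.shape
        ys = torch.linspace(-1, 1, h, device=x.device).view(1, 1, h, 1).expand(b, 1, h, w)
        xs = torch.linspace(-1, 1, w, device=x.device).view(1, 1, 1, w).expand(b, 1, h, w)
        return self.conv(torch.cat([x, xs, ys], dim=1))

class FeatureEncodingNetwork(nn.Module):
    """CoordConv layer followed by C22 (256->128), C23 (128->128) and C24 (128->M+N)."""
    def __init__(self, m_plus_n):
        super().__init__()
        self.coord = CoordConv(256, 256)  # assumed to keep 256 channels so C22's input matches the spec
        self.c22 = conv_bn_act(256, 128)
        self.c23 = conv_bn_act(128, 128)
        self.c24 = conv_bn_act(128, m_plus_n)

    def forward(self, fifth_feature_map):
        # Output: sixth feature map with M+N channels at 1/4 of the initial resolution.
        return self.c24(self.c23(self.c22(self.coord(fifth_feature_map))))
```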
Further, the first generating unit is specifically configured to:
inputting the initial image into an FPN network based on ResNet for convolution processing to generate a feature map 11, a feature map 12, a feature map 13, a feature map 14 and a feature map 15; the resolutions of the feature maps 11 to 15 are respectively 1/4, 1/8, 1/16, 1/32 and 1/64 of the resolution of the initial image, and the channel numbers of the feature maps 11 to 15 are all 256;
inputting the feature map 11, the feature map 12, the feature map 13 and the feature map 14 to a convolution module C1, an up-sampling module U1, two up-sampling modules U2 and three up-sampling modules U3, respectively, so as to unify the channel numbers and resolutions of the feature maps 11 to 14 and generate a feature map 11a, a feature map 12a, a feature map 13a and a feature map 14a; the resolution of each of the feature maps 11a to 14a is 1/4 of the resolution of the initial image, and the number of channels of each of the feature maps 11a to 14a is 256;
adding the feature maps 11a to 14a pixel by pixel to generate a feature map 16;
inputting the feature map 16 to a convolution module C2 for feature coding to generate a first feature map; the resolution of the first feature map is 1/4 of the resolution of the initial image, and the number of channels of the first feature map is 256.
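For the first generating unit, a possible PyTorch-style sketch of how the FPN outputs are unified and fused is given below; the ReLU activation, the bilinear 2x up-sampling and the class name are assumptions, and the ResNet-based FPN producing feature maps 11 to 15 is assumed to exist elsewhere.

```python
import torch.nn as nn

def conv_bn_act(in_ch=256, out_ch=256):
    """3x3 convolution + batch normalization + activation (ReLU assumed)."""
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, 3, padding=1, bias=False),
        nn.BatchNorm2d(out_ch),
        nn.ReLU(inplace=True),
    )

def upsample_module():
    """U1/U2/U3: convolution + BN + activation + 2x up-sampling layer (bilinear assumed)."""
    return nn.Sequential(
        conv_bn_act(),
        nn.Upsample(scale_factor=2, mode="bilinear", align_corners=False),
    )

class BackboneHead(nn.Module):
    """Unifies FPN feature maps 11-14 to 1/4 resolution and 256 channels, sums them, encodes with C2."""
    def __init__(self):
        super().__init__()
        self.c1 = conv_bn_act()                                        # feature map 11: already 1/4
        self.u1 = upsample_module()                                    # feature map 12: 1/8 -> 1/4
        self.u2 = nn.Sequential(upsample_module(), upsample_module())  # feature map 13: 1/16 -> 1/4
        self.u3 = nn.Sequential(upsample_module(), upsample_module(),
                                upsample_module())                     # feature map 14: 1/32 -> 1/4
        self.c2 = conv_bn_act()                                        # encodes the summed feature map 16

    def forward(self, f11, f12, f13, f14):
        f16 = self.c1(f11) + self.u1(f12) + self.u2(f13) + self.u3(f14)  # pixel-by-pixel addition
        return self.c2(f16)  # first feature map: 256 channels, 1/4 of the initial resolution
```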
Further, the second generating unit is specifically configured to:
inputting the initial image into a convolution module C3, a convolution module C4, a convolution module C5, a convolution module C6, a convolution module C7, a convolution module C8, a convolution module C9 and a convolution module C10 which are connected in sequence to perform low-dimensional feature extraction, and generating a second feature map; the resolution of the second feature map is 1/8 of the resolution of the initial image, and the number of channels of the second feature map is 256.
Further, the third generating unit is specifically configured to:
inputting the first feature map into a convolution module C11, a convolution module C12 and a convolution module C13 which are connected in sequence to carry out convolution operation, and generating a position feature map; the channel number of the position feature map is M+N, M is the number of foreground categories of the initial image, and different instances in the same foreground category correspond to different instance numbers; N is the number of background categories of the initial image;
inputting the first feature map to a CoordConv layer, a convolution module C14 and a convolution module C15 which are connected in sequence to carry out convolution operation, and generating a position weight map; the number of channels of the position weight graph is M+N;
Multiplying the position feature map and the position weight map pixel by pixel to generate a feature map 17;
inputting the feature map 17 to a convolution module C16 for encoding to generate a third feature map; the number of channels of the third feature map is M+N.
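A minimal PyTorch-style sketch of this instance positioning branch follows; the ReLU activation, the CoordConv output width of 256 channels (so that module C14's input matches the specification) and the class names are assumptions.

```python
import torch
import torch.nn as nn

def conv_bn_act(in_ch, out_ch):
    """3x3 convolution (stride 1) + batch normalization + activation (ReLU assumed)."""
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, 3, padding=1, bias=False),
        nn.BatchNorm2d(out_ch),
        nn.ReLU(inplace=True),
    )

class CoordConv(nn.Module):
    """Appends normalized x/y coordinate channels, then convolves back to out_ch channels."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.conv = conv_bn_act(in_ch + 2, out_ch)

    def forward(self, x):
        b, _, h, w = x.shape
        ys = torch.linspace(-1, 1, h, device=x.device).view(1, 1, h, 1).expand(b, 1, h, w)
        xs = torch.linspace(-1, 1, w, device=x.device).view(1, 1, 1, w).expand(b, 1, h, w)
        return self.conv(torch.cat([x, xs, ys], dim=1))

class InstancePositioningBranch(nn.Module):
    """Position feature map (C11-C13) weighted by a position weight map (CoordConv, C14, C15), encoded by C16."""
    def __init__(self, m_plus_n):
        super().__init__()
        self.position_features = nn.Sequential(
            conv_bn_act(256, 256), conv_bn_act(256, 256), conv_bn_act(256, m_plus_n))  # C11, C12, C13
        self.coord = CoordConv(256, 256)
        self.position_weights = nn.Sequential(
            conv_bn_act(256, 256), conv_bn_act(256, m_plus_n))                         # C14, C15
        self.c16 = conv_bn_act(m_plus_n, m_plus_n)

    def forward(self, first_feature_map):
        p = self.position_features(first_feature_map)             # position feature map, M+N channels
        w = self.position_weights(self.coord(first_feature_map))  # position weight map, M+N channels
        return self.c16(p * w)                                    # third feature map, M+N channels
```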
Further, the fourth generating unit is specifically configured to:
inputting the first feature map to a global average pooling layer, a full connection layer FC1, a full connection layer FC2 and a Sigmoid layer which are sequentially connected so as to compress and expand the channel number of the first feature map and generate a feature map 18;
multiplying the first feature map and the feature map 18 pixel by pixel to generate a fourth feature map; the number of channels of the fourth feature map is 256, and the resolution of the fourth feature map is 1/4 of the resolution of the initial image.
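The channel attention branch can be read as a squeeze-and-excitation style weighting; a possible sketch under that reading is shown below, where only the layer sizes (256, 16, 256) and the Sigmoid come from the specification, and the class name and broadcast multiplication are illustrative.

```python
import torch
import torch.nn as nn

class ChannelAttentionBranch(nn.Module):
    """Global average pooling -> FC1 (256->16) -> FC2 (16->256) -> Sigmoid, then channel-wise scaling."""
    def __init__(self, channels=256, reduced=16):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)      # global average pooling layer
        self.fc1 = nn.Linear(channels, reduced)  # full connection layer FC1
        self.fc2 = nn.Linear(reduced, channels)  # full connection layer FC2

    def forward(self, first_feature_map):
        b, c, _, _ = first_feature_map.shape
        w = self.pool(first_feature_map).view(b, c)
        w = torch.sigmoid(self.fc2(self.fc1(w))).view(b, c, 1, 1)  # feature map 18 (per-channel weights)
        return first_feature_map * w  # fourth feature map: 256 channels, 1/4 of the initial resolution
```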
Further, the fifth generating unit is specifically configured to:
inputting the second feature map to a channel-by-channel convolution DWC1 and a convolution module C17 which are connected in sequence to carry out convolution operation, and generating a feature map 19; the resolution of the feature map 19 is 1/8 of the resolution of the original image;
inputting the second feature map to a convolution module C18 and an UP-sampling layer UP1 which are connected in sequence to UP-sample the resolution, and generating a feature map 20; the resolution of the feature map 20 is 1/4 of the resolution of the original image;
Inputting the fourth feature map to a convolution module C19 and a Sigmoid layer which are connected in sequence to normalize the pixel value of the fourth feature map to be between [0,1] to generate a feature map 21; the resolution of the feature map 21 is 1/8 of the resolution of the original image;
inputting the fourth feature map to the channel-by-channel convolution DWC2, the convolution module C20 and the Sigmoid layer which are connected in sequence so as to normalize the pixel value of the fourth feature map to be between [0,1] and generate a feature map 22; the resolution of the feature map 22 is 1/4 of the resolution of the original image;
performing pixel-by-pixel multiplication on the feature map 19 and the feature map 21 to generate a feature map 23; the resolution of the feature map 23 is 1/8 of the resolution of the original image;
inputting the feature map 23 into an UP-sampling layer UP2 to UP-sample the resolution, and generating a feature map 25; the resolution of the feature map 25 is 1/4 of the resolution of the original image;
multiplying the feature map 20 and the feature map 22 pixel by pixel to generate a feature map 24; the resolution of the feature map 24 is 1/4 of the resolution of the original image;
adding the feature map 24 and the feature map 25 pixel by pixel to generate a feature map 26; the resolution of the feature map 26 is 1/4 of the resolution of the original image;
inputting the feature map 26 to a convolution module C21 for encoding to generate a fifth feature map; the number of channels and the resolution of the fifth feature map are 256 and 1/4 of the resolution of the initial image, respectively.
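For the fifth generating unit, a possible PyTorch-style sketch of the feature aggregation branch is given below; the ReLU activations, the bilinear 2x up-sampling layers UP1/UP2, the plain depth-wise convolutions used for DWC1/DWC2 and the class name are assumptions beyond what the specification states.

```python
import torch
import torch.nn as nn

def conv_bn_act(in_ch, out_ch, k=3, stride=1):
    """Convolution + batch normalization + activation (ReLU assumed)."""
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, k, stride, padding=k // 2, bias=False),
        nn.BatchNorm2d(out_ch),
        nn.ReLU(inplace=True),
    )

class FeatureAggregationBranch(nn.Module):
    """Fuses the detail branch output (second feature map, 1/8) with the attention output (fourth, 1/4)."""
    def __init__(self):
        super().__init__()
        self.dwc1 = nn.Conv2d(256, 256, 3, padding=1, groups=256)  # channel-by-channel convolution DWC1
        self.c17 = conv_bn_act(256, 256, k=1)
        self.c18 = conv_bn_act(256, 256)
        self.up1 = nn.Upsample(scale_factor=2, mode="bilinear", align_corners=False)  # UP1
        self.c19 = conv_bn_act(256, 256, stride=2)
        self.dwc2 = nn.Conv2d(256, 256, 3, padding=1, groups=256)  # channel-by-channel convolution DWC2
        self.c20 = conv_bn_act(256, 256, k=1)
        self.up2 = nn.Upsample(scale_factor=2, mode="bilinear", align_corners=False)  # UP2
        self.c21 = conv_bn_act(256, 256)

    def forward(self, second, fourth):
        f19 = self.c17(self.dwc1(second))                 # 1/8 resolution
        f20 = self.up1(self.c18(second))                  # 1/4 resolution
        f21 = torch.sigmoid(self.c19(fourth))             # 1/8 resolution
        f22 = torch.sigmoid(self.c20(self.dwc2(fourth)))  # 1/4 resolution
        f23 = f19 * f21                                   # 1/8 resolution
        f25 = self.up2(f23)                               # 1/4 resolution
        f24 = f20 * f22                                   # 1/4 resolution
        return self.c21(f24 + f25)                        # fifth feature map: 256 channels, 1/4 resolution
```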
Further, the sixth generating unit is specifically configured to:
inputting the fifth feature map into a CoordConv layer, a convolution module C22, a convolution module C23 and a convolution module C24 which are sequentially connected to perform convolution calculation and generate a sixth feature map; the number of channels of the sixth feature map is M+N, the resolution of the sixth feature map is 1/4 of the resolution of the initial image, M is the number of categories of the foreground of the initial image, and N is the number of categories of the background of the initial image.
Further, the eighth generating unit is specifically configured to:
performing a 4× up-sampling operation on the seventh feature map to obtain a feature map 27;
normalizing the pixel values corresponding to the pixel points in the feature map 27 to be between 0 and 1 by using a sigmoid function to obtain a feature map 28;
taking, for each pixel point, the category whose channel holds the maximum pixel value at that pixel as the final category of the pixel point, thereby obtaining a feature map 29, where the feature map 29 is the panoramic segmentation result of the initial image; the number of channels of the feature map 29 is 1;
wherein there are M+N pixel point categories; M is the number of foreground categories of the initial image, and different instances in the same foreground category correspond to different instance numbers; N is the number of background categories of the initial image.
The invention provides a panoramic segmentation system based on multi-branch feature extraction: an RGB image is preprocessed by the preprocessing module to obtain an initial image, and the initial image is input into a pre-constructed panoramic segmentation network by the panoramic segmentation module to obtain the panoramic segmentation result of the initial image, the pre-constructed panoramic segmentation network comprising a backbone network, a detail extraction branch network, an instance positioning branch network, a channel attention branch network, a feature aggregation branch network, a feature encoding network and a post-processing module. The system can accurately identify different foreground objects and the background and accurately segment object edges; by fusing the high-dimensional and low-dimensional features of the image, rich receptive fields and spatial information are extracted, improving the overall accuracy of panoramic segmentation.
It can be understood that the system embodiments provided above correspond to the method embodiments described above, and the corresponding specific details may be referred to each other, which is not described herein again.
It is to be understood that the same or similar parts of the above embodiments may be referred to one another; content that is not described in detail in one embodiment may refer to the same or similar content in the other embodiments.
Example III
Based on the same inventive concept, the invention also provides a computer device comprising a processor and a memory, the memory storing a computer program comprising program instructions, and the processor being configured to execute the program instructions stored in the computer storage medium. The processor may be a Central Processing Unit (CPU), or another general-purpose processor, a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field Programmable Gate Array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, etc.; it is the computational and control core of the terminal and is adapted to load and execute one or more instructions, in particular one or more instructions in a computer storage medium, so as to realize the corresponding method flow or functions and implement the steps of the panorama segmentation method based on multi-branch feature extraction in the above embodiments.
Example IV
Based on the same inventive concept, the present invention also provides a storage medium, in particular a computer-readable storage medium (Memory), which is a memory device in a computer device used for storing programs and data. It is understood that the computer-readable storage medium herein may include both the built-in storage medium of the computer device and any extended storage medium that the computer device supports. The computer-readable storage medium provides storage space that stores the operating system of the terminal. One or more instructions suited to be loaded and executed by the processor, which may be one or more computer programs (including program code), are also stored in this storage space. The computer-readable storage medium here may be a high-speed RAM or a non-volatile memory, such as at least one magnetic disk memory. The one or more instructions stored in the computer-readable storage medium may be loaded and executed by the processor to implement the steps of the panorama segmentation method based on multi-branch feature extraction in the above embodiments.
It will be appreciated by those skilled in the art that embodiments of the present invention may be provided as a method, system, or computer program product. Accordingly, the present invention may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present invention may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
The present invention is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems) and computer program products according to embodiments of the invention. It will be understood that each flow and/or block of the flowchart illustrations and/or block diagrams, and combinations of flows and/or blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
Finally, it should be noted that: the above embodiments are only for illustrating the technical aspects of the present invention and not for limiting the same, and although the present invention has been described in detail with reference to the above embodiments, it should be understood by those of ordinary skill in the art that: modifications and equivalents may be made to the specific embodiments of the invention without departing from the spirit and scope of the invention, which is intended to be covered by the claims.

Claims (14)

1. A panorama segmentation method based on multi-branch feature extraction, the method comprising:
preprocessing an RGB image to obtain an initial image;
inputting the initial image into a pre-constructed panoramic segmentation network to obtain a panoramic segmentation result of the initial image;
the pre-constructed panoramic segmentation network comprises: a backbone network, a detail extraction branch network, an instance positioning branch network, a channel attention branch network, a feature aggregation branch network, a feature encoding network and a post-processing module;
inputting the initial image into a pre-constructed panoramic segmentation network to obtain a panoramic segmentation result of the initial image, wherein the panoramic segmentation result comprises the following steps:
inputting the initial image into the backbone network for high-dimensional feature extraction, and generating a first feature map;
inputting the initial image into the detail extraction branch network to perform low-dimensional feature extraction, and generating a second feature map;
inputting the first feature map to the instance positioning branch network to position a foreground instance and a background region, and generating a third feature map;
inputting the first feature map to the channel attention branch network, distributing weights for all channels of the first feature map, and generating a fourth feature map;
Inputting the second characteristic diagram and the fourth characteristic diagram into the characteristic aggregation branch network for fusion to generate a fifth characteristic diagram;
inputting the fifth feature map to the feature coding network for coding to generate a sixth feature map;
adding the third characteristic diagram and the sixth characteristic diagram pixel by pixel to generate a seventh characteristic diagram;
inputting the seventh feature map to the post-processing module so as to fuse all channels in the seventh feature map and obtain a panoramic segmentation result of the initial image;
the detail extraction branch network comprises: convolution module C3, convolution module C4, convolution module C5, convolution module C6, convolution module C7, convolution module C8, convolution module C9 and convolution module C10;
the convolution modules C3, C4, C5, C6, C7, C8, C9 and C10 include: a convolution layer, a batch normalization layer, and an activation function;
the convolution kernel, input channel, output channel, step size and resolution of the convolution layer of the convolution module C3 are 3×3, 3, 64, 2 and 1/2 of the resolution of the initial image respectively;
the convolution kernel, input channel, output channel and step size of the convolution layer of the convolution module C4 are 3×3, 64, 64 and 1 respectively;
the convolution kernel, input channel, output channel, step size and resolution of the convolution layer of the convolution module C5 are 3×3, 64, 64, 2 and 1/2 of the resolution of the initial image respectively;
the convolution kernel, input channel, output channel and step size of the convolution layer of the convolution module C6 are 3×3, 64, 64 and 1 respectively;
the convolution kernel, input channel, output channel and step size of the convolution layer of the convolution module C7 are 3×3, 64, 64 and 1 respectively;
the convolution kernel, input channel, output channel, step size and resolution of the convolution layer of the convolution module C8 are 3×3, 64, 128, 2 and 1/2 of the resolution of the initial image respectively;
the convolution kernel, input channel, output channel and step size of the convolution layer of the convolution module C9 are 3×3, 128, 256 and 1 respectively;
the convolution kernel, input channel, output channel and step size of the convolution layer of the convolution module C10 are 3×3, 256, 256 and 1 respectively;
the feature aggregation branch network comprises: channel-by-channel convolution DWC1, convolution module C17, convolution module C18, UP-sampling layer UP1, convolution module C19, sigmoid layer, channel-by-channel convolution DWC2, convolution module C20, sigmoid layer, UP-sampling layer UP2, and convolution module C21;
The convolution modules C17, C18, C19, C20 and C21 include: a convolution layer, a batch normalization layer, and an activation function;
the convolution kernel, step length, grouping number, input channel and output channel of the convolution layer of the channel-by-channel convolution DWC1 are respectively: 3×3, 1, 256, 256 and 256;
the convolution kernel, step length, grouping number, input channel and output channel of the convolution layer of the channel-by-channel convolution DWC2 are respectively: 3×3, 1, 256, 256 and 256;
the convolution kernel, step length, input channel and output channel of the convolution layer of the convolution module C17 are respectively: 1×1, 1, 256, and 256;
the convolution kernel, step length, input channel and output channel of the convolution layer of the convolution module C18 are respectively: 3×3, 1, 256, and 256;
the convolution kernel, step length, input channel and output channel of the convolution layer of the convolution module C19 are respectively: 3×3, 2, 256, and 256;
the convolution kernel, step length, input channel and output channel of the convolution layer of the convolution module C20 are respectively: 1×1, 1, 256, and 256;
the convolution kernel, step length, input channel and output channel of the convolution layer of the convolution module C21 are respectively: 3×3, 1, 256, and 256.
2. The method of claim 1, wherein preprocessing the RGB image to obtain an initial image comprises:
and normalizing the resolution of the RGB image to 512 x 1024 to obtain the initial image.
3. The method of claim 1, wherein the backbone network comprises: an FPN network based on ResNet, a convolution module C1, a convolution module C2, an up-sampling module U1, two up-sampling modules U2 and three up-sampling modules U3;
the convolution module C1 and the convolution module C2 each include: a convolution layer, a batch normalization layer, and an activation function;
the up-sampling module U1, the up-sampling module U2 and the up-sampling module U3 all include: a convolution layer, a batch normalization layer, an activation function, and a 2× up-sampling layer;
the convolution kernel of the convolution layer is 3×3, and the input channel and the output channel of the convolution layer are 256.
4. The method of claim 1, wherein the instance location branch network comprises: convolution module C11, convolution module C12, convolution module C13, coordConv layer, convolution module C14, convolution module C15, and convolution module C16;
the convolution modules C11, C12, C13, C14, C15 and C16 include: a convolution layer, a batch normalization layer, and an activation function;
the convolution kernel, input channel, output channel and step size of the convolution layer of the convolution module C11 are 3×3, 256, 256 and 1 respectively;
the convolution kernel, input channel, output channel and step size of the convolution layer of the convolution module C12 are 3×3, 256, 256 and 1 respectively;
the convolution kernel, input channel, output channel and step size of the convolution layer of the convolution module C13 are 3×3, 256, M+N and 1 respectively;
the convolution kernel, input channel, output channel and step size of the convolution layer of the convolution module C14 are 3×3, 256, 256 and 1 respectively;
the convolution kernel, input channel, output channel and step size of the convolution layer of the convolution module C15 are 3×3, 256, M+N and 1 respectively;
the convolution kernel, input channel, output channel and step size of the convolution layer of the convolution module C16 are 3×3, M+N, M+N and 1, respectively.
5. The method of claim 1, wherein the channel attention branching network comprises: global average pooling layer, full connection layer FC1, full connection layer FC2, and Sigmoid layer;
the input channel and the output channel of the full connection layer FC1 are 256 and 16 respectively;
the input channel and the output channel of the fully connected layer FC2 are 16 and 256, respectively.
6. The method of claim 1, wherein the feature encoding network comprises: a CoordConv layer, a convolution module C22, a convolution module C23 and a convolution module C24;
The convolution module C22, the convolution module C23, and the convolution module C24 each include: a convolution layer, a batch normalization layer, and an activation function;
the convolution kernel, step length, input channel and output channel of the convolution layer of the convolution module C22 are respectively: 3×3, 1, 256, and 128;
the convolution kernel, step length, input channel and output channel of the convolution layer of the convolution module C23 are respectively: 3×3, 1, 128, and 128;
the convolution kernel, step length, input channel and output channel of the convolution layer of the convolution module C24 are respectively: 3×3, 1, 128, and M+N.
7. A method according to claim 3, wherein said inputting the initial image into the backbone network for high-dimensional feature extraction, generating a first feature map, comprises:
inputting the initial image to an FPN network based on ResNet for convolution processing to generate a feature map 11, a feature map 12, a feature map 13, a feature map 14 and a feature map 15; the resolutions of the feature maps 11 to 15 are respectively 1/4, 1/8, 1/16, 1/32 and 1/64 of the resolution of the initial image, and the channel numbers of the feature maps 11 to 15 are 256;
the feature map 11, the feature map 12, the feature map 13 and the feature map 14 are respectively input into a convolution module C1, an up-sampling module U1, two up-sampling modules U2 and three up-sampling modules U3 to unify the channel numbers and the resolutions of the feature maps 11 to 14, so as to generate a feature map 11a, a feature map 12a, a feature map 13a and a feature map 14a; the resolution of the feature maps 11a to 14a is 1/4 of the resolution of the initial image, and the channel numbers of the feature maps 11a to 14a are 256;
adding the feature maps 11a to 14a pixel by pixel to generate a feature map 16;
inputting the feature map 16 to a convolution module C2 for feature coding to generate the first feature map; the resolution of the first feature map is 1/4 of the resolution of the initial image, and the channel number of the first feature map is 256.
8. The method of claim 1, wherein the inputting the initial image into the detail extraction branch network for low-dimensional feature extraction, generating a second feature map, comprises:
inputting the initial image into a convolution module C3, a convolution module C4, a convolution module C5, a convolution module C6, a convolution module C7, a convolution module C8, a convolution module C9 and a convolution module C10 which are connected in sequence to perform low-dimensional feature extraction, and generating the second feature map; the resolution of the second characteristic diagram is 1/8 of the resolution of the initial image, and the channel number of the second characteristic diagram is 256.
9. The method of claim 4, wherein the inputting the first feature map into the instance location branch network for location of foreground instances and background regions, generating a third feature map, comprises:
Inputting the first feature map to a convolution module C11, a convolution module C12 and a convolution module C13 which are connected in sequence to carry out convolution operation, and generating a position feature map; the channel number of the position feature map is M+N, M is the category number of the foreground of the initial image, and different examples in the same foreground category correspond to different example numbers; n is the number of categories of the background of the initial image;
inputting the first feature map to a CoordConv layer, a convolution module C14 and a convolution module C15 which are connected in sequence to carry out convolution operation, and generating a position weight map; the number of channels of the position weight map is M+N;
multiplying the position feature map and the position weight map pixel by pixel to generate a feature map 17;
inputting the feature map 17 to a convolution module C16 for encoding to generate the third feature map; and the channel number of the third characteristic diagram is M+N.
10. The method of claim 5, wherein inputting the first profile into the channel attention branching network assigns weights to individual channels of the first profile, generating a fourth profile, comprising:
inputting the first feature map to a global average pooling layer, a full connection layer FC1, a full connection layer FC2 and a Sigmoid layer which are sequentially connected so as to compress and expand the channel number of the first feature map and generate a feature map 18;
Multiplying the first feature map and the feature map 18 pixel by pixel to generate the fourth feature map; the number of channels of the fourth feature map is 256, and the resolution of the fourth feature map is 1/4 of the resolution of the initial image.
11. The method of claim 1, wherein the inputting the second feature map and the fourth feature map into the feature aggregation branch network for fusion, generating a fifth feature map, comprises:
inputting the second feature map to a channel-by-channel convolution DWC1 and a convolution module C17 which are connected in sequence to carry out convolution operation, and generating a feature map 19; the resolution of the feature map 19 is 1/8 of the resolution of the initial image;
inputting the second feature map to a convolution module C18 and an UP-sampling layer UP1 which are connected in sequence to UP-sample the resolution, and generating a feature map 20; the resolution of the feature map 20 is 1/4 of the resolution of the initial image;
inputting the fourth feature map to a convolution module C19 and a Sigmoid layer which are connected in sequence to normalize the pixel value of the fourth feature map to be between 0 and 1, and generating a feature map 21; the resolution of the feature map 21 is 1/8 of the resolution of the initial image;
Inputting the fourth feature map to a channel-by-channel convolution DWC2, a convolution module C20 and a Sigmoid layer which are connected in sequence so as to normalize pixel values of the fourth feature map to be between [0,1] and generate a feature map 22; the resolution of the feature map 22 is 1/4 of the resolution of the initial image;
multiplying the feature map 19 and the feature map 21 pixel by pixel to generate a feature map 23; the resolution of the feature map 23 is 1/8 of the resolution of the initial image;
inputting the feature map 23 into an UP-sampling layer UP2 to UP-sample the resolution, and generating a feature map 25; the resolution of the feature map 25 is 1/4 of the resolution of the initial image;
multiplying the feature map 20 and the feature map 22 pixel by pixel to generate a feature map 24; the resolution of the feature map 24 is 1/4 of the resolution of the initial image;
adding the feature map 24 and the feature map 25 pixel by pixel to generate a feature map 26; the resolution of the feature map 26 is 1/4 of the resolution of the initial image;
inputting the feature map 26 to a convolution module C21 for encoding to generate the fifth feature map; the number of channels and the resolution of the fifth feature map are 256 and 1/4 of the resolution of the initial image, respectively.
12. The method of claim 6, wherein inputting the fifth feature map to the feature encoding network for encoding, generating a sixth feature map, comprises:
inputting the fifth feature map into a CoordConv layer, a convolution module C22, a convolution module C23 and a convolution module C24 which are sequentially connected to perform convolution calculation, so as to generate the sixth feature map; the number of channels of the sixth feature map is M+N, the resolution of the sixth feature map is 1/4 of the resolution of the initial image, M is the number of categories of the foreground of the initial image, and N is the number of categories of the background of the initial image.
13. The method according to claim 1, wherein inputting the seventh feature map to the post-processing module to fuse the channels in the seventh feature map to obtain a panoramic segmentation result of the initial image includes:
performing a 4× up-sampling operation on the seventh feature map to obtain a feature map 27;
normalizing the pixel values corresponding to the pixel points in the feature map 27 to be between 0 and 1 by using a sigmoid function to obtain a feature map 28;
the pixel point category corresponding to the maximum pixel value of the same pixel point in each channel is taken as the final category of the pixel point, a characteristic diagram 29 is obtained, and the characteristic diagram 29 is the panoramic segmentation result of the initial image; the number of channels of the feature map 29 is 1;
Wherein the pixel point categories are M+N; m is the number of foreground classes of the initial image, and different examples in the same foreground class correspond to different example numbers; and N is the category number of the background of the initial image.
14. A multi-branch feature extraction-based panoramic segmentation system, the system comprising:
the preprocessing module is used for preprocessing the RGB image to obtain an initial image;
the panoramic segmentation module is used for inputting the initial image into a pre-constructed panoramic segmentation network to obtain a panoramic segmentation result of the initial image;
the pre-constructed panoramic segmentation network comprises: a backbone network, a detail extraction branch network, an instance positioning branch network, a channel attention branch network, a feature aggregation branch network, a feature encoding network and a post-processing module;
the panorama segmentation module comprises:
the first generation unit is used for inputting the initial image into a backbone network for high-dimensional feature extraction and generating a first feature map;
the second generation unit is used for inputting the initial image into the detail extraction branch network to perform low-dimensional feature extraction and generating a second feature map;
the third generating unit is used for inputting the first feature map into an instance positioning branch network to position foreground instances and background regions and generating a third feature map;
The fourth generation unit is used for inputting the first feature map to the channel attention branch network, distributing weights for all channels of the first feature map and generating a fourth feature map;
the fifth generation unit is used for inputting the second feature map and the fourth feature map into the feature aggregation branch network for fusion to generate a fifth feature map;
the sixth generation unit is used for inputting the fifth characteristic diagram into the characteristic coding network to be coded, and generating a sixth characteristic diagram;
a seventh generating unit, configured to add the third feature map and the sixth feature map pixel by pixel, to generate a seventh feature map;
the eighth generating unit is used for inputting the seventh feature map to the post-processing module so as to fuse all channels in the seventh feature map and obtain a panoramic segmentation result of the initial image;
the detail extraction branch network comprises: convolution module C3, convolution module C4, convolution module C5, convolution module C6, convolution module C7, convolution module C8, convolution module C9 and convolution module C10;
convolution module C3, convolution module C4, convolution module C5, convolution module C6, convolution module C7, convolution module C8, convolution module C9, and convolution module C10 each include: a convolution layer, a batch normalization layer, and an activation function;
The convolution kernel, input channel, output channel, step size and resolution of the convolution layer of convolution module C3 are 3 x 3, 64, 2 and 1/2 of the resolution of the initial image, respectively;
the convolution kernel, input channel, output channel and step size of the convolution layer of convolution module C4 are 3×3, 64, 64 and 1, respectively;
the convolution kernel, input channel, output channel, step size and resolution of the convolution layer of convolution module C5 are 3×3, 64, 64, 2 and 1/2 of the resolution of the initial image, respectively;
the convolution kernel, input channel, output channel and step size of the convolution layer of the convolution module C6 are 3×3, 64, 64 and 1, respectively;
the convolution kernel, input channel, output channel and step size of the convolution layer of the convolution module C7 are 3×3, 64, 64 and 1, respectively;
the convolution kernel, input channel, output channel, step size and resolution of the convolution layer of convolution module C8 are 3×3, 64, 128, 2 and 1/2 of the resolution of the initial image, respectively;
the convolution kernel, input channel, output channel and step size of the convolution layer of convolution module C9 are 3×3, 128, 256 and 1, respectively;
the convolution kernel, input channel, output channel and step size of the convolution layer of the convolution module C10 are 3×3, 256, 256 and 1, respectively;
the feature aggregation branch network comprises: channel-by-channel convolution DWC1, convolution module C17, convolution module C18, UP-sampling layer UP1, convolution module C19, sigmoid layer, channel-by-channel convolution DWC2, convolution module C20, sigmoid layer, UP-sampling layer UP2, and convolution module C21;
Convolution modules C17, C18, C19, C20, and C21 each include: a convolution layer, a batch normalization layer, and an activation function;
the convolution kernel, step size, grouping number, input channel and output channel of the convolution layer of the channel-by-channel convolution DWC1 are respectively: 3×3, 1, 256, 256 and 256;
the convolution kernel, step size, grouping number, input channel and output channel of the convolution layer of the channel-by-channel convolution DWC2 are respectively: 3×3, 1, 256, 256 and 256;
the convolution kernel, step length, input channel and output channel of the convolution layer of the convolution module C17 are respectively: 1×1, 1, 256, and 256;
the convolution kernel, step length, input channel and output channel of the convolution layer of the convolution module C18 are respectively: 3×3, 1, 256, and 256;
the convolution kernel, step length, input channel and output channel of the convolution layer of the convolution module C19 are respectively: 3×3, 2, 256, and 256;
the convolution kernel, step length, input channel and output channel of the convolution layer of the convolution module C20 are respectively: 1×1, 1, 256, and 256;
the convolution kernel, step length, input channel and output channel of the convolution layer of the convolution module C21 are respectively: 3×3, 1, 256, and 256.
CN202310356730.4A 2023-04-04 2023-04-04 Panorama segmentation method and system based on multi-branch feature extraction Active CN116468889B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310356730.4A CN116468889B (en) 2023-04-04 2023-04-04 Panorama segmentation method and system based on multi-branch feature extraction

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310356730.4A CN116468889B (en) 2023-04-04 2023-04-04 Panorama segmentation method and system based on multi-branch feature extraction

Publications (2)

Publication Number Publication Date
CN116468889A CN116468889A (en) 2023-07-21
CN116468889B true CN116468889B (en) 2023-11-07

Family

ID=87183555

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310356730.4A Active CN116468889B (en) 2023-04-04 2023-04-04 Panorama segmentation method and system based on multi-branch feature extraction

Country Status (1)

Country Link
CN (1) CN116468889B (en)


Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11256960B2 (en) * 2020-04-15 2022-02-22 Adobe Inc. Panoptic segmentation

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110188685A (en) * 2019-05-30 2019-08-30 燕山大学 A kind of object count method and system based on the multiple dimensioned cascade network of double attentions
CN111242954A (en) * 2020-01-20 2020-06-05 浙江大学 Panorama segmentation method with bidirectional connection and shielding processing
CN112598673A (en) * 2020-11-30 2021-04-02 北京迈格威科技有限公司 Panorama segmentation method, device, electronic equipment and computer readable medium
CN113222124A (en) * 2021-06-28 2021-08-06 重庆理工大学 SAUNet + + network for image semantic segmentation and image semantic segmentation method
CN115330594A (en) * 2022-07-19 2022-11-11 上海西虹桥导航技术有限公司 Target rapid identification and calibration method based on unmanned aerial vehicle oblique photography 3D model

Non-Patent Citations (5)

* Cited by examiner, † Cited by third party
Title
An End-To-End Network for Panoptic Segmentation; H. Liu et al.; 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR); pp. 6165-6174 *
MaX-DeepLab: End-to-End Panoptic Segmentation with Mask Transformers; Huiyu Wang et al.; arXiv; pp. 1-20 *
Panoptic Feature Pyramid Networks; Kaiming He et al.; arXiv; pp. 1-10 *
Panoptic segmentation method based on pixel-level instance perception; Yuhao Wu et al.; Proceedings of SPIE; full text *
Research on Image Segmentation Technology Based on Machine Learning Methods: A Multi-Branch Panoptic Segmentation Network Model; Qiao Ang; China Master's Theses Full-text Database, Information Science and Technology (No. 01); Chapters 3-4 *

Also Published As

Publication number Publication date
CN116468889A (en) 2023-07-21

Similar Documents

Publication Publication Date Title
CN111210443B (en) Deformable convolution mixing task cascading semantic segmentation method based on embedding balance
CN111583097A (en) Image processing method, image processing device, electronic equipment and computer readable storage medium
CN111914654B (en) Text layout analysis method, device, equipment and medium
CN116051549B (en) Method, system, medium and equipment for dividing defects of solar cell
CN113487618B (en) Portrait segmentation method, portrait segmentation device, electronic equipment and storage medium
CN114973049B (en) Lightweight video classification method with unified convolution and self-attention
CN112508099A (en) Method and device for detecting target in real time
CN114332133A (en) New coronary pneumonia CT image infected area segmentation method and system based on improved CE-Net
CN114359554A (en) Image semantic segmentation method based on multi-receptive-field context semantic information
CN116740527A (en) Remote sensing image change detection method combining U-shaped network and self-attention mechanism
CN116796287A (en) Pre-training method, device, equipment and storage medium for graphic understanding model
CN113705575B (en) Image segmentation method, device, equipment and storage medium
CN116266259A (en) Image and text structured output method and device, electronic equipment and storage medium
CN113554655B (en) Optical remote sensing image segmentation method and device based on multi-feature enhancement
CN116468889B (en) Panorama segmentation method and system based on multi-branch feature extraction
CN113313162A (en) Method and system for detecting multi-scale feature fusion target
CN112446439A (en) Inference method and system for deep learning model dynamic branch selection
CN113076902A (en) Multi-task fused figure fine-grained segmentation system and method
CN116030256A (en) Small object segmentation method, small object segmentation system, device and medium
CN112507933B (en) Saliency target detection method and system based on centralized information interaction
CN117011416A (en) Image processing method, device, equipment, medium and program product
CN112906679B (en) Pedestrian re-identification method, system and related equipment based on human shape semantic segmentation
CN112488115B (en) Semantic segmentation method based on two-stream architecture
CN113610856A (en) Method and device for training image segmentation model and image segmentation
CN113222012A (en) Automatic quantitative analysis method and system for lung digital pathological image

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant