CN112686267A - Image semantic segmentation method and device


Info

Publication number
CN112686267A
CN112686267A
Authority
CN
China
Prior art keywords
short-time dense connection
feature map
module
Prior art date
Legal status
Pending
Application number
CN202110033687.9A
Other languages
Chinese (zh)
Inventor
范铭源
赖申其
黄君实
罗钧峰
魏晓明
张珂
苏金明
郭魏铭
Current Assignee
Beijing Sankuai Online Technology Co Ltd
Original Assignee
Beijing Sankuai Online Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by Beijing Sankuai Online Technology Co Ltd filed Critical Beijing Sankuai Online Technology Co Ltd
Priority to CN202110033687.9A
Publication of CN112686267A


Landscapes

  • Image Analysis (AREA)

Abstract

The present disclosure provides an image semantic segmentation method and device. The method includes the following steps: inputting an image to be processed into an image recognition model, where the image recognition model includes a short-time dense connection network layer and a decoding network layer, the short-time dense connection network layer includes a convolution module, a plurality of short-time dense connection layers, and an output module, and each short-time dense connection layer includes a plurality of short-time dense connection modules; calling the convolution module to process the image to be processed to obtain a first feature map corresponding to the image to be processed; calling the short-time dense connection layers to process the first feature map to obtain a second feature map; calling the output module to process the second feature map to obtain a third feature map; calling the decoding network layer to up-sample the third feature map and map it to the segmentation classes to obtain a fourth feature map whose channel number is the number of segmentation classes; and determining a semantic segmentation result corresponding to the image to be processed according to the fourth feature map. The method and device can reduce the structural redundancy of the network and improve the performance and efficiency of image semantic segmentation.

Description

Image semantic segmentation method and device
Technical Field
The embodiments of the present disclosure relate to the technical field of image processing, and in particular to an image semantic segmentation method and device.
Background
Semantic segmentation is a fundamental task in the field of computer vision. With the progress of deep learning methods in recent years, semantic segmentation has advanced substantially and is applied in more and more scenarios, such as autonomous driving, human-computer interaction, medical analysis, and augmented reality.
To meet the real-time and accuracy requirements of semantic segmentation, one mainstream approach to designing current real-time semantic segmentation networks is to select an existing lightweight backbone network for encoding and to independently design a lightweight decoding network to improve segmentation efficiency. ResNet-18, Xception-39, and the like are frequently chosen backbone networks, but they lack a design customized for the segmentation task; their structures may produce structural redundancy for the segmentation task, and using them directly as backbones reduces semantic segmentation performance and efficiency.
Disclosure of Invention
The embodiments of the present disclosure provide an image semantic segmentation method and device, which enable a network to capture the multi-scale receptive field information required by a segmentation task while removing structural redundancy, so that the network runs efficiently and meets the requirements of real-time semantic segmentation.
According to a first aspect of embodiments of the present disclosure, there is provided an image semantic segmentation method, including:
inputting an image to be processed into an image recognition model; the image recognition model includes: a short-time dense connection network layer and a decoding network layer, the short-time dense connection network layer including: a convolution module, a plurality of serially connected short-time dense connection layers, and an output module;
calling the convolution module to process the image to be processed to obtain a first feature map corresponding to the image to be processed;
calling the short-time dense connection layer to process the first feature map to obtain a second feature map corresponding to the first feature map;
calling the output module to process the second feature map to obtain a third feature map corresponding to the second feature map;
calling the decoding network layer to up-sample the third feature map and map it to the segmentation classes, to obtain a fourth feature map whose channel number is the number of segmentation classes;
and determining a semantic segmentation result corresponding to the image to be processed according to the fourth feature map.
Optionally, the convolution module includes: a first convolution unit and a second convolution unit,
the calling the convolution module to process the image to be processed to obtain a first feature map corresponding to the image to be processed, including:
calling the first convolution unit to perform double down-sampling on the image to be processed through a convolution kernel of a set size, to obtain an intermediate feature map with a first channel number;
calling the second convolution unit to perform double down-sampling on the intermediate feature map through a convolution kernel of a set size, to obtain a first feature map whose channel number is a second channel number;
wherein the second channel number is 2 times the first channel number.
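The channel and resolution bookkeeping of the two convolution units can be traced with a small sketch (the concrete input size and first channel number are hypothetical example values, not fixed by the claim):

```python
def convolution_module_shapes(h, w, first_channels):
    """Trace the shapes produced by the two convolution units.

    Each unit performs double (2x) down-sampling with a kernel of a set
    size; per the claim, the second unit outputs 2x the channels of the
    first. `first_channels` is a hypothetical example value.
    """
    # First convolution unit: 2x down-sampling -> intermediate feature map.
    intermediate = (h // 2, w // 2, first_channels)
    # Second convolution unit: 2x down-sampling again, channel number doubled.
    first_feature_map = (h // 4, w // 4, 2 * first_channels)
    return intermediate, first_feature_map
```

For example, a 512 x 1024 input with a first channel number of 32 yields a 256 x 512 x 32 intermediate feature map and a 128 x 256 x 64 first feature map.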
Optionally, the plurality of short-time dense connection modules in each short-time dense connection layer include: a first short-time dense connection module and at least one second short-time dense connection module,
and the calling the short-time dense connection layer to process the first feature map to obtain a second feature map corresponding to the first feature map includes:
calling a first short-time dense connection module in the first short-time dense connection layer among the plurality of short-time dense connection layers to perform double down-sampling on the first feature map, to obtain a first initial feature map with a third channel number; the third channel number is 2 times the second channel number;
calling at least one second short-time dense connection module in the first short-time dense connection layer to perform short-time dense connection processing on the first initial feature map, to obtain a first processed feature map with the third channel number;
calling the nth short-time dense connection layer, other than the first short-time dense connection layer, among the plurality of short-time dense connection layers to perform double down-sampling and short-time dense connection processing on the feature map output by the (n-1)th short-time dense connection layer, to obtain an nth processed feature map;
acquiring the processed feature map output by the last short-time dense connection layer among the plurality of short-time dense connection layers as the second feature map;
wherein n is a positive integer greater than or equal to 2, and the channel number of the nth processed feature map is 2 times the channel number of the (n-1)th processed feature map output by the (n-1)th short-time dense connection layer.
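The per-layer doubling rule above can be checked with a small helper (the starting channel count and the number of layers are hypothetical example values):

```python
def layer_output_channels(second_channel_number, num_layers):
    """Channel number after each short-time dense connection layer.

    The first module of each layer performs double down-sampling and
    doubles the channel number of its input; the remaining modules of
    the layer keep the channel number unchanged, so the nth layer's
    output has 2x the channels of the (n-1)th layer's output.
    """
    channels = []
    current = second_channel_number
    for _ in range(num_layers):
        current *= 2  # doubling at the first (down-sampling) module of the layer
        channels.append(current)
    return channels
```

Starting from a first feature map with 64 channels, three layers would output feature maps with 128, 256, and 512 channels respectively.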
Optionally, the output module includes: a convolutional layer, an average pooling layer, and a fully convolutional network layer,
and the calling the output module to process the second feature map to obtain a third feature map corresponding to the second feature map includes:
calling the convolutional layer to down-sample the second feature map to obtain a sampled feature map;
calling the average pooling layer to pool the sampled feature map to obtain a pooled feature map;
and calling the fully convolutional network layer to perform convolution processing on the pooled feature map to obtain the third feature map.
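The pooling step can be illustrated with a toy global average pooling over a C x H x W feature map given as nested lists (a minimal stand-in; the claim does not fix the pooling window, so the global variant here is an assumption):

```python
def average_pool(feature_map):
    """Average each channel of a C x H x W feature map (nested lists)
    down to a single value, as a global average pooling layer would."""
    pooled = []
    for channel in feature_map:
        total = sum(sum(row) for row in channel)
        count = len(channel) * len(channel[0])
        pooled.append(total / count)  # one scalar per channel
    return pooled
```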
Optionally, each short-time dense connection module includes a plurality of operation blocks; the channel numbers from the second operation block to the second-to-last operation block of each short-time dense connection module form a geometric sequence, and the channel number of the last operation block is the same as that of the second-to-last operation block.
According to a second aspect of the embodiments of the present disclosure, there is provided an image semantic segmentation apparatus including:
a to-be-processed image input module, configured to input the image to be processed into the image recognition model; the image recognition model includes: a short-time dense connection network layer and a decoding network layer, the short-time dense connection network layer including: a convolution module, a plurality of serially connected short-time dense connection layers, and an output module;
a first feature map acquisition module, configured to call the convolution module to process the image to be processed to obtain a first feature map corresponding to the image to be processed;
a second feature map acquisition module, configured to call the short-time dense connection layer to process the first feature map to obtain a second feature map corresponding to the first feature map;
a third feature map acquisition module, configured to call the output module to process the second feature map to obtain a third feature map corresponding to the second feature map;
a fourth feature map acquisition module, configured to call the decoding network layer to up-sample the third feature map and map it to the segmentation classes, to obtain a fourth feature map whose channel number is the number of segmentation classes;
and a semantic segmentation result determination module, configured to determine a semantic segmentation result corresponding to the image to be processed according to the fourth feature map.
Optionally, the convolution module includes: a first convolution unit and a second convolution unit,
the first feature map acquisition module includes:
the intermediate feature map acquisition sub-module is used for calling the first convolution unit to perform double down-sampling on the image to be processed through a convolution core with a set size to obtain an intermediate feature map with a first channel number;
the first characteristic diagram obtaining submodule is used for calling the second convolution unit to carry out double down sampling on the intermediate characteristic diagram through a convolution core with a set size to obtain a first characteristic diagram with the number of channels being the second number of channels;
wherein the number of the second channels is 2 times of the number of the first channels.
Optionally, the plurality of short-time dense connection modules in each short-time dense connection layer include: a first short-time dense connection module and at least one second short-time dense connection module,
the second feature map acquisition module includes:
a first initial feature map acquisition sub-module, configured to call a first short-time dense connection module in the first short-time dense connection layer among the plurality of short-time dense connection layers to perform double down-sampling on the first feature map to obtain a first initial feature map with a third channel number, where the third channel number is 2 times the second channel number;
a first processed feature map acquisition sub-module, configured to call at least one second short-time dense connection module in the first short-time dense connection layer to perform short-time dense connection processing on the first initial feature map to obtain a first processed feature map with the third channel number;
an nth processed feature map acquisition sub-module, configured to call the nth short-time dense connection layer, other than the first short-time dense connection layer, among the plurality of short-time dense connection layers to perform double down-sampling and short-time dense connection processing on the feature map output by the (n-1)th short-time dense connection layer to obtain an nth processed feature map;
a second feature map acquisition sub-module, configured to acquire the processed feature map output by the last short-time dense connection layer among the plurality of short-time dense connection layers as the second feature map;
wherein n is a positive integer greater than or equal to 2, and the channel number of the nth processed feature map is 2 times the channel number of the (n-1)th processed feature map output by the (n-1)th short-time dense connection layer.
Optionally, the output module includes: a convolutional layer, an average pooling layer, and a fully convolutional network layer,
the third feature map acquisition module includes:
the sampling characteristic diagram acquisition sub-module is used for calling the convolutional layer to carry out down-sampling on the second characteristic diagram to obtain a sampling characteristic diagram;
the pooling characteristic map obtaining sub-module is used for calling the average pooling layer to perform pooling processing on the sampling characteristic map to obtain a pooling characteristic map;
and the third characteristic diagram obtaining submodule is used for calling the full convolution network layer to carry out convolution processing on the pooled characteristic diagram to obtain the third characteristic diagram.
Optionally, each short-time dense connection module includes a plurality of operation blocks; the channel numbers from the second operation block to the second-to-last operation block of each short-time dense connection module form a geometric sequence, and the channel number of the last operation block is the same as that of the second-to-last operation block.
According to a third aspect of embodiments of the present disclosure, there is provided an electronic apparatus including:
a processor, a memory, and a computer program stored on the memory and executable on the processor, the processor implementing the image semantic segmentation method according to any one of the above when executing the program.
According to a fourth aspect of embodiments of the present disclosure, there is provided a readable storage medium, wherein instructions, when executed by a processor of an electronic device, enable the electronic device to perform any one of the image semantic segmentation methods described above.
The embodiments of the present disclosure provide an image semantic segmentation method and device. An image to be processed is input into an image recognition model; the image recognition model includes a short-time dense connection network layer and a decoding network layer, the short-time dense connection network layer includes a convolution module, a plurality of serially connected short-time dense connection layers, and an output module, and each short-time dense connection layer includes a plurality of short-time dense connection modules. The convolution module is called to process the image to be processed to obtain a first feature map corresponding to the image to be processed; the short-time dense connection layers are called to process the first feature map to obtain a second feature map corresponding to the first feature map; the output module is called to process the second feature map to obtain a third feature map corresponding to the second feature map; the decoding network layer is called to up-sample the third feature map and map it to the segmentation classes to obtain a fourth feature map whose channel number is the number of segmentation classes; and a semantic segmentation result corresponding to the image to be processed is determined according to the fourth feature map. In the embodiments of the present disclosure, image processing is performed by serially connected short-time dense connection modules, and the feature dimensionality is gradually reduced with the depth of the network, so that the network can encode the receptive field information required for segmentation while its structural redundancy is reduced, improving the performance and efficiency of image semantic segmentation.
Drawings
In order to illustrate the technical solutions of the embodiments of the present disclosure more clearly, the drawings required for describing the embodiments are briefly introduced below. Obviously, the drawings in the following description are only some embodiments of the present disclosure; for those skilled in the art, other drawings can be obtained from these drawings without inventive effort.
Fig. 1 is a flowchart illustrating steps of an image semantic segmentation method according to an embodiment of the present disclosure;
FIG. 2 is a flowchart illustrating steps of another image semantic segmentation method provided by an embodiment of the present disclosure;
FIG. 3 is a schematic diagram of model configuration parameters provided by an embodiment of the present disclosure;
fig. 4 is a schematic structural diagram of an image semantic segmentation apparatus provided in an embodiment of the present disclosure;
fig. 5 is a schematic structural diagram of another image semantic segmentation apparatus provided in an embodiment of the present disclosure.
Detailed Description
The technical solutions in the embodiments of the present disclosure will be described below clearly and completely with reference to the drawings in the embodiments of the present disclosure. Obviously, the described embodiments are only some, not all, of the embodiments of the present disclosure. All other embodiments obtained by a person skilled in the art without creative effort based on the embodiments of the present disclosure fall within the protection scope of the embodiments of the present disclosure.
Example one
Referring to fig. 1, a flowchart illustrating steps of an image semantic segmentation method provided by an embodiment of the present disclosure is shown, and as shown in fig. 1, the image semantic segmentation method may specifically include the following steps:
step 101: inputting an image to be processed into an image recognition model; the image recognition model includes: a short time dense connection network layer and a decoding network layer, the short time dense connection network layer comprising: the device comprises a convolution module, a plurality of short-time dense connecting layers and an output module, wherein the short-time dense connecting layers are connected in series, and each short-time dense connecting layer comprises a plurality of short-time dense connecting modules.
The embodiment of the disclosure can be applied to scenes for performing semantic segmentation on images.
The image to be processed refers to an image needing image semantic segmentation. In this example, the image to be processed may be a dish image, a map image, or the like, and specifically, may be determined according to business requirements, which is not limited in this embodiment.
The image recognition model refers to a model for semantically segmenting an image, that is, a model for recognizing a pixel class of each pixel in an image.
The model architecture of the image recognition model is as follows: a short time dense connection network layer (STDC net) and a decoding network layer, wherein the short time dense connection network layer may include: the device comprises a convolution module, a plurality of short-time dense connecting layers and an output module, wherein the short-time dense connecting layers are connected in series, and each short-time dense connecting layer comprises a plurality of short-time dense connecting modules.
The short-time dense connection layer is a network layer formed by a plurality of serially connected short-time dense connection modules with identical structures. Each short-time dense connection module may include a plurality of operation blocks (for example, Block1, Block2, Block3, ..., Block n). The channel numbers from the second operation block to the second-to-last operation block of each module form a geometric sequence: from the second operation block onward, each operation block has half the channel number of the previous block, down to the second-to-last block, and the last operation block has the same channel number as the second-to-last block. In this embodiment, within a short-time dense connection module, the kernel size of the first operation block is 1 × 1, and its main role is to reduce the dimensionality of the input feature to half of the output dimensionality; the kernels of all other operation blocks are 3 × 3.
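Under this halving rule, the per-block channel widths and the concatenated module output can be sketched as follows (the block count and output width are hypothetical example values; the 1 × 1 and 3 × 3 kernel sizes affect real tensors but not this channel bookkeeping):

```python
def module_block_channels(out_channels, num_blocks):
    """Per-block output channels of one short-time dense connection
    module: the first (1x1) block outputs half of out_channels, each
    following (3x3) block halves the previous width, and the last block
    repeats the second-to-last width, so that concatenating every
    block's output along the channel axis restores out_channels."""
    widths = [out_channels // 2]          # first block: dimension reduction
    for _ in range(num_blocks - 2):
        widths.append(widths[-1] // 2)    # geometric halving
    widths.append(widths[-1])             # last block matches second-to-last
    return widths
```

For out_channels = 256 and 4 blocks this gives [128, 64, 32, 32], whose channel-wise concatenation sums back to 256, so the module's dense connection preserves its nominal width while most blocks operate on progressively narrower features.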
After the to-be-processed image is acquired, the to-be-processed image may be input to the image recognition model, and then, step 102 is performed.
Step 102: call the convolution module to process the image to be processed to obtain a first feature map corresponding to the image to be processed.
The first feature map is the feature map obtained after the image to be processed is processed by the convolution module.
After the image to be processed is input to the image recognition model, a convolution module can be called to process the image to be processed so as to obtain a first feature map corresponding to the image to be processed. Specifically, the convolution module may include two convolution units, and first, a first convolution unit may be adopted to perform downsampling on an image to be processed to obtain an intermediate image, and then a second convolution unit is adopted to perform downsampling on the intermediate image to obtain a first feature map.
After the convolution module is called to process the image to be processed to obtain the first feature map corresponding to the image to be processed, step 103 is executed.
Step 103: call the short-time dense connection layer to process the first feature map to obtain a second feature map corresponding to the first feature map.
The second feature map is the feature map obtained after the first feature map is processed by the short-time dense connection modules.
After the convolution module is called to process the image to be processed to obtain the first feature map, the short-time dense connection layers can be called to process the first feature map to obtain a second feature map corresponding to the first feature map. Specifically, the first short-time dense connection layer among the plurality of short-time dense connection layers may be called to perform down-sampling and short-time dense connection processing on the first feature map to obtain a first processed feature map; then the second short-time dense connection layer performs down-sampling and short-time dense connection processing on the first processed feature map to obtain a second processed feature map; and so on, until the last short-time dense connection layer performs down-sampling and short-time dense connection processing on the processed feature map output by the second-to-last short-time dense connection layer, whereby the second feature map is obtained.
It can be understood from the description of step 102 to step 103 that the channel number of the feature map output by each network structure layer is 2 times that of the feature map output by the previous layer, while the feature channels within the short-time dense connection modules are reduced layer by layer, so feature redundancy can be effectively removed and the efficiency of the network improved.
After the short-time dense connection layer is called to process the first feature map to obtain a second feature map corresponding to the first feature map, step 104 is executed.
Step 104: call the output module to process the second feature map to obtain a third feature map corresponding to the second feature map.
The third feature map is the feature map output after the output module processes the second feature map.
After the short-time dense connection layer is called to process the first feature map to obtain the second feature map, the output module may be called to process the second feature map to obtain a third feature map corresponding to the second feature map.
And after the output module is called to process the second feature map to obtain a third feature map corresponding to the second feature map, executing step 105.
Step 105: call the decoding network layer to up-sample the third feature map and map it to the segmentation classes, to obtain a fourth feature map whose channel number is the number of segmentation classes.
The fourth feature map is a feature map whose channel number is the number of segmentation classes, obtained after the decoding network layer up-samples the third feature map and maps it to the segmentation classes.
After the output module is called to process the second feature map to obtain the third feature map, the decoding network layer can be called to up-sample the third feature map and map it to the segmentation classes, whereby the fourth feature map whose channel number is the number of segmentation classes is obtained.
After the fourth feature map with the number of channels being the number of segmentation categories is obtained, step 106 is executed.
Step 106: determine a semantic segmentation result corresponding to the image to be processed according to the fourth feature map.
After the fourth feature map whose channel number is the number of segmentation classes is obtained, the semantic segmentation result corresponding to the image to be processed can be determined according to the fourth feature map; that is, the pixel class of each pixel in the image to be processed can be determined from the fourth feature map, and which region of the image belongs to which class can be determined by combining the pixel classes.
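One common way to read the per-pixel classes off a feature map whose channels correspond to segmentation classes is an argmax over the class channels; the patent does not spell out the exact decision rule, so the following is an illustrative assumption:

```python
def pixel_classes(fourth_feature_map):
    """Assign each pixel the class whose channel has the highest score,
    given a fourth feature map as nested lists of shape C x H x W,
    where C is the number of segmentation classes."""
    num_classes = len(fourth_feature_map)
    height = len(fourth_feature_map[0])
    width = len(fourth_feature_map[0][0])
    result = []
    for i in range(height):
        row = []
        for j in range(width):
            scores = [fourth_feature_map[c][i][j] for c in range(num_classes)]
            row.append(scores.index(max(scores)))  # winning class index
        result.append(row)
    return result
```

Grouping adjacent pixels that share a class index then yields the per-region segmentation described above.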
In the embodiments of the present disclosure, image processing is performed by serially connected short-time dense connection modules, and the feature dimensionality is gradually reduced with the depth of the network, so that structural redundancy is reduced while the network can still encode the receptive field information required for segmentation.
The image semantic segmentation method provided by the embodiments of the present disclosure inputs an image to be processed into an image recognition model, where the image recognition model includes a short-time dense connection network layer and a decoding network layer, the short-time dense connection network layer includes a convolution module, a plurality of serially connected short-time dense connection layers, and an output module, and each short-time dense connection layer includes a plurality of short-time dense connection modules. The convolution module is called to process the image to be processed to obtain a first feature map; the short-time dense connection layers are called to process the first feature map to obtain a second feature map; the output module is called to process the second feature map to obtain a third feature map; the decoding network layer is called to up-sample the third feature map and map it to the segmentation classes to obtain a fourth feature map whose channel number is the number of segmentation classes; and a semantic segmentation result corresponding to the image to be processed is determined according to the fourth feature map. By performing image processing with serially connected short-time dense connection modules and gradually reducing the feature dimensionality with network depth, the network can encode the receptive field information required for segmentation while its structural redundancy is reduced, improving the performance and efficiency of image semantic segmentation.
Example two
Referring to fig. 2, a flowchart illustrating steps of another image semantic segmentation method provided by an embodiment of the present disclosure is shown, and as shown in fig. 2, the image semantic segmentation method may specifically include the following steps:
step 201: inputting an image to be processed into an image recognition model; the image recognition model includes: a short time dense connection network layer and a decoding network layer, the short time dense connection network layer comprising: the device comprises a convolution module, a plurality of short-time dense connecting layers and an output module, wherein the short-time dense connecting layers are connected in series, and each short-time dense connecting layer comprises a plurality of short-time dense connecting modules.
The embodiment of the disclosure can be applied to scenes for performing semantic segmentation on images.
The image to be processed refers to an image needing image semantic segmentation. In this example, the image to be processed may be a dish image, a map image, or the like, and specifically, may be determined according to business requirements, which is not limited in this embodiment.
The image recognition model refers to a model for semantically segmenting an image, that is, a model for recognizing a pixel class of each pixel in an image.
The model architecture of the image recognition model is as follows: a short-time dense connection network layer (STDC net) and a decoding network layer, wherein the short-time dense connection network layer may include: a convolution module, a plurality of short-time dense connection layers connected in series, and an output module, each short-time dense connection layer comprising a plurality of short-time dense connection modules.
The short-time dense connection layer is a network layer formed by a plurality of serially connected short-time dense connection modules of identical structure. Each short-time dense connection module comprises a plurality of operation blocks; within a module, the channel numbers from the second operation block to the second-to-last operation block form a geometric sequence, and the channel number of the last operation block equals that of the second-to-last operation block. Specifically, within a short-time dense connection module, the kernel channel number of each operation block from the second to the second-to-last is half that of the preceding block, and the kernel channel number of the last operation block is kept consistent with the second-to-last one. In this embodiment, the kernel size of the first operation block in a short-time dense connection module is 1 × 1, its main role being to reduce the input feature dimension to half of the output dimension, while the kernels of all operation blocks other than the first are 3 × 3.
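The channel arithmetic described above can be sketched in plain Python. The helper below is illustrative only (the function name and the example widths are not taken from the patent); it computes the per-operation-block widths for a module whose concatenated output should have `out_channels` channels, assuming the halving scheme just described.

```python
def stdc_block_channels(out_channels, num_blocks):
    """Channel widths of the operation blocks in one short-time dense
    connection module: the first (1x1) block halves the output
    dimension, each following block halves again, and the last block
    repeats the second-to-last width, so the widths sum back to
    out_channels when the blocks are concatenated."""
    widths = [out_channels // 2]          # first block: half the output dim
    for _ in range(num_blocks - 2):       # geometric halving in between
        widths.append(widths[-1] // 2)
    widths.append(widths[-1])             # last block matches the penultimate
    return widths
```

A useful property of this scheme is that the concatenation of all block outputs recovers the module's nominal width, e.g. `stdc_block_channels(256, 4)` gives `[128, 64, 32, 32]`, which sums to 256.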
After the to-be-processed image is acquired, the to-be-processed image may be input to the image recognition model, and then, step 202 is performed.
Step 202: calling the first convolution unit to perform double down-sampling on the image to be processed through a convolution kernel of a set size, so as to obtain an intermediate feature map whose channel number is the first channel number.
The following describes the technical solution of the embodiment of the present disclosure in detail with reference to fig. 3.
In this embodiment, the first convolution unit may be ConvX1. After the image to be processed (the Image shown in fig. 3, with a size of 224 × 224) is input to the image recognition model, ConvX1 may be called to perform double down-sampling: specifically, ConvX1 processes the image to be processed with a 3 × 3 convolution kernel at a stride of 2, so that ConvX1 outputs an intermediate feature map with 32 channels, as shown in fig. 3.
After the first convolution unit is called to perform double down-sampling on the image to be processed by the convolution kernel with the set size, so as to obtain the intermediate feature map with the first channel number, step 203 is executed.
Step 203: calling the second convolution unit to perform double down-sampling on the intermediate feature map through a convolution kernel of a set size, so as to obtain a first feature map whose channel number is the second channel number.
The second convolution unit is ConvX2 shown in fig. 3. After the intermediate feature map is obtained, ConvX2 may be called to perform double down-sampling on the intermediate feature map through a convolution kernel of a set size, so as to obtain the first feature map whose channel number is the second channel number. This process is similar to step 202 and is not repeated here.
In this embodiment, the second channel number is 2 times the first channel number; increasing the number of feature map channels enlarges the receptive field of the image encoding.
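The spatial effect of the two stem convolutions can be checked with the standard convolution output-size formula. This is a generic sketch, not code from the patent; it assumes the common 3 × 3, stride-2, padding-1 setting, which reproduces the 224 → 112 → 56 progression consistent with fig. 3.

```python
def conv_out_size(size, kernel=3, stride=2, padding=1):
    """Output spatial size of a convolution: floor((size + 2*padding
    - kernel) / stride) + 1. With kernel=3, stride=2, padding=1 this
    halves the resolution, i.e. performs double down-sampling."""
    return (size + 2 * padding - kernel) // stride + 1

# ConvX1 then ConvX2, each halving the spatial resolution:
after_convx1 = conv_out_size(224)            # 224 -> 112
after_convx2 = conv_out_size(after_convx1)   # 112 -> 56
```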
After the second convolution unit is called and the intermediate feature map is downsampled twice by the convolution kernel with the set size to obtain the first feature map with the number of channels being the second number of channels, step 204 is executed.
Step 204: calling the first short-time dense connection module in the first short-time dense connection layer of the plurality of short-time dense connection layers to perform double down-sampling on the first feature map, so as to obtain a first initial feature map whose channel number is a third channel number; the third channel number is 2 times the second channel number.
In the present embodiment, the plurality of short-time dense connection modules in each short-time dense connection layer includes: a first short-time dense connection module and at least one second short-time dense connection module. The first short-time dense connection module performs double down-sampling on the feature map output by the preceding module, and the at least one second short-time dense connection module performs short-time dense connection processing on the feature map output by the first short-time dense connection module.
The first initial feature map is a feature map obtained by down-sampling a first feature map using a first short-time dense connection module in a first short-time dense connection layer of the plurality of short-time dense connection layers.
The third channel number refers to the channel number of the obtained first initial characteristic diagram.
After the second convolution unit is called to perform double down-sampling on the intermediate feature map by the convolution kernel with the set size to obtain the first feature map with the number of channels being the second number of channels, the first short-time dense connection module in the first short-time dense connection layer may be called to perform double down-sampling on the first feature map to obtain the first initial feature map with the number of channels being the third number of channels, where the third number of channels is 2 times the second number of channels.
After the first short-time dense connection module in the first short-time dense connection layer of the plurality of short-time dense connection layers is called to perform double down-sampling on the first feature map, so as to obtain a first initial feature map with the number of channels being the third number of channels, step 205 is executed.
Step 205: calling at least one second short-time dense connection module in the first short-time dense connection layer to perform short-time dense connection processing on the first initial feature map, so as to obtain a first processing feature map whose channel number is the third channel number.
After the first initial feature map is obtained, at least one second short-time dense connection module in the first short-time dense connection layer may be called to perform short-time dense connection processing on the first initial feature map to obtain a first processing feature map, where the number of channels of the first processing feature map is the same as that of the first initial feature map, and is the third number of channels.
After the first process profile is obtained, step 206 is performed.
Step 206: calling the nth short-time dense connection layer, other than the first short-time dense connection layer, of the plurality of short-time dense connection layers, and performing double down-sampling and short-time dense connection processing on the feature map output by the (n-1)th short-time dense connection layer to obtain an nth processing feature map.
In this embodiment, the nth short-time dense connection layer, other than the first short-time dense connection layer, of the plurality of short-time dense connection layers may be called to perform double down-sampling and short-time dense connection processing on the feature map output by the (n-1)th short-time dense connection layer, so as to obtain an nth processing feature map, where n is a positive integer greater than or equal to 2 and the channel number of the nth processing feature map is 2 times that of the (n-1)th processing feature map output by the (n-1)th short-time dense connection layer.
Specifically, after the first processing feature map is obtained, the first short-time dense connection module in the second short-time dense connection layer performs double down-sampling on the first processing feature map to obtain a second initial feature map, and at least one second short-time dense connection module in the second short-time dense connection layer performs short-time dense connection processing on the second initial feature map to obtain a second processing feature map. The first short-time dense connection module in the third short-time dense connection layer then performs double down-sampling on the second processing feature map to obtain a third initial feature map, and at least one second short-time dense connection module in the third short-time dense connection layer performs short-time dense connection processing on the third initial feature map to obtain a third processing feature map; and so on, until the last short-time dense connection layer performs double down-sampling and short-time dense connection processing on the processing feature map output by the second-to-last short-time dense connection layer.
Step 207: acquiring the processing feature map output by the last short-time dense connection layer of the plurality of short-time dense connection layers, and taking this processing feature map as the second feature map.
After all the short-time dense connection layers are subjected to corresponding image processing, a processing characteristic diagram output by the last short-time dense connection layer in the plurality of short-time dense connection layers can be obtained, and the processing characteristic diagram is used as a second characteristic diagram.
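The stage-by-stage progression through the serially connected layers can be sketched as follows: each layer halves the spatial resolution (via its first module) and doubles the channel count. The starting values (64 channels at 56 × 56 after the two stem convolutions) and the three-layer depth are illustrative assumptions consistent with the 2× rules above, not figures quoted from fig. 3.

```python
def stage_shapes(in_channels, in_size, num_layers):
    """Track (channels, spatial size) through the serially connected
    short-time dense connection layers: the first module of each
    layer down-samples by 2x and the layer doubles the channels."""
    shapes = []
    channels, size = in_channels, in_size
    for _ in range(num_layers):
        channels, size = channels * 2, size // 2
        shapes.append((channels, size))
    return shapes
```

For example, `stage_shapes(64, 56, 3)` yields `[(128, 28), (256, 14), (512, 7)]`; the last tuple is the shape of the second feature map in this hypothetical configuration.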
After the second feature map is obtained, step 208 is performed.
Step 208: calling the convolutional layer to down-sample the second feature map, so as to obtain a sampling feature map.
In this embodiment, the output module may include a convolutional layer, an average pooling layer and a full convolutional network layer, as shown in fig. 3, the convolutional layer is ConvX6, the average pooling layer is GlobalPool, and the full convolutional network layer is FC (i.e., FC1 and FC 2).
After the second feature map is obtained, the convolutional layer may be called to down-sample the second feature map to obtain a sampled feature map having the same number of channels as the second feature map, as shown in fig. 3.
After the convolutional layer is called to down-sample the second feature map to obtain a sampled feature map, step 209 is performed.
Step 209: calling the average pooling layer to perform pooling on the sampling feature map, so as to obtain a pooled feature map.
After the sampling feature map is obtained, the average pooling layer can be called to perform pooling on the sampling feature map to obtain a pooled feature map.
The pooling process is a technique commonly used in the art, and the present embodiment will not be described in detail herein.
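For concreteness, average pooling over the whole spatial extent (as the GlobalPool layer of the output module does) can be sketched in plain Python on a (C, H, W) nested list. The function name is hypothetical; a real implementation would use a framework pooling operator.

```python
def global_avg_pool(feature_map):
    """Collapse each channel of a (C, H, W) feature map to a single
    scalar by averaging over all spatial positions, producing a
    C-element vector."""
    pooled = []
    for channel in feature_map:
        total = sum(sum(row) for row in channel)
        count = len(channel) * len(channel[0])
        pooled.append(total / count)
    return pooled
```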
After the average pooling layer is called to pool the sampling feature map to obtain a pooled feature map, step 210 is performed.
Step 210: calling the full convolution network layer to perform convolution processing on the pooled feature map, so as to obtain the third feature map.
After the average pooling layer is called to perform pooling processing on the sampling feature map to obtain a pooled feature map, the full convolution network layer can be called to perform convolution processing on the pooled feature map to obtain a third feature map.
After the third feature map is obtained, step 211 is executed.
Step 211: calling the decoding network layer to up-sample the third feature map and map it to the segmentation classes, so as to obtain a fourth feature map whose channel number is the number of segmentation classes.
The fourth feature map is a feature map in which the number of channels is the number of segmentation classes obtained after the third feature map is up-sampled by a decoding network layer and mapped to the segmentation classes.
After the output module is called to process the second feature map to obtain a third feature map corresponding to the second feature map, the decoding network layer can be called to perform upsampling on the third feature map and map the third feature map to the segmentation classes, and then a fourth feature map with the channel number being the segmentation class number can be obtained.
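As a minimal stand-in for the decoding layer's up-sampling (the patent does not fix the interpolation mode, so nearest-neighbour is assumed here purely for illustration), up-sampling of a single channel can be sketched as:

```python
def upsample_nearest(channel, factor):
    """Nearest-neighbour upsampling of one (H, W) channel by an
    integer factor: every value is repeated `factor` times along
    both the row and column directions."""
    out = []
    for row in channel:
        wide = [value for value in row for _ in range(factor)]
        out.extend(list(wide) for _ in range(factor))
    return out
```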
After the fourth feature map with the number of channels being the number of segmentation categories is obtained, step 212 is executed.
Step 212: determining the semantic segmentation result corresponding to the image to be processed according to the fourth feature map.
After the fourth feature map, whose channel number is the number of segmentation classes, is obtained, the semantic segmentation result corresponding to the image to be processed can be determined from it: the fourth feature map gives the pixel class of each pixel in the image to be processed, and from these pixel classes it can be determined which region of the image belongs to which class.
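Reading off per-pixel classes from the fourth feature map amounts to taking, at each pixel, the channel with the highest score. A plain-Python sketch on a (num_classes, H, W) nested list (function name hypothetical):

```python
def segmentation_result(fourth_feature_map):
    """Turn a (num_classes, H, W) score map into an (H, W) map of
    class indices by picking the highest-scoring channel per pixel."""
    num_classes = len(fourth_feature_map)
    height = len(fourth_feature_map[0])
    width = len(fourth_feature_map[0][0])
    return [[max(range(num_classes),
                 key=lambda c: fourth_feature_map[c][y][x])
             for x in range(width)]
            for y in range(height)]
```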
According to the embodiment of the disclosure, image processing is performed by serially connected short-time dense connection modules whose feature dimensionality is gradually reduced with the depth of the network, so that the network can encode the receptive field information required for segmentation while its structural redundancy is reduced.
In the image semantic segmentation method provided by the embodiment of the disclosure, an image to be processed is input into an image recognition model. The image recognition model comprises a short-time dense connection network layer and a decoding network layer; the short-time dense connection network layer comprises a convolution module, a plurality of serially connected short-time dense connection layers, and an output module, and each short-time dense connection layer comprises a plurality of short-time dense connection modules. The convolution module is invoked to process the image to be processed to obtain a first feature map; the short-time dense connection layers are invoked to process the first feature map to obtain a second feature map; the output module is invoked to process the second feature map to obtain a third feature map; the decoding network layer is invoked to up-sample the third feature map and map it to the segmentation classes, obtaining a fourth feature map whose channel number is the number of segmentation classes; and the semantic segmentation result corresponding to the image to be processed is determined according to the fourth feature map. Because image processing is performed by serially connected short-time dense connection modules and the feature dimensionality is gradually reduced with the depth of the network, the network can encode the receptive field information required for segmentation while its structural redundancy is reduced, which improves both the performance and the efficiency of image semantic segmentation.
EXAMPLE III
Referring to fig. 4, a schematic structural diagram of an image semantic segmentation apparatus provided by an embodiment of the present disclosure is shown, and as shown in fig. 4, the image semantic segmentation apparatus 400 may specifically include the following modules:
a to-be-processed image input module 410, configured to input the to-be-processed image to the image recognition model; the image recognition model includes: a short time dense connection network layer and a decoding network layer, the short time dense connection network layer comprising: the system comprises a convolution module, a plurality of short-time dense connection layers and an output module, wherein the short-time dense connection layers are connected in series;
a first feature map obtaining module 420, configured to invoke the convolution module to process the image to be processed, so as to obtain a first feature map corresponding to the image to be processed;
a second feature map obtaining module 430, configured to invoke the short-time dense connection layer to process the first feature map, so as to obtain a second feature map corresponding to the first feature map;
a third feature map obtaining module 440, configured to invoke the output module to process the second feature map, so as to obtain a third feature map corresponding to the second feature map;
a fourth feature map obtaining module 450, configured to invoke the decoding network layer to perform upsampling on the third feature map, and map the upsampled third feature map to a segmentation class, so as to obtain a fourth feature map with a channel number being the number of segmentation classes;
a semantic segmentation result determining module 460, configured to determine a semantic segmentation result corresponding to the image to be processed according to the fourth feature map.
Optionally, each short-time dense connection module includes a plurality of operation blocks; the channel numbers from the second operation block to the second-to-last operation block form a geometric sequence, and the channel number of the last operation block is the same as that of the second-to-last operation block.
The image semantic segmentation device provided by the embodiment of the disclosure inputs an image to be processed into an image recognition model. The image recognition model comprises a short-time dense connection network layer and a decoding network layer; the short-time dense connection network layer comprises a convolution module, a plurality of serially connected short-time dense connection layers, and an output module, and each short-time dense connection layer comprises a plurality of short-time dense connection modules. The convolution module is invoked to process the image to be processed to obtain a first feature map; the short-time dense connection layers are invoked to process the first feature map to obtain a second feature map; the output module is invoked to process the second feature map to obtain a third feature map; the decoding network layer is invoked to up-sample the third feature map and map it to the segmentation classes, obtaining a fourth feature map whose channel number is the number of segmentation classes; and the semantic segmentation result corresponding to the image to be processed is determined according to the fourth feature map. Because image processing is performed by serially connected short-time dense connection modules and the feature dimensionality is gradually reduced with the depth of the network, the network can encode the receptive field information required for segmentation while its structural redundancy is reduced, which improves both the performance and the efficiency of image semantic segmentation.
Example four
Referring to fig. 5, a schematic structural diagram of another image semantic segmentation apparatus provided in an embodiment of the present disclosure is shown, and as shown in fig. 5, the image semantic segmentation apparatus 500 may specifically include the following modules:
a to-be-processed image input module 510, configured to input the to-be-processed image to the image recognition model; the image recognition model includes: a short time dense connection network layer and a decoding network layer, the short time dense connection network layer comprising: the system comprises a convolution module, a plurality of short-time dense connection layers and an output module, wherein the short-time dense connection layers are connected in series;
a first feature map obtaining module 520, configured to invoke the convolution module to process the image to be processed, so as to obtain a first feature map corresponding to the image to be processed;
a second feature map obtaining module 530, configured to invoke the short-time dense connection layer to process the first feature map, so as to obtain a second feature map corresponding to the first feature map;
a third feature map obtaining module 540, configured to invoke the output module to process the second feature map, so as to obtain a third feature map corresponding to the second feature map;
a fourth feature map obtaining module 550, configured to invoke the decoding network layer to perform upsampling on the third feature map, and map the upsampled third feature map to a segmentation class, so as to obtain a fourth feature map with a channel number being a segmentation class number;
and a semantic segmentation result determining module 560, configured to determine a semantic segmentation result corresponding to the image to be processed according to the fourth feature map.
Optionally, the convolution module includes: a first convolution unit and a second convolution unit,
the first feature map obtaining module 520 includes:
the intermediate feature map obtaining sub-module 521 is configured to invoke the first convolution unit to perform double downsampling on the to-be-processed image through a convolution kernel with a set size, so as to obtain an intermediate feature map with a first channel number;
the first feature map obtaining sub-module 522 is configured to invoke the second convolution unit to perform double downsampling on the intermediate feature map by using a convolution kernel with a set size, so as to obtain a first feature map with a second channel number;
wherein the number of the second channels is 2 times of the number of the first channels.
Optionally, the plurality of short time dense connection modules in each short time dense connection layer comprises: a first short-time dense connection module and at least one second short-time dense connection module,
the second feature map obtaining module 530 includes:
a first initial feature map obtaining sub-module 531, configured to call a first short-time dense connection module in a first short-time dense connection layer of the multiple short-time dense connection layers to perform double down-sampling on the first feature map, so as to obtain a first initial feature map with a third channel number; the third channel number is 2 times of the second channel number;
a first processing feature map obtaining sub-module 532, configured to invoke at least one second short-time dense connection module in the first short-time dense connection layer to perform short-time dense connection processing on the first initial feature map, so as to obtain a first processing feature map with a third channel number;
an nth processing feature map obtaining sub-module 533, configured to call an nth short-time dense connection layer, excluding the first short-time dense connection layer, of the multiple short-time dense connection layers, and perform double down-sampling and short-time dense connection processing on the feature map output by the (n-1) th short-time dense connection layer to obtain an nth processing feature map;
the second feature map obtaining sub-module 534 is configured to obtain a processing feature map output by the last short-time dense connection layer in the plurality of short-time dense connection layers, and use the processing feature map as the second feature map;
wherein n is a positive integer greater than or equal to 2, and the number of channels of the nth processing characteristic diagram is 2 times of the number of channels of the (n-1) th processing characteristic diagram output by the (n-1) th short-time dense connection layer.
Optionally, the output module includes: convolutional layers, average pooling layers, and full convolutional networking layers,
the third feature map obtaining module 540 includes:
a sampling feature map acquisition sub-module 541, configured to invoke the convolutional layer to perform downsampling on the second feature map to obtain a sampling feature map;
the pooling feature map obtaining sub-module 542 is configured to invoke the average pooling layer to perform pooling processing on the sampling feature map, so as to obtain a pooling feature map;
and a third feature map obtaining sub-module 543, configured to call the full convolution network layer to perform convolution processing on the pooled feature map, so as to obtain the third feature map.
The image semantic segmentation device provided by the embodiment of the disclosure inputs an image to be processed into an image recognition model. The image recognition model comprises a short-time dense connection network layer and a decoding network layer; the short-time dense connection network layer comprises a convolution module, a plurality of serially connected short-time dense connection layers, and an output module, and each short-time dense connection layer comprises a plurality of short-time dense connection modules. The convolution module is invoked to process the image to be processed to obtain a first feature map; the short-time dense connection layers are invoked to process the first feature map to obtain a second feature map; the output module is invoked to process the second feature map to obtain a third feature map; the decoding network layer is invoked to up-sample the third feature map and map it to the segmentation classes, obtaining a fourth feature map whose channel number is the number of segmentation classes; and the semantic segmentation result corresponding to the image to be processed is determined according to the fourth feature map. Because image processing is performed by serially connected short-time dense connection modules and the feature dimensionality is gradually reduced with the depth of the network, the network can encode the receptive field information required for segmentation while its structural redundancy is reduced, which improves both the performance and the efficiency of image semantic segmentation.
An embodiment of the present disclosure also provides an electronic device, including: a processor, a memory and a computer program stored on the memory and executable on the processor, the processor implementing the image semantic segmentation method of the foregoing embodiments when executing the program.
Embodiments of the present disclosure also provide a readable storage medium, wherein when the instructions in the storage medium are executed by a processor of an electronic device, the electronic device is enabled to execute the image semantic segmentation method of the foregoing embodiments.
For the device embodiment, since it is basically similar to the method embodiment, the description is simple, and for the relevant points, refer to the partial description of the method embodiment.
The algorithms and displays presented herein are not inherently related to any particular computer, virtual machine, or other apparatus. Various general purpose systems may also be used with the teachings herein. The required structure for constructing such a system will be apparent from the description above. In addition, embodiments of the present disclosure are not directed to any particular programming language. It is appreciated that a variety of programming languages may be used to implement the teachings of the embodiments of the present disclosure as described herein, and any descriptions of specific languages are provided above to disclose the best modes of the embodiments of the present disclosure.
In the description provided herein, numerous specific details are set forth. However, it is understood that embodiments of the disclosure may be practiced without these specific details. In some instances, well-known methods, structures and techniques have not been shown in detail in order not to obscure an understanding of this description.
Similarly, it should be appreciated that in the foregoing description of exemplary embodiments of the disclosure, various features of the embodiments of the disclosure are sometimes grouped together in a single embodiment, figure, or description thereof for the purpose of streamlining the disclosure and aiding in the understanding of one or more of the various inventive aspects. However, the disclosed method should not be interpreted as reflecting an intention that: that is, claimed embodiments of the disclosure require more features than are expressly recited in each claim. Rather, as the following claims reflect, inventive aspects lie in less than all features of a single foregoing disclosed embodiment. Thus, the claims following the detailed description are hereby expressly incorporated into this detailed description, with each claim standing on its own as a separate embodiment of an embodiment of this disclosure.
Those skilled in the art will appreciate that the modules in the device in an embodiment may be adaptively changed and disposed in one or more devices different from the embodiment. The modules or units or components of the embodiments may be combined into one module or unit or component, and furthermore they may be divided into a plurality of sub-modules or sub-units or sub-components. All of the features disclosed in this specification (including any accompanying claims, abstract and drawings), and all of the processes or elements of any method or apparatus so disclosed, may be combined in any combination, except combinations where at least some of such features and/or processes or elements are mutually exclusive. Each feature disclosed in this specification (including any accompanying claims, abstract and drawings) may be replaced by alternative features serving the same, equivalent or similar purpose, unless expressly stated otherwise.
The various component embodiments of the disclosure may be implemented in hardware, or in software modules running on one or more processors, or in a combination thereof. It will be understood by those skilled in the art that a microprocessor or Digital Signal Processor (DSP) may be used in practice to implement some or all of the functions of some or all of the components in a motion picture generating device according to an embodiment of the present disclosure. Embodiments of the present disclosure may also be implemented as an apparatus or device program for performing a portion or all of the methods described herein. Such programs implementing embodiments of the present disclosure may be stored on a computer readable medium or may be in the form of one or more signals. Such a signal may be downloaded from an internet website or provided on a carrier signal or in any other form.
It should be noted that the above-mentioned embodiments illustrate rather than limit the disclosure, and that those skilled in the art will be able to design alternative embodiments without departing from the scope of the appended claims. In the claims, any reference signs placed between parentheses shall not be construed as limiting the claim. The word "comprising" does not exclude the presence of elements or steps not listed in a claim. The word "a" or "an" preceding an element does not exclude the presence of a plurality of such elements. Embodiments of the disclosure may be implemented by means of hardware comprising several distinct elements and by means of a suitably programmed computer. In a unit claim enumerating several means, several of these means may be embodied by one and the same item of hardware. The use of the words first, second, third and so on does not indicate any ordering; these words may be interpreted as names.
It is clear to those skilled in the art that, for convenience and brevity of description, the specific working processes of the above-described systems, apparatuses and units may refer to the corresponding processes in the foregoing method embodiments, and are not described herein again.
The above description is only for the purpose of illustrating the preferred embodiments of the present disclosure and is not to be construed as limiting the embodiments of the present disclosure, and any modifications, equivalents, improvements and the like that are made within the spirit and principle of the embodiments of the present disclosure are intended to be included within the scope of the embodiments of the present disclosure.
The above description is only a specific implementation of the embodiments of the present disclosure, but the scope of the embodiments of the present disclosure is not limited thereto, and any person skilled in the art can easily conceive of changes or substitutions within the technical scope of the embodiments of the present disclosure, and all the changes or substitutions should be covered by the scope of the embodiments of the present disclosure. Therefore, the protection scope of the embodiments of the present disclosure shall be subject to the protection scope of the claims.

Claims (12)

1. An image semantic segmentation method, comprising:
inputting an image to be processed into an image recognition model; the image recognition model comprising: a short-time dense connection network layer and a decoding network layer, the short-time dense connection network layer comprising: a convolution module, a plurality of short-time dense connection layers connected in series, and an output module, wherein each short-time dense connection layer comprises a plurality of short-time dense connection modules;
calling the convolution module to process the image to be processed to obtain a first feature map corresponding to the image to be processed;
calling the short-time dense connection layer to process the first feature map to obtain a second feature map corresponding to the first feature map;
calling the output module to process the second feature map to obtain a third feature map corresponding to the second feature map;
calling the decoding network layer to up-sample the third feature map and map it to the segmentation classes, obtaining a fourth feature map whose number of channels equals the number of segmentation classes;
and determining a semantic segmentation result corresponding to the image to be processed according to the fourth feature map.
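As a reading aid (not part of the claims), the data flow of claim 1 can be sketched as a spatial/channel shape trace. The concrete numbers below (512×512 input, 32 starting channels, 3 dense layers, 19 classes) are illustrative assumptions; the claims fix only the order of the modules and the down-sampling/doubling ratios of the dependent claims:

```python
# Hypothetical shape trace through the claimed pipeline. Down-sampling
# factors and channel doublings follow dependent claims 2-3; the concrete
# numbers are assumptions for illustration only.

def conv_module(h, w, c_first=32):
    # two successive double down-samplings; the second doubles the channels
    return h // 4, w // 4, 2 * c_first

def dense_layers(h, w, c, num_layers=3):
    # each short-time dense connection layer halves the spatial size
    # and doubles the channel count of its predecessor
    for _ in range(num_layers):
        h, w, c = h // 2, w // 2, 2 * c
    return h, w, c

def output_module(h, w, c):
    # one further down-sampling convolution before pooling and convolution
    return h // 2, w // 2, c

def decoder(num_classes):
    # up-sampling maps the channel count to the number of segmentation classes
    return num_classes

h, w, c = conv_module(512, 512)      # first feature map: 128 x 128 x 64
h, w, c = dense_layers(h, w, c)      # second feature map: 16 x 16 x 512
h, w, c = output_module(h, w, c)     # third feature map: 8 x 8 x 512
num_out_channels = decoder(19)       # fourth feature map has 19 channels
```

The trace only tracks shapes; the actual feature computation (convolutions, dense connections, pooling) is defined by the claims and description.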
2. The method of claim 1, wherein the convolution module comprises: a first convolution unit and a second convolution unit,
wherein calling the convolution module to process the image to be processed to obtain the first feature map corresponding to the image to be processed comprises:
calling the first convolution unit to perform double down-sampling on the image to be processed through a convolution kernel of a set size, to obtain an intermediate feature map having a first number of channels;
calling the second convolution unit to perform double down-sampling on the intermediate feature map through a convolution kernel of a set size, to obtain a first feature map having a second number of channels;
wherein the second number of channels is 2 times the first number of channels.
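For concreteness, each "double down-sampling" unit of claim 2 is typically realised as a stride-2 convolution. The kernel size and padding below are assumptions (the claim fixes only the 2× spatial ratio, not the kernel parameters):

```python
def conv_out_size(size, kernel=3, stride=2, padding=1):
    # standard convolution output-size formula; with kernel 3, stride 2,
    # padding 1 an even input size is exactly halved
    return (size + 2 * padding - kernel) // stride + 1

# two units in series, as in claim 2: 224 -> 112 -> 56
half = conv_out_size(224)
quarter = conv_out_size(half)
```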
3. The method of claim 1, wherein the plurality of short-time dense connection modules in each short-time dense connection layer comprise: a first short-time dense connection module and at least one second short-time dense connection module,
wherein calling the short-time dense connection layer to process the first feature map to obtain the second feature map corresponding to the first feature map comprises:
calling a first short-time dense connection module in a first short-time dense connection layer of the plurality of short-time dense connection layers to perform double down-sampling on the first feature map, to obtain a first initial feature map having a third number of channels, the third number of channels being 2 times the second number of channels;
calling the at least one second short-time dense connection module in the first short-time dense connection layer to perform short-time dense connection processing on the first initial feature map, to obtain a first processed feature map having the third number of channels;
calling an nth short-time dense connection layer, other than the first short-time dense connection layer, of the plurality of short-time dense connection layers to perform double down-sampling and short-time dense connection processing on the feature map output by the (n-1)th short-time dense connection layer, to obtain an nth processed feature map;
acquiring the processed feature map output by the last short-time dense connection layer of the plurality of short-time dense connection layers, and taking the processed feature map as the second feature map;
wherein n is a positive integer greater than or equal to 2, and the number of channels of the nth processed feature map is 2 times the number of channels of the (n-1)th processed feature map output by the (n-1)th short-time dense connection layer.
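The channel-count recursion of claim 3 (each layer's output has twice the channels of its predecessor) amounts to geometric doubling across the layers; the starting count of 64 channels below is an illustrative assumption:

```python
def layer_channel_counts(c_in, num_layers):
    # layer 1 doubles the input channel count; every later layer
    # doubles the previous layer's output again (claim 3)
    counts, c = [], c_in
    for _ in range(num_layers):
        c *= 2
        counts.append(c)
    return counts

# e.g. a second channel number of 64 across four dense connection layers
counts = layer_channel_counts(64, 4)   # [128, 256, 512, 1024]
```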
4. The method of claim 1, wherein the output module comprises: a convolutional layer, an average pooling layer and a fully convolutional network layer,
wherein calling the output module to process the second feature map to obtain the third feature map corresponding to the second feature map comprises:
calling the convolutional layer to down-sample the second feature map to obtain a sampled feature map;
calling the average pooling layer to perform pooling on the sampled feature map to obtain a pooled feature map;
and calling the fully convolutional network layer to perform convolution on the pooled feature map to obtain the third feature map.
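The pooling step of the output module (claim 4) can be illustrated with a minimal average-pooling routine; the 2×2 window is an assumption, as the claim does not fix a window size:

```python
def avg_pool2d(grid, k=2):
    # non-overlapping k x k average pooling over a 2D grid of numbers;
    # assumes height and width are divisible by k
    h, w = len(grid), len(grid[0])
    return [
        [sum(grid[i + di][j + dj] for di in range(k) for dj in range(k)) / (k * k)
         for j in range(0, w, k)]
        for i in range(0, h, k)
    ]

pooled = avg_pool2d([[1, 2], [3, 4]])  # [[2.5]]
```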
5. The method according to claim 1, wherein each short-time dense connection module comprises a plurality of operation blocks, the numbers of channels from the second operation block to the second-to-last operation block of each short-time dense connection module form a geometric progression, and the number of channels of the last operation block is the same as that of the second-to-last operation block.
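The channel layout inside one short-time dense connection module (claim 5) — a geometric progression from the second to the second-to-last operation block, with the last block repeating the second-to-last count — can be sketched as follows; the first-block count, progression base, and ratio are illustrative assumptions:

```python
def operation_block_channels(first, base, ratio, num_blocks):
    # blocks 2 .. (num_blocks - 1) form a geometric progression;
    # the last block repeats the second-to-last channel count (claim 5)
    inner = [base * ratio ** i for i in range(num_blocks - 2)]
    return [first] + inner + [inner[-1]]

channels = operation_block_channels(64, 32, 2, 6)  # [64, 32, 64, 128, 256, 256]
```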
6. An image semantic segmentation apparatus, comprising:
an image input module, configured to input an image to be processed into an image recognition model; the image recognition model comprising: a short-time dense connection network layer and a decoding network layer, the short-time dense connection network layer comprising: a convolution module, a plurality of short-time dense connection layers connected in series, and an output module, wherein each short-time dense connection layer comprises a plurality of short-time dense connection modules;
a first feature map acquisition module, configured to call the convolution module to process the image to be processed to obtain a first feature map corresponding to the image to be processed;
a second feature map acquisition module, configured to call the short-time dense connection layer to process the first feature map to obtain a second feature map corresponding to the first feature map;
a third feature map acquisition module, configured to call the output module to process the second feature map to obtain a third feature map corresponding to the second feature map;
a fourth feature map acquisition module, configured to call the decoding network layer to up-sample the third feature map and map it to the segmentation classes, obtaining a fourth feature map whose number of channels equals the number of segmentation classes;
and a semantic segmentation result determining module, configured to determine a semantic segmentation result corresponding to the image to be processed according to the fourth feature map.
7. The apparatus of claim 6, wherein the convolution module comprises: a first convolution unit and a second convolution unit,
the first feature map acquisition module includes:
an intermediate feature map acquisition sub-module, configured to call the first convolution unit to perform double down-sampling on the image to be processed through a convolution kernel of a set size, to obtain an intermediate feature map having a first number of channels;
a first feature map acquisition sub-module, configured to call the second convolution unit to perform double down-sampling on the intermediate feature map through a convolution kernel of a set size, to obtain a first feature map having a second number of channels;
wherein the second number of channels is 2 times the first number of channels.
8. The apparatus of claim 6, wherein the plurality of short-time dense connection modules in each short-time dense connection layer comprise: a first short-time dense connection module and at least one second short-time dense connection module,
wherein the second feature map acquisition module comprises:
a first initial feature map acquisition sub-module, configured to call a first short-time dense connection module in a first short-time dense connection layer of the plurality of short-time dense connection layers to perform double down-sampling on the first feature map, to obtain a first initial feature map having a third number of channels, the third number of channels being 2 times the second number of channels;
a first processed feature map acquisition sub-module, configured to call the at least one second short-time dense connection module in the first short-time dense connection layer to perform short-time dense connection processing on the first initial feature map, to obtain a first processed feature map having the third number of channels;
an nth processed feature map acquisition sub-module, configured to call an nth short-time dense connection layer, other than the first short-time dense connection layer, of the plurality of short-time dense connection layers to perform double down-sampling and short-time dense connection processing on the feature map output by the (n-1)th short-time dense connection layer, to obtain an nth processed feature map;
a second feature map acquisition sub-module, configured to acquire the processed feature map output by the last short-time dense connection layer of the plurality of short-time dense connection layers and take the processed feature map as the second feature map;
wherein n is a positive integer greater than or equal to 2, and the number of channels of the nth processed feature map is 2 times the number of channels of the (n-1)th processed feature map output by the (n-1)th short-time dense connection layer.
9. The apparatus of claim 6, wherein the output module comprises: a convolutional layer, an average pooling layer and a fully convolutional network layer,
the third feature map acquisition module includes:
a sampled feature map acquisition sub-module, configured to call the convolutional layer to down-sample the second feature map to obtain a sampled feature map;
a pooled feature map acquisition sub-module, configured to call the average pooling layer to perform pooling on the sampled feature map to obtain a pooled feature map;
and a third feature map acquisition sub-module, configured to call the fully convolutional network layer to perform convolution on the pooled feature map to obtain the third feature map.
10. The apparatus of claim 6, wherein each short-time dense connection module comprises a plurality of operation blocks, the numbers of channels from the second operation block to the second-to-last operation block of each short-time dense connection module form a geometric progression, and the number of channels of the last operation block is the same as that of the second-to-last operation block.
11. An electronic device, comprising:
a processor, a memory and a computer program stored on the memory and executable on the processor, the processor implementing the image semantic segmentation method according to any one of claims 1 to 5 when executing the program.
12. A readable storage medium, wherein instructions in the storage medium, when executed by a processor of an electronic device, enable the electronic device to perform the image semantic segmentation method of any one of claims 1 to 5.
CN202110033687.9A 2021-01-11 2021-01-11 Image semantic segmentation method and device Pending CN112686267A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110033687.9A CN112686267A (en) 2021-01-11 2021-01-11 Image semantic segmentation method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110033687.9A CN112686267A (en) 2021-01-11 2021-01-11 Image semantic segmentation method and device

Publications (1)

Publication Number Publication Date
CN112686267A true CN112686267A (en) 2021-04-20

Family

ID=75457490

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110033687.9A Pending CN112686267A (en) 2021-01-11 2021-01-11 Image semantic segmentation method and device

Country Status (1)

Country Link
CN (1) CN112686267A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115331015A (en) * 2022-08-12 2022-11-11 广州紫为云科技有限公司 Attention mechanism-based selective convolution method, device and medium


Similar Documents

Publication Publication Date Title
CN108876792B (en) Semantic segmentation method, device and system and storage medium
CN106934397B (en) Image processing method and device and electronic equipment
CN110084274B (en) Real-time image semantic segmentation method and system, readable storage medium and terminal
CN108010538B (en) Audio data processing method and device and computing equipment
CN109816671B (en) Target detection method, device and storage medium
CN111028242A (en) Automatic tumor segmentation system and method and electronic equipment
CN111583097A (en) Image processing method, image processing device, electronic equipment and computer readable storage medium
CN111914654B (en) Text layout analysis method, device, equipment and medium
CN113807361B (en) Neural network, target detection method, neural network training method and related products
CN116309648A (en) Medical image segmentation model construction method based on multi-attention fusion
CN114332133A (en) New coronary pneumonia CT image infected area segmentation method and system based on improved CE-Net
CN117078930A (en) Medical image segmentation method based on boundary sensing and attention mechanism
CN114913325B (en) Semantic segmentation method, semantic segmentation device and computer program product
CN112598673A (en) Panorama segmentation method, device, electronic equipment and computer readable medium
CN112686267A (en) Image semantic segmentation method and device
CN113344827B (en) Image denoising method, image denoising network operation unit and device
Chan et al. Asymmetric cascade fusion network for building extraction
CN111027670B (en) Feature map processing method and device, electronic equipment and storage medium
CN115082371B (en) Image fusion method and device, mobile terminal equipment and readable storage medium
Fan et al. EGFNet: Efficient guided feature fusion network for skin cancer lesion segmentation
CN111931793B (en) Method and system for extracting saliency target
CN112488115B (en) Semantic segmentation method based on two-stream architecture
CN115578261A (en) Image processing method, deep learning model training method and device
CN115620017A (en) Image feature extraction method, device, equipment and storage medium
CN115272370A (en) Image segmentation method and device, storage medium and electronic equipment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination