CN116563526A - Image semantic segmentation method and device

Image semantic segmentation method and device

Info

Publication number
CN116563526A
CN116563526A
Authority
CN
China
Prior art keywords
feature map
image
segmented
feature
determining
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210092854.1A
Other languages
Chinese (zh)
Inventor
张雷 (Zhang Lei)
马泽国 (Ma Zeguo)
冯玉敏 (Feng Yumin)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Jingdong Century Trading Co Ltd
Beijing Wodong Tianjun Information Technology Co Ltd
Original Assignee
Beijing Jingdong Century Trading Co Ltd
Beijing Wodong Tianjun Information Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Jingdong Century Trading Co Ltd, Beijing Wodong Tianjun Information Technology Co Ltd filed Critical Beijing Jingdong Century Trading Co Ltd
Priority to CN202210092854.1A
Publication of CN116563526A
Legal status: Pending

Landscapes

  • Image Analysis (AREA)

Abstract

The invention discloses an image semantic segmentation method and device, and relates to the field of computer technology. One embodiment of the method comprises the following steps: acquiring an image to be segmented; determining a result feature map of the image to be segmented; determining a plurality of scale feature maps of the image to be segmented, and generating a fusion feature map of the image to be segmented using the plurality of scale feature maps; obtaining an enhanced feature map of the image to be segmented from the result feature map and the fusion feature map; and obtaining an image semantic segmentation result of the image to be segmented using the enhanced feature map. This embodiment can perform semantic segmentation on an image more accurately.

Description

Image semantic segmentation method and device
Technical Field
The present invention relates to the field of computer technologies, and in particular, to a method and apparatus for image semantic segmentation.
Background
Image semantic segmentation refers to dividing the different objects in an image into different regions, where the regions do not overlap and each has its own features or internal similarity. Before the emergence of deep learning, image segmentation generally relied on conventional methods such as edge-detection-based, threshold-based, and region-based segmentation. With the arrival of the big-data era and the development of deep learning, deep-learning-based image semantic segmentation has become mainstream, but existing deep-learning-based methods often suffer from low accuracy.
Disclosure of Invention
In view of the above, embodiments of the present invention provide an image semantic segmentation method and device that can perform semantic segmentation on an image more accurately.
In a first aspect, an embodiment of the present invention provides an image semantic segmentation method, including:
acquiring an image to be segmented;
determining a result feature map of the image to be segmented, wherein the result feature map is used for representing semantic information of the image to be segmented;
determining a plurality of scale feature maps of the image to be segmented, and generating a fusion feature map of the image to be segmented using the plurality of scale feature maps;
obtaining an enhanced feature map of the image to be segmented according to the result feature map and the fusion feature map;
and obtaining an image semantic segmentation result of the image to be segmented by utilizing the enhanced feature map.
Optionally, determining the result feature map of the image to be segmented includes:
determining an output feature map of the image to be segmented by using a backbone network;
determining a plurality of target atrous convolution kernels;
for each of the target atrous convolution kernels: performing convolution on the output feature map with the target atrous convolution kernel to obtain an atrous feature map;
and generating a result feature map of the image to be segmented from the plurality of atrous feature maps.
Optionally, generating a result feature map of the image to be segmented from the plurality of atrous feature maps includes:
pooling the output feature map to obtain a pooled feature map of the image to be segmented;
and concatenating the plurality of atrous feature maps and the pooled feature map to generate the result feature map of the image to be segmented.
Optionally, generating a fusion feature map of the image to be segmented using the plurality of scale feature maps includes:
sorting the plurality of scale feature maps in ascending order of resolution;
determining the scale feature map with the lowest resolution among the plurality of scale feature maps as the current feature map;
determining the next feature map after the current feature map from the sorted plurality of scale feature maps;
upsampling the current feature map so that the resolution of the processed current feature map equals the resolution of the next feature map;
merging the processed current feature map with the next feature map, and determining the merged result as the output feature map of the next feature map;
and generating a fusion feature map of the image to be segmented from the output feature map of the next feature map.
Optionally, generating a fusion feature map of the image to be segmented from the output feature map of the next feature map includes:
determining the next feature map as the current feature map;
determining the next feature map after the current feature map from the sorted plurality of scale feature maps;
upsampling the output feature map of the current feature map so that the resolution of the processed output feature map equals the resolution of the next feature map;
merging the processed output feature map with the next feature map, and determining the merged result as the output feature map of the next feature map;
and so on, until the output feature map of the highest-resolution scale feature map among the plurality of scale feature maps is obtained;
and determining the output feature map of the highest-resolution scale feature map as the fusion feature map of the image to be segmented.
Optionally, obtaining the enhanced feature map of the image to be segmented from the result feature map and the fusion feature map includes:
downsampling the result feature map;
and concatenating the downsampled result feature map with the fusion feature map to generate the enhanced feature map of the image to be segmented.
Optionally, obtaining the image semantic segmentation result of the image to be segmented using the enhanced feature map includes:
inputting the enhanced feature map into an attention mechanism module to obtain an attention feature map of the image to be segmented, wherein the attention mechanism module comprises a channel attention module and/or a spatial attention module, the channel attention module being used to compress the enhanced feature map along the spatial dimensions, and the spatial attention module being used to compress the enhanced feature map along the channel dimension;
and obtaining an image semantic segmentation result of the image to be segmented by using the attention feature map.
In a second aspect, an embodiment of the present invention provides an image semantic segmentation apparatus, including:
the image acquisition module is used for acquiring an image to be segmented;
the first determining module is used for determining a result feature map of the image to be segmented, wherein the result feature map is used for representing semantic information of the image to be segmented;
the second determining module is used for determining a plurality of scale feature maps of the image to be segmented and generating a fusion feature map of the image to be segmented using the plurality of scale feature maps;
the enhancement module is used for obtaining an enhancement feature map of the image to be segmented according to the result feature map and the fusion feature map;
and the semantic segmentation module is used for obtaining an image semantic segmentation result of the image to be segmented by utilizing the enhanced feature map.
In a third aspect, an embodiment of the present invention provides an electronic device, including:
one or more processors;
storage means for storing one or more programs,
the one or more programs, when executed by the one or more processors, cause the one or more processors to implement the methods of any of the embodiments described above.
In a fourth aspect, embodiments of the present invention provide a computer readable medium having stored thereon a computer program which, when executed by a processor, implements a method as described in any of the above embodiments.
One embodiment of the above invention has the following advantages or benefits: in prior deep-learning-based image semantic segmentation methods, the feature information of each encoding stage is output directly to the decoding stage, ignoring the influence of the fused feature information that the image generates in the shallow and deep convolution stages.
In the solution of the embodiment of the invention, a fusion feature map of the image to be segmented is generated from multiple scale feature maps of that image. These scale feature maps may include feature information from both the shallow and deep convolution stages, so the fusion feature map contains the fused feature information generated by the image in those stages. An enhanced feature map is then generated from the result feature map and the fusion feature map of the image to be segmented. Because the enhanced feature map contains this fused shallow-and-deep feature information, it can be used to perform semantic segmentation on the image more accurately.
Further effects of the above optional implementations are described below in connection with the embodiments.
Drawings
The drawings are included to provide a better understanding of the invention and are not to be construed as unduly limiting the invention. Wherein:
FIG. 1 is a flow chart of an image semantic segmentation method according to an embodiment of the present invention;
FIG. 2 is a flow chart of another image semantic segmentation method according to an embodiment of the present invention;
FIG. 3 is a schematic diagram of an improved atrous spatial pyramid pooling module provided by one embodiment of the present invention;
FIG. 4 is a flow chart of another image semantic segmentation method according to an embodiment of the present invention;
FIG. 5 is a schematic diagram of a feature pyramid module provided by one embodiment of the present invention;
FIG. 6 is a schematic diagram of an image semantic segmentation model according to an embodiment of the present invention;
FIG. 7 is a schematic structural diagram of an image semantic segmentation device according to an embodiment of the present invention;
FIG. 8 is a schematic diagram of a computer system suitable for use in implementing an embodiment of the invention.
Detailed Description
Exemplary embodiments of the present invention will now be described with reference to the accompanying drawings, in which various details of the embodiments of the present invention are included to facilitate understanding, and are to be considered merely exemplary. Accordingly, those of ordinary skill in the art will recognize that various changes and modifications of the embodiments described herein can be made without departing from the scope and spirit of the invention. Also, descriptions of well-known functions and constructions are omitted in the following description for clarity and conciseness.
Existing image semantic segmentation techniques are generally based on an encoder-decoder structure. The encoding stage extracts the semantic information of the image; the decoding stage restores the image's detail information. However, in the encoding stage, existing network models only output the feature information of each encoding stage directly to the decoding stage, neglecting the influence of the fused feature information that the backbone network generates in the shallow and deep convolution stages, and this fused information is not applied in the decoding stage. Moreover, for lack of post-processing in the decoding stage, spatial detail information is not well recovered. Complex scenes are therefore considerably harder to segment.
Based on the above, the invention provides an image semantic segmentation method. FIG. 1 is a flow chart of an image semantic segmentation method according to an embodiment of the present invention. As shown in FIG. 1, the method includes:
step 101: and acquiring an image to be segmented.
Step 102: and determining a result feature map of the image to be segmented.
The result feature map is used to represent the semantic information of the image to be segmented. The image to be segmented can be input into a network such as AlexNet, VGG, ResNet, or DenseNet to generate its result feature map.
Step 103: determine a plurality of scale feature maps of the image to be segmented, and generate a fusion feature map of the image to be segmented using the plurality of scale feature maps.
The scale feature maps are feature maps that represent the semantic information of the image to be segmented, and different scale feature maps correspond to different resolutions. A backbone network can be used to obtain the multiple scale feature maps. For example, inputting the image to be segmented into a ResNet-101 network yields its 1/4-, 1/8-, and 1/16-scale feature maps, where the scale corresponds to the resolution of the image: the 1/4-scale feature map has 1/4 the resolution of the image to be segmented, the 1/8-scale feature map has 1/8, and the 1/16-scale feature map has 1/16.
The plurality of scale feature maps can include feature information of the image to be segmented from both the shallow and deep convolution stages. Fusing this multi-scale information into the fusion feature map makes the multi-scale fusion more refined.
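As an illustration (not from the patent), a minimal sketch of extracting the 1/4-, 1/8-, and 1/16-scale feature maps from a ResNet-101 backbone using torchvision; the chosen layers and output names are assumptions:

```python
import torch
from torchvision.models import resnet101
from torchvision.models.feature_extraction import create_feature_extractor

backbone = resnet101(weights=None)
# In a standard ResNet, layer1/layer2/layer3 outputs sit at 1/4, 1/8 and
# 1/16 of the input resolution respectively.
extractor = create_feature_extractor(
    backbone, return_nodes={"layer1": "s4", "layer2": "s8", "layer3": "s16"})

image = torch.randn(1, 3, 512, 512)   # stand-in for the image to be segmented
feats = extractor(image)
for name, f in feats.items():
    print(name, tuple(f.shape))       # s4: 128x128, s8: 64x64, s16: 32x32
```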
Step 104: obtain an enhanced feature map of the image to be segmented from the result feature map and the fusion feature map.
There are various ways to obtain the enhanced feature map of the image to be segmented from the result feature map and the fusion feature map. For example, the result feature map and the fusion feature map may be input into a specified model to generate the enhanced feature map. Alternatively, the two feature maps may be upsampled or downsampled to the same size, concatenated, and then output.
Step 105: obtain an image semantic segmentation result of the image to be segmented using the enhanced feature map.
In the solution of the embodiment of the invention, a fusion feature map of the image to be segmented is generated from multiple scale feature maps of that image. These scale feature maps may include feature information from both the shallow and deep convolution stages, so the fusion feature map contains the fused feature information generated by the image in those stages. An enhanced feature map is then generated from the result feature map and the fusion feature map of the image to be segmented. Because the enhanced feature map contains this fused shallow-and-deep feature information, it can be used to perform semantic segmentation on the image more accurately.
FIG. 2 is a flow chart of another image semantic segmentation method according to an embodiment of the present invention. As shown in FIG. 2, the method includes:
step 201: and acquiring an image to be segmented.
Step 202: and determining an output characteristic diagram of the image to be segmented by using the backbone network.
The backbone network may be a deep learning model that performs semantic segmentation on the image, for example a DenseNet, VGG, or ResNet network.
Step 203: determine a plurality of target atrous convolution kernels.
Multiple target atrous convolution kernels may be configured as needed, each with its own dilation rate. The dilation rates can be set according to requirements, for example 4, 8, 12, and 16.
Step 204: for each target atrous convolution kernel, convolve the output feature map with that kernel to obtain an atrous feature map.
Target atrous convolution kernels of several different scales are thus provided in the system. Kernels with low dilation rates help extract information from low-resolution feature maps, so more complex scenes in the image can be handled.
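For reference (standard convolution arithmetic, not stated in the patent): a k x k atrous kernel with dilation rate r has an effective size of k + (k - 1)(r - 1), so a 3x3 kernel at rates 4, 8, 12 and 16 covers 9x9, 17x17, 25x25 and 33x33 input regions respectively, which is why lower rates suit objects that occupy few pixels in low-resolution feature maps.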
Step 205: generate a result feature map of the image to be segmented from the plurality of atrous feature maps.
Step 206: determine a plurality of scale feature maps of the image to be segmented, and generate a fusion feature map of the image to be segmented using the plurality of scale feature maps.
There are various methods for generating the fusion feature map of the image to be segmented. For example, the plurality of scale feature maps may be input into a specified model to generate the fusion feature map. Alternatively, each scale feature map may be upsampled to the same size by bilinear interpolation, and the resulting feature maps concatenated and output.
Step 207: obtain an enhanced feature map of the image to be segmented from the result feature map and the fusion feature map.
There are various methods for generating the enhanced feature map of the image to be segmented. For example, the result feature map and the fusion feature map of the image to be segmented may be input into a specified model to generate the enhanced feature map. Alternatively, the result feature map may be downsampled and then concatenated with the fusion feature map to generate the enhanced feature map. More generally, the two feature maps may be upsampled or downsampled to the same size, concatenated, and then output.
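To illustrate this step, a minimal sketch that resizes the result feature map to the fusion feature map's spatial resolution and concatenates the two along the channel dimension; the function name and the use of bilinear resizing are assumptions:

```python
import torch
import torch.nn.functional as F

def enhance(result_map: torch.Tensor, fused_map: torch.Tensor) -> torch.Tensor:
    # Resize the result feature map to the fusion feature map's resolution
    # (a downsampling when the result map is larger).
    resized = F.interpolate(result_map, size=fused_map.shape[-2:],
                            mode="bilinear", align_corners=False)
    # Concatenate along the channel dimension to form the enhanced feature map.
    return torch.cat([resized, fused_map], dim=1)
```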
Step 208: obtain an image semantic segmentation result of the image to be segmented using the enhanced feature map.
In the embodiment of the invention, low-resolution feature map information in the image to be segmented is extracted using multiple target atrous convolution kernels, so that the finally generated enhanced feature map contains both the shallow convolutional layer features and the deep convolutional layer features of the image to be segmented. This improves the accuracy of the image semantic segmentation result and allows the method of the embodiment to handle more complex scenes in the image.
In one embodiment of the present invention, generating a result feature map of the image to be segmented from the plurality of atrous feature maps includes: pooling the output feature map to obtain a pooled feature map of the image to be segmented; and concatenating the plurality of atrous feature maps with the pooled feature map to generate the result feature map of the image to be segmented.
FIG. 3 is a schematic diagram of an improved atrous spatial pyramid pooling module according to an embodiment of the present invention. As shown in FIG. 3, the feature map of the input image to be segmented, obtained from the backbone network ResNet before entering the improved ASPP (Atrous Spatial Pyramid Pooling) module, is at 1/16 scale with 2048 channels. Inside the improved ASPP module, a 1x1 convolution, four 3x3 convolution kernels with dilation rates of 4, 8, 12, and 16, and an average pooling layer produce six feature maps at 1/16 scale with 256 channels each; these are then concatenated along the channel dimension to obtain the feature map produced by the ASPP module. The method provided by the embodiment of the invention can thus better fuse the low-resolution and high-resolution features of the image to be segmented.
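A minimal PyTorch sketch of the improved ASPP module as described above: a 1x1 convolution, four 3x3 atrous convolutions with dilation rates 4, 8, 12 and 16, and an average-pooling branch, each mapping the 2048-channel input to 256 channels before channel-wise concatenation. This is an illustrative reconstruction, not the patent's own code:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ImprovedASPP(nn.Module):
    def __init__(self, in_ch: int = 2048, out_ch: int = 256,
                 rates=(4, 8, 12, 16)):
        super().__init__()
        self.branches = nn.ModuleList([nn.Conv2d(in_ch, out_ch, 1)])
        for r in rates:
            # padding == dilation keeps the spatial size unchanged
            self.branches.append(
                nn.Conv2d(in_ch, out_ch, 3, padding=r, dilation=r))
        self.pool = nn.Sequential(
            nn.AdaptiveAvgPool2d(1), nn.Conv2d(in_ch, out_ch, 1))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        outs = [branch(x) for branch in self.branches]
        pooled = F.interpolate(self.pool(x), size=x.shape[-2:],
                               mode="bilinear", align_corners=False)
        outs.append(pooled)            # six maps of 256 channels each
        return torch.cat(outs, dim=1)  # concatenate in the channel dimension
```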
FIG. 4 is a flow chart of yet another image semantic segmentation method according to an embodiment of the present invention. As shown in FIG. 4, the method includes:
step 401: and acquiring an image to be segmented.
Step 402: and determining a result feature map of the image to be segmented.
Step 403: a plurality of scale feature maps of the image to be segmented is determined.
The scale feature maps are feature maps that represent the semantic information of the image to be segmented, different scale feature maps correspond to different resolutions, and a backbone network can be used to obtain the multiple scale feature maps of the image to be segmented.
Step 404: sort the plurality of scale feature maps in ascending order of resolution.
Step 405: determine the scale feature map with the lowest resolution among the plurality of scale feature maps as the current feature map.
Step 406: determine the next feature map after the current feature map from the sorted plurality of scale feature maps.
Step 407: upsample the output feature map of the current feature map so that the resolution of the processed feature map equals the resolution of the next feature map.
Step 408: merge the processed feature map with the next feature map, and determine the merged result as the output feature map of the next feature map.
Step 409: among the sorted plurality of scale feature maps, determine whether any feature map follows the next feature map.
If so, perform step 410. If not, the next feature map is the largest-scale feature map among the plurality of scale feature maps, and step 411 is performed.
Step 410: determine the next feature map as the current feature map.
Then re-execute step 406.
Step 411: determine the output feature map of the highest-resolution scale feature map as the fusion feature map of the image to be segmented.
Step 412: obtain an image semantic segmentation result of the image to be segmented using the enhanced feature map.
FIG. 5 is a schematic diagram of a feature pyramid module according to an embodiment of the present invention. As shown in FIG. 5, an embodiment of the present invention adds an optimized feature pyramid module (FPN) that serves as a fusion branch for the features generated at each stage of the backbone network. Unlike a standard FPN, higher-layer features are not passed all the way down the top-down path; instead, each layer is upsampled 2x, merged with the adjacent layer, and passed to the subsequent module through a 3x3 convolution. This preliminarily fuses the multi-scale information, makes the fusion of multi-scale information more refined, and allows semantic segmentation of the image to be performed more accurately.
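A minimal sketch of this top-down fusion branch, assuming the stage feature maps have already been projected to a common channel count and that the merge is element-wise addition (the patent says "combined" without specifying the operation):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TopDownFusion(nn.Module):
    def __init__(self, channels: int = 256, num_levels: int = 3):
        super().__init__()
        # one 3x3 conv per merge step, passing features to the next module
        self.smooth = nn.ModuleList(
            nn.Conv2d(channels, channels, 3, padding=1)
            for _ in range(num_levels - 1))

    def forward(self, maps):
        # maps are ordered from lowest to highest resolution
        out = maps[0]
        for conv, nxt in zip(self.smooth, maps[1:]):
            out = F.interpolate(out, scale_factor=2, mode="bilinear",
                                align_corners=False)  # 2x upsample
            out = conv(out + nxt)  # merge with the adjacent layer, then 3x3 conv
        return out                 # fusion feature map at the highest resolution
```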
In one embodiment of the present invention, obtaining the image semantic segmentation result of the image to be segmented using the enhanced feature map includes: inputting the enhanced feature map into an attention mechanism module to obtain an attention feature map of the image to be segmented, where the attention mechanism module comprises a channel attention module and/or a spatial attention module; the channel attention module compresses the enhanced feature map along the spatial dimensions, and the spatial attention module compresses it along the channel dimension. The image semantic segmentation result of the image to be segmented is then obtained using the attention feature map. A spatial attention module and a channel attention module are added in the decoding stage to better recover spatial detail information and accelerate model convergence.
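The patent names the two attention modules but not their internals; the following is a minimal CBAM-style sketch (an assumption, not the patent's design) in which channel attention squeezes the spatial dimensions into per-channel weights and spatial attention squeezes the channel dimension into a per-pixel weight map:

```python
import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    def __init__(self, ch: int, reduction: int = 16):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),               # compress the spatial dims
            nn.Conv2d(ch, ch // reduction, 1), nn.ReLU(),
            nn.Conv2d(ch // reduction, ch, 1), nn.Sigmoid())

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x * self.mlp(x)                     # reweight each channel

class SpatialAttention(nn.Module):
    def __init__(self, kernel: int = 7):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(2, 1, kernel, padding=kernel // 2), nn.Sigmoid())

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # compress the channel dimension to mean and max maps
        stats = torch.cat([x.mean(1, keepdim=True),
                           x.max(1, keepdim=True).values], dim=1)
        return x * self.conv(stats)                # reweight each position
```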
To facilitate understanding of the method of the embodiments of the present invention, and based on the concept of the image semantic segmentation method described above, an embodiment of the present invention also provides an image semantic segmentation model. The model designs and implements an encoder network based on DeepLabv3, optimizing and improving the DeepLabv3 network structure. First, the DeepLabv3 network uses ResNet as its backbone, and in the encoder stage the atrous spatial pyramid pooling module it uses is not very effective at extracting features of objects in low-resolution feature maps. Therefore, in addition to trying several ResNet backbones of different depths, the embodiment of the invention fine-tunes the spatial pyramid module, mainly adding atrous convolution blocks with low dilation rates, so that features of objects in low-resolution feature maps can be extracted effectively. In deep-learning-based image semantic segmentation networks, both shallow and deep features are important to the final result; however, the DeepLabv3 network does not fuse the shallow convolutional layer features with the deep convolutional layer features.
The image semantic segmentation model also designs and implements a decoder network based on an attention mechanism. Another adjustment in the embodiment of the invention is to add a feature pyramid module in the decoder stage, which performs fusion processing on the optimized backbone network's shallow and deep features. Finally, two attention modules, a spatial attention module and a channel attention module, are added in the decoding stage to better recover spatial detail information and accelerate model convergence. Experiments on the urban scene dataset Cityscapes demonstrate the effectiveness of the method: the image semantic segmentation model achieves better segmentation results on the test set.
FIG. 6 is a schematic structural diagram of an image semantic segmentation model according to an embodiment of the present invention. The model is an encoder-decoder model whose backbone network is ResNet-101. As shown in FIG. 6, in the encoding stage, the embodiment of the invention adds convolution kernels at more scales to the improved ASPP module; kernels with low dilation rates help extract low-resolution feature map information, so more complex scenes in the image can be handled. FIG. 3 shows the feature map of the input image before entering the improved ASPP module, obtained from the backbone network ResNet, at 1/16 scale with 2048 channels. Inside the improved ASPP module, a 1x1 convolution, four 3x3 convolution kernels with dilation rates of 4, 8, 12, and 16, and an average pooling layer produce six feature maps at 1/16 scale with 256 channels each, which are concatenated along the channel dimension to obtain the feature map produced by the ASPP module.
In the decoder stage, the embodiment of the invention adds an optimized feature pyramid module (FPN) that serves as a fusion branch for the features generated at each stage of the backbone network. Unlike a standard FPN, higher-layer features are not passed all the way down the top-down path; instead, each layer is upsampled 2x, merged with the adjacent layer, and passed to the subsequent module through a 3x3 convolution, as shown in FIG. 5. This preliminarily fuses the multi-scale information and makes the fusion of multi-scale information more refined.
Then, two attention modules, a channel attention module (CAM) and a spatial attention module (SAM), are applied to better recover the spatial detail information and accelerate model convergence. Finally, the results generated by the channel attention module and the spatial attention module are merged, and the merged output is upsampled 4x to obtain the final result. In summary, an end-to-end segmentation of the image is completed.
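To show how the decoding stage described above might fit together, a hedged sketch that reuses the attention classes from the previous sketch: the CAM and SAM outputs are merged and upsampled 4x to per-pixel class logits. The additive merge and the 1x1 classifier are illustrative assumptions:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DecoderHead(nn.Module):
    def __init__(self, ch: int, num_classes: int):
        super().__init__()
        self.cam = ChannelAttention(ch)   # defined in the sketch above
        self.sam = SpatialAttention()
        self.classifier = nn.Conv2d(ch, num_classes, 1)

    def forward(self, enhanced: torch.Tensor) -> torch.Tensor:
        merged = self.cam(enhanced) + self.sam(enhanced)  # merge the two results
        logits = self.classifier(merged)
        # upsample 4x to recover the input resolution
        return F.interpolate(logits, scale_factor=4, mode="bilinear",
                             align_corners=False)
```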
Taking the urban scene image dataset Cityscapes as a basis, and addressing problems such as low segmentation accuracy caused by complex image scenes, the embodiment of the invention draws on previous strong deep-learning-based image semantic segmentation models to provide an improved image semantic segmentation method based on the DeepLabv3+ network. The method can segment images with complex scenes, such as urban scene images, and improves segmentation accuracy to a certain extent.
FIG. 7 is a schematic structural diagram of an image semantic segmentation apparatus according to an embodiment of the present invention. As shown in FIG. 7, the apparatus includes:
an image acquisition module 701, configured to acquire an image to be segmented;
a first determining module 702, configured to determine a result feature map of the image to be segmented, where the result feature map is used to characterize semantic information of the image to be segmented;
a second determining module 703, configured to determine a plurality of scale feature maps of the image to be segmented, and generate a fusion feature map of the image to be segmented using the plurality of scale feature maps;
the enhancement module 704 is configured to obtain an enhancement feature map of the image to be segmented according to the result feature map and the fusion feature map;
and the semantic segmentation module 705 is configured to obtain an image semantic segmentation result of the image to be segmented by using the enhanced feature map.
Optionally, the first determining module 702 is specifically configured to:
determining an output feature map of the image to be segmented by using a backbone network;
determining a plurality of target atrous convolution kernels;
for each of the target atrous convolution kernels: performing convolution on the output feature map with the target atrous convolution kernel to obtain an atrous feature map;
and generating a result feature map of the image to be segmented from the plurality of atrous feature maps.
Optionally, the first determining module 702 is specifically configured to:
pooling the output feature map to obtain a pooled feature map of the image to be segmented;
and concatenating the plurality of atrous feature maps and the pooled feature map to generate the result feature map of the image to be segmented.
Optionally, the second determining module 703 is specifically configured to:
sorting the plurality of scale feature maps in ascending order of resolution;
determining the scale feature map with the lowest resolution among the plurality of scale feature maps as the current feature map;
determining the next feature map after the current feature map from the sorted plurality of scale feature maps;
upsampling the current feature map so that the resolution of the processed current feature map equals the resolution of the next feature map;
merging the processed current feature map with the next feature map, and determining the merged result as the output feature map of the next feature map;
and generating a fusion feature map of the image to be segmented from the output feature map of the next feature map.
Optionally, the second determining module 703 is specifically configured to:
determining the next feature map as the current feature map;
determining the next feature map after the current feature map from the sorted plurality of scale feature maps;
upsampling the output feature map of the current feature map so that the resolution of the processed output feature map equals the resolution of the next feature map;
merging the processed output feature map with the next feature map, and determining the merged result as the output feature map of the next feature map;
and so on, until the output feature map of the highest-resolution scale feature map among the plurality of scale feature maps is obtained;
and determining the output feature map of the highest-resolution scale feature map as the fusion feature map of the image to be segmented.
Optionally, the enhancing module 704 is specifically configured to:
downsampling the result feature map;
and concatenating the downsampled result feature map with the fusion feature map to generate the enhanced feature map of the image to be segmented.
Optionally, the semantic segmentation module 705 is specifically configured to:
inputting the enhanced feature map into an attention mechanism module to obtain an attention feature map of the image to be segmented, wherein the attention mechanism module comprises a channel attention module and/or a spatial attention module, the channel attention module being used to compress the enhanced feature map along the spatial dimensions, and the spatial attention module being used to compress the enhanced feature map along the channel dimension;
and obtaining an image semantic segmentation result of the image to be segmented by using the attention feature map.
The embodiment of the invention provides electronic equipment, which comprises:
one or more processors;
storage means for storing one or more programs,
when the one or more programs are executed by the one or more processors, the one or more processors are caused to implement the methods of any of the embodiments described above.
Referring now to FIG. 8, a schematic diagram of a computer system 800 suitable for implementing an embodiment of the present invention is illustrated. The terminal device shown in FIG. 8 is only an example and should not impose any limitation on the functions and scope of use of the embodiments of the present invention.
As shown in fig. 8, the computer system 800 includes a Central Processing Unit (CPU) 801 that can perform various appropriate actions and processes according to a program stored in a Read Only Memory (ROM) 802 or a program loaded from a storage section 808 into a Random Access Memory (RAM) 803. In the RAM 803, various programs and data required for the operation of the system 800 are also stored. The CPU 801, ROM 802, and RAM 803 are connected to each other by a bus 804. An input/output (I/O) interface 805 is also connected to the bus 804.
The following components are connected to the I/O interface 805: an input portion 806 including a keyboard, mouse, etc.; an output portion 807 including a display such as a Cathode Ray Tube (CRT), a Liquid Crystal Display (LCD), and a speaker; a storage section 808 including a hard disk or the like; and a communication section 809 including a network interface card such as a LAN card, a modem, or the like. The communication section 809 performs communication processing via a network such as the internet. The drive 810 is also connected to the I/O interface 805 as needed. A removable medium 811 such as a magnetic disk, an optical disk, a magneto-optical disk, a semiconductor memory, or the like is mounted on the drive 810 as needed so that a computer program read out therefrom is mounted into the storage section 808 as needed.
In particular, according to embodiments of the present disclosure, the processes described above with reference to flowcharts may be implemented as computer software programs. For example, embodiments of the present disclosure include a computer program product comprising a computer program embodied on a computer readable medium, the computer program comprising program code for performing the method shown in the flow chart. In such an embodiment, the computer program may be downloaded and installed from a network via the communication section 809, and/or installed from the removable medium 811. The above-described functions defined in the system of the present invention are performed when the computer program is executed by the Central Processing Unit (CPU) 801.
The computer readable medium shown in the present invention may be a computer readable signal medium or a computer readable storage medium, or any combination of the two. The computer readable storage medium can be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or a combination of any of the foregoing. More specific examples of the computer-readable storage medium may include, but are not limited to: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the context of this document, a computer readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. In the present invention, however, the computer-readable signal medium may include a data signal propagated in baseband or as part of a carrier wave, with the computer-readable program code embodied therein. Such a propagated data signal may take any of a variety of forms, including, but not limited to, electro-magnetic, optical, or any suitable combination of the foregoing. A computer readable signal medium may also be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device. Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to: wireless, wire, fiber optic cable, RF, etc., or any suitable combination of the foregoing.
The flowcharts and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present invention. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams or flowchart illustration, and combinations of blocks in the block diagrams or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
The modules involved in the embodiments of the present invention may be implemented in software or in hardware. The described modules may also be provided in a processor, for example: an image acquisition module, a first determining module, a second determining module, an enhancement module, and a semantic segmentation module. The names of these modules do not in any way limit the modules themselves; for example, the image acquisition module may also be described as a "module for acquiring an image to be segmented".
As another aspect, the present invention also provides a computer-readable medium, which may be contained in the apparatus described in the above embodiments, or may exist alone without being incorporated into the apparatus. The computer-readable medium carries one or more programs which, when executed by a device, cause the device to:
acquire an image to be segmented;
determine a result feature map of the image to be segmented;
determine a plurality of scale feature maps of the image to be segmented, and generate a fusion feature map of the image to be segmented using the plurality of scale feature maps;
obtain an enhanced feature map of the image to be segmented from the result feature map and the fusion feature map;
and obtain an image semantic segmentation result of the image to be segmented using the enhanced feature map.
According to the technical solution provided by the embodiment of the invention, a fusion feature map of the image to be segmented is generated from multiple scale feature maps of that image. These scale feature maps may include feature information from both the shallow and deep convolution stages, so the fusion feature map contains the fused feature information generated by the image in those stages. An enhanced feature map is then generated from the result feature map and the fusion feature map of the image to be segmented. Because the enhanced feature map contains this fused shallow-and-deep feature information, it can be used to perform semantic segmentation on the image more accurately.
The above embodiments do not limit the scope of the present invention. It will be apparent to those skilled in the art that various modifications, combinations, sub-combinations and alternatives can occur depending upon design requirements and other factors. Any modifications, equivalent substitutions and improvements made within the spirit and principles of the present invention should be included in the scope of the present invention.

Claims (10)

1. An image semantic segmentation method, comprising:
acquiring an image to be segmented;
determining a result feature map of the image to be segmented, wherein the result feature map is used for representing semantic information of the image to be segmented;
determining a plurality of scale feature maps of the image to be segmented, and generating a fusion feature map of the image to be segmented using the plurality of scale feature maps;
obtaining an enhanced feature map of the image to be segmented according to the result feature map and the fusion feature map;
and obtaining an image semantic segmentation result of the image to be segmented by utilizing the enhanced feature map.
2. The method of claim 1, wherein determining the result feature map of the image to be segmented comprises:
determining an output feature map of the image to be segmented by using a backbone network;
determining a plurality of target atrous convolution kernels;
for each of the target atrous convolution kernels: performing convolution on the output feature map with the target atrous convolution kernel to obtain an atrous feature map;
and generating a result feature map of the image to be segmented from the plurality of atrous feature maps.
3. The method according to claim 2, wherein generating a result feature map of the image to be segmented from the plurality of atrous feature maps comprises:
pooling the output feature map to obtain a pooled feature map of the image to be segmented;
and concatenating the plurality of atrous feature maps and the pooled feature map to generate the result feature map of the image to be segmented.
4. The method of claim 1, wherein generating the fusion feature map of the image to be segmented using the plurality of scale feature maps comprises:
sorting the plurality of scale feature maps in ascending order of resolution;
determining the scale feature map with the lowest resolution among the plurality of scale feature maps as the current feature map;
determining the next feature map after the current feature map from the sorted plurality of scale feature maps;
upsampling the current feature map so that the resolution of the processed current feature map equals the resolution of the next feature map;
merging the processed current feature map with the next feature map, and determining the merged result as the output feature map of the next feature map;
and generating a fusion feature map of the image to be segmented from the output feature map of the next feature map.
5. The method of claim 4, wherein generating the fusion feature map of the image to be segmented from the output feature map of the next feature map comprises:
determining the next feature map as the current feature map;
determining the next feature map after the current feature map from the sorted plurality of scale feature maps;
upsampling the output feature map of the current feature map so that the resolution of the processed output feature map equals the resolution of the next feature map;
merging the processed output feature map with the next feature map, and determining the merged result as the output feature map of the next feature map;
and so on, until the output feature map of the highest-resolution scale feature map among the plurality of scale feature maps is obtained;
and determining the output feature map of the highest-resolution scale feature map as the fusion feature map of the image to be segmented.
6. The method according to claim 1, wherein obtaining the enhanced feature map of the image to be segmented from the result feature map and the fusion feature map comprises:
downsampling the result feature map;
and concatenating the downsampled result feature map with the fusion feature map to generate the enhanced feature map of the image to be segmented.
7. The method according to claim 1, wherein obtaining the image semantic segmentation result of the image to be segmented using the enhanced feature map comprises:
inputting the enhanced feature map into an attention mechanism module to obtain an attention feature map of the image to be segmented, wherein the attention mechanism module comprises a channel attention module and/or a spatial attention module, the channel attention module being used to compress the enhanced feature map along the spatial dimensions, and the spatial attention module being used to compress the enhanced feature map along the channel dimension;
and obtaining an image semantic segmentation result of the image to be segmented by using the attention feature map.
8. An image semantic segmentation apparatus, comprising:
the image acquisition module is used for acquiring an image to be segmented;
the first determining module is used for determining a result feature map of the image to be segmented, wherein the result feature map is used for representing semantic information of the image to be segmented;
the second determining module is used for determining a plurality of scale feature maps of the image to be segmented and generating a fusion feature map of the image to be segmented using the plurality of scale feature maps;
the enhancement module is used for obtaining an enhancement feature map of the image to be segmented according to the result feature map and the fusion feature map;
and the semantic segmentation module is used for obtaining an image semantic segmentation result of the image to be segmented by utilizing the enhanced feature map.
9. An electronic device, comprising:
one or more processors;
storage means for storing one or more programs,
the one or more programs, when executed by the one or more processors, cause the one or more processors to implement the method of any of claims 1-7.
10. A computer readable medium, on which a computer program is stored, characterized in that the program, when being executed by a processor, implements the method according to any of claims 1-7.
CN202210092854.1A 2022-01-26 2022-01-26 Image semantic segmentation method and device Pending CN116563526A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210092854.1A CN116563526A (en) 2022-01-26 2022-01-26 Image semantic segmentation method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210092854.1A CN116563526A (en) 2022-01-26 2022-01-26 Image semantic segmentation method and device

Publications (1)

Publication Number Publication Date
CN116563526A true CN116563526A (en) 2023-08-08

Family

ID=87484872

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210092854.1A Pending CN116563526A (en) 2022-01-26 2022-01-26 Image semantic segmentation method and device

Country Status (1)

Country Link
CN (1) CN116563526A (en)


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination