CN113591859A - Image segmentation method, apparatus, device and medium - Google Patents

Image segmentation method, apparatus, device and medium

Info

Publication number
CN113591859A
Authority
CN
China
Prior art keywords
dimension
image
feature
multiple scales
segmented
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202110701799.7A
Other languages
Chinese (zh)
Inventor
周争光
姚聪
王鹏
陈坤鹏
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Kuangshi Technology Co Ltd
Beijing Megvii Technology Co Ltd
Original Assignee
Beijing Kuangshi Technology Co Ltd
Beijing Megvii Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Kuangshi Technology Co Ltd, Beijing Megvii Technology Co Ltd filed Critical Beijing Kuangshi Technology Co Ltd
Priority to CN202110701799.7A priority Critical patent/CN113591859A/en
Publication of CN113591859A publication Critical patent/CN113591859A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/25 Fusion techniques
    • G06F 18/253 Fusion techniques of extracted features
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/08 Learning methods

Abstract

The embodiments of the invention provide an image segmentation method, apparatus, device and medium, belonging to the technical field of image processing and aiming to improve the accuracy of image segmentation. The method comprises the following steps: obtaining a feature map of an image to be segmented; performing feature extraction at multiple scales on the feature map of the image to be segmented to obtain feature maps of multiple scales; processing the feature maps of multiple scales according to a target dimension, and determining the weight values of the feature maps of multiple scales in the target dimension; fusing the feature maps of multiple scales according to their weight values in the target dimension to obtain a fused feature map; and segmenting the image to be segmented according to the fused feature map to obtain the category to which each pixel point included in the image to be segmented belongs.

Description

Image segmentation method, apparatus, device and medium
Technical Field
The present invention relates to the field of image processing technologies, and in particular, to an image segmentation method, an image segmentation apparatus, an image segmentation device, and an image segmentation medium.
Background
In recent years, technical research based on artificial intelligence, such as computer vision, deep learning, machine learning, image processing, and image recognition, has developed rapidly. Artificial Intelligence (AI) is an emerging science and technology that studies and develops theories, methods, techniques and application systems for simulating and extending human intelligence. Artificial intelligence is a comprehensive discipline involving a wide range of technologies, such as chips, big data, cloud computing, the Internet of Things, distributed storage, deep learning, machine learning and neural networks. Computer vision, an important branch of artificial intelligence that lets machines recognize the world, generally includes technologies such as face recognition, liveness detection, fingerprint recognition and anti-counterfeiting verification, biometric recognition, face detection, pedestrian detection, object detection, pedestrian recognition, image processing, image recognition, image semantic understanding, image retrieval, character recognition, video processing, video content recognition, behavior recognition, three-dimensional reconstruction, virtual reality, augmented reality, simultaneous localization and mapping (SLAM), computational photography, and robot navigation and positioning. With the research and progress of artificial intelligence technology, the technology has been applied in many fields, such as security, city management, traffic management, building management, park management, face-based access, face-based attendance, logistics management, warehouse management, robots, intelligent marketing, computational photography, mobile-phone imaging, cloud services, smart homes, wearable devices, unmanned driving, automatic driving, smart medical care, face payment, face unlocking, fingerprint unlocking, identity verification, smart screens, smart televisions, cameras, mobile internet, webcast, beauty filters, medical beauty, intelligent temperature measurement, and the like.
Among these, semantic segmentation is commonly used in image recognition and image processing, and is an important task in many practical applications such as automatic driving, medical image analysis and video retrieval. Semantic segmentation is essentially a classification task, i.e., identifying the categories of multiple targets in an image, and a key problem is how to effectively obtain multi-scale high-level semantic information from the image. In the related art, high-level semantic information of different scales is generally fused so that it can be used effectively, but the information of different scales is simply concatenated along the channel dimension; this cannot sufficiently mine and fuse the high-level semantic information of different scales, so the accuracy of semantic segmentation is low.
Disclosure of Invention
In view of the above problems, an image segmentation method, apparatus, device and medium according to embodiments of the present invention are proposed to overcome or at least partially solve the above problems.
In order to solve the above problem, a first aspect of the present invention discloses an image segmentation method, including:
obtaining a feature map of an image to be segmented;
performing feature extraction at multiple scales on the feature map of the image to be segmented to obtain feature maps of multiple scales;
processing the feature maps of multiple scales according to a target dimension, and determining the weight values of the feature maps of multiple scales in the target dimension;
fusing the feature maps of multiple scales according to their weight values in the target dimension to obtain a fused feature map;
and segmenting the image to be segmented according to the fused feature map to obtain the category to which each pixel point included in the image to be segmented belongs.
Optionally, the method further comprises:
carrying out global feature extraction on the feature map of the image to be segmented to obtain a global feature map;
segmenting the image to be segmented according to the fused feature map to obtain the category to which each pixel point included in the image to be segmented belongs includes:
segmenting the image to be segmented according to the fused feature map, the global feature map and the feature map of the image to be segmented to obtain the category to which each pixel point included in the image to be segmented belongs.
Optionally, segmenting the image to be segmented according to the fused feature map, the global feature map and the feature map of the image to be segmented to obtain the category to which each pixel point included in the image to be segmented belongs includes:
splicing the fused feature map, the global feature map and the feature map of the image to be segmented along the channel dimension to obtain a spliced feature map;
and performing convolution processing on the spliced feature map to obtain the category to which each pixel point included in the image to be segmented belongs.
Optionally, processing the feature maps of multiple scales according to a target dimension and determining the weight values of the feature maps of multiple scales in the target dimension includes:
fusing the feature values of the same target dimension in the feature maps of multiple scales to obtain a three-dimensional tensor of the target dimension;
and obtaining the weight values of the feature maps of multiple scales in the target dimension according to the three-dimensional tensor of the target dimension.
Optionally, the target dimension is the channel dimension; fusing the feature values of the same target dimension in the feature maps of multiple scales to obtain a three-dimensional tensor of the target dimension includes:
adding the feature values of the same channel dimension in the feature maps of multiple scales to obtain a three-dimensional tensor of the channel dimension;
and obtaining the weight values of the feature maps of multiple scales in the target dimension according to the three-dimensional tensor of the target dimension includes:
sequentially inputting the three-dimensional tensor of the channel dimension into a global average pooling layer and a first fully connected layer to obtain a one-dimensional tensor of the channel dimension;
inputting the one-dimensional tensor of the channel dimension into the fully connected layer corresponding to each of the multiple scales to obtain a one-dimensional tensor for each of the multiple scales in the channel dimension;
and normalizing the one-dimensional tensors of the multiple scales in the channel dimension to obtain the weight values of the feature maps of the multiple scales in the channel dimension.
Optionally, the target dimension is the spatial dimension; fusing the feature values of the same target dimension in the feature maps of multiple scales to obtain a three-dimensional tensor of the target dimension includes:
splicing the feature values of the same spatial dimension in the feature maps of multiple scales to obtain a three-dimensional tensor of the spatial dimension;
and obtaining the weight values of the feature maps of multiple scales in the target dimension according to the three-dimensional tensor of the target dimension includes:
inputting the three-dimensional tensor of the spatial dimension into a convolution layer to obtain two-dimensional tensors of the multiple scales in the spatial dimension;
and normalizing the two-dimensional tensors of the multiple scales in the spatial dimension to obtain the weight values of the feature maps of the multiple scales in the spatial dimension.
Optionally, the target dimensions include the spatial dimension and the channel dimension;
fusing the feature maps of multiple scales according to their weight values in the target dimension to obtain a fused feature map includes:
fusing the feature maps of multiple scales according to their weight values in the channel dimension to obtain a fused feature map of the channel dimension;
fusing the feature maps of multiple scales according to their weight values in the spatial dimension to obtain a fused feature map of the spatial dimension;
and adding or splicing the feature values at corresponding positions of the fused feature map of the channel dimension and the fused feature map of the spatial dimension to obtain the fused feature map.
Optionally, performing feature extraction at multiple scales on the feature map of the image to be segmented to obtain feature maps of multiple scales includes:
inputting the feature map of the image to be segmented into a plurality of dilated convolution layers with different dilation rates to obtain the feature maps of multiple scales.
In a second aspect of the embodiments of the present invention, there is provided an image segmentation apparatus, including:
a feature map obtaining module, used for obtaining a feature map of an image to be segmented;
a multi-scale feature extraction module, used for performing feature extraction at multiple scales on the feature map of the image to be segmented to obtain feature maps of multiple scales;
an attention module, used for processing the feature maps of multiple scales according to a target dimension and determining the weight values of the feature maps of multiple scales in the target dimension;
a fusion module, used for fusing the feature maps of multiple scales according to their weight values in the target dimension to obtain a fused feature map;
and a segmentation module, used for segmenting the image to be segmented according to the fused feature map to obtain the category to which each pixel point included in the image to be segmented belongs.
In a third aspect of the embodiments of the present invention, an electronic device is further disclosed, including a memory, a processor and a computer program stored in the memory and executable on the processor, wherein the processor implements the image segmentation method described in the embodiments of the first aspect when executing the program.
In a fourth aspect of the embodiments of the present invention, a computer-readable storage medium is further disclosed, which stores a computer program for causing a processor to execute the image segmentation method according to the first aspect of the present invention.
The embodiment of the invention has the following advantages:
In the embodiments of the present invention, feature extraction at multiple scales can be performed on the feature map of the image to be segmented to obtain feature maps of multiple scales; the feature maps of multiple scales are then processed according to a target dimension, and their weight values in the target dimension are determined; next, the feature maps of multiple scales are fused according to these weight values to obtain a fused feature map; and finally the image to be segmented is segmented according to the fused feature map to obtain the category to which each pixel point in the image to be segmented belongs, thereby completing the semantic segmentation of the image to be segmented.
In this embodiment, the feature maps of multiple scales are processed according to the target dimension to obtain their weight values in that dimension, and the feature maps are fused according to these weight values. Fusion is thus driven by the importance of each scale in the target dimension, so that high-level semantic information of different scales is sufficiently mined and fused; the fused feature map can fully reflect the high-level semantic information of the image to be segmented, a more accurate semantic segmentation result can be obtained, and the accuracy of semantic segmentation is improved.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings needed in the description of the embodiments are briefly introduced below. Obviously, the drawings in the following description show only some embodiments of the present invention, and those skilled in the art can obtain other drawings from them without creative effort.
FIG. 1 is a flow chart of the steps of an image segmentation method in an embodiment of the invention;
FIG. 2 is a block diagram of an image segmentation model in an embodiment of the invention;
FIG. 3 is a schematic overall flow chart of processing feature maps of multiple scales according to the channel dimension in an embodiment of the invention;
FIG. 4 is a schematic diagram of the principle of processing feature maps of multiple scales according to the channel dimension in an embodiment of the invention;
FIG. 5 is a schematic overall flow chart of processing feature maps of multiple scales according to the spatial dimension in an embodiment of the invention;
FIG. 6 is a schematic diagram of the principle of processing feature maps of multiple scales according to the spatial dimension in an embodiment of the invention;
FIG. 7 is a flow chart of the steps of fusing feature maps of multiple scales in the spatial and channel dimensions in an embodiment of the invention;
FIG. 8 is a block diagram of an image segmentation apparatus in accordance with an embodiment of the present invention.
Detailed Description
In order to make the aforementioned objects, features and advantages of the present invention comprehensible, embodiments accompanied with figures are described in detail below to describe the technical solutions in the embodiments of the present invention clearly and completely. All other embodiments obtained by a person skilled in the art from the embodiments given herein without creative effort shall fall within the protection scope of the present invention.
High-level semantic information is the feature information obtained after many convolutions (feature extractions); its receptive field is large, and the extracted features become more and more abstract, which is helpful for classifying objects. In the related art, fusion of multi-scale high-level semantic information has been proposed to strengthen semantic information at different scales. However, the related art simply connects the high-level semantic information of different scales along the channel dimension, for example by splicing it in the channel dimension; this cannot sufficiently mine and fuse the high-level semantic information of different scales and is therefore not conducive to accurate semantic segmentation.
In view of this, the present application proposes the following technical concept: an attention mechanism is introduced in the spatial dimension and/or the channel dimension to model semantic features of different scales, weight values (importance) of the multi-scale high-level semantic information (feature maps) in the spatial and channel dimensions are obtained, feature fusion is performed according to these weight values, and semantic segmentation is performed based on the fused feature map. In this way, the feature maps of multiple scales are fused according to their importance in the spatial dimension and/or channel dimension, so that high-level semantic information of different scales is sufficiently mined and fused; the fused feature map can fully reflect the high-level semantics of the image to be segmented, a more accurate semantic segmentation result can be obtained, and the accuracy of semantic segmentation is improved.
Referring to fig. 1, a flowchart illustrating steps of an image segmentation method according to an embodiment of the present application is shown, and as shown in fig. 1, the method may specifically include the following steps:
step S101: and obtaining a characteristic map of the image to be segmented.
In this embodiment, feature extraction may be performed on the image to be segmented to obtain its feature map. Specifically, the dilated-convolution parameters of the last two stages of the backbone can be changed so that the output feature map is 1/8 the size of the image to be segmented, which reduces the computation of the subsequent feature fusion and improves image segmentation efficiency.
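As a non-authoritative illustration of this step, the following minimal PyTorch sketch assumes a ResNet-50 backbone (the description only names ResNet-50/ResNet-101 as examples) and uses torchvision's replace_stride_with_dilation option, which converts the strides of the last two stages into dilations so that the output feature map is 1/8 of the input size:

    import torch
    import torch.nn as nn
    from torchvision.models import resnet50

    # Replace the strides of the last two stages with dilations so the
    # backbone output is 1/8 of the input resolution (output stride 8).
    backbone = resnet50(weights=None,
                        replace_stride_with_dilation=[False, True, True])
    stem = nn.Sequential(
        backbone.conv1, backbone.bn1, backbone.relu, backbone.maxpool,
        backbone.layer1, backbone.layer2, backbone.layer3, backbone.layer4,
    )

    x = torch.randn(1, 3, 512, 512)   # image to be segmented
    feat = stem(x)
    print(feat.shape)                 # torch.Size([1, 2048, 64, 64]); 512/8 = 64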
Step S102: performing feature extraction at multiple scales on the feature map of the image to be segmented to obtain feature maps of multiple scales.
In this embodiment, feature extraction at multiple scales may be performed on the feature map of the image to be segmented; here, feature extraction at multiple scales may refer to performing convolutions with different dilation rates on the feature map, for example rates such as 1, 12, 24 and 36. The number of scales obtained may be three, four, or another number determined by actual requirements.
Performing feature extraction at multiple scales on the feature map of the image to be segmented allows the features of the image to be described from receptive fields of different sizes, so that high-level semantic information of different scales can be extracted; that is, the feature maps of multiple scales in the embodiments of the present application can reflect the high-level semantic information of the image to be segmented from different receptive fields.
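A minimal sketch of this multi-scale extraction, assuming parallel 3×3 dilated convolutions with rates 1, 12, 24 and 36 (module and tensor names are illustrative, not from the description):

    import torch
    import torch.nn as nn

    class MultiScaleExtractor(nn.Module):
        """Parallel dilated convolutions with rates 1/12/24/36."""
        def __init__(self, in_ch, out_ch, rates=(1, 12, 24, 36)):
            super().__init__()
            self.branches = nn.ModuleList(
                nn.Conv2d(in_ch, out_ch, kernel_size=3,
                          padding=r, dilation=r, bias=False)
                for r in rates
            )

        def forward(self, x):
            # Each branch sees a different receptive field but keeps H x W.
            return [branch(x) for branch in self.branches]

    feats = MultiScaleExtractor(2048, 256)(torch.randn(1, 2048, 64, 64))
    print([f.shape for f in feats])   # four tensors of shape (1, 256, 64, 64)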
Step S103: processing the feature maps of multiple scales according to the target dimension, and determining the weight values of the feature maps of multiple scales in the target dimension.
In this embodiment, the target dimension may include the spatial dimension and/or the channel dimension. Processing the feature maps of multiple scales according to the target dimension may refer to: transforming the features of the dimensions other than the target dimension so as to reflect the features of the feature maps of multiple scales in the target dimension, thereby obtaining the differences between the features of the feature maps of multiple scales in the target dimension. For example, taking the target dimension as the spatial dimension, the differences between the features at different spatial positions of the feature map of each scale can be reflected. These differences reflect the importance of each of the feature maps of multiple scales in the target dimension, that is, their weight values in the target dimension.
In a specific implementation, dimension transformation, feature addition, average pooling, fully connected processing and the like can be performed on the feature maps of multiple scales to obtain the differences between their features in the target dimension, i.e., feature maps of different scales correspond to different weight values in the target dimension. Specifically, an attention map of the feature maps of multiple scales in the target dimension can be obtained, and the values in the attention map can sufficiently reflect, from the perspective of the target dimension, the weight value of the feature map of each scale.
The weight value of the feature map of one scale in the target dimension may represent the importance of the feature map of the scale in the target dimension, or may be understood as the proportion of the feature map of the scale in the multi-scale fusion.
For example, suppose feature extraction is performed on the feature map of the image to be segmented with different dilation rates 1, 12, 24 and 36, giving feature maps of four different scales: feature map X1, feature map X2, feature map X3 and feature map X4. The feature maps X1, X2, X3 and X4 are then processed in the target dimension to obtain their respective weight values in the target dimension.
Since feature extraction at multiple scales describes the features of the image to be segmented from receptive fields of different sizes, obtaining the weight values of the feature maps of multiple scales in the target dimension amounts to obtaining the weight values of feature maps from receptive fields of different sizes. In this way, the different proportions occupied by the features mined from the image at different scales are obtained, which facilitates the subsequent fusion of the feature maps of multiple scales.
Step S104: fusing the feature maps of multiple scales according to their weight values in the target dimension to obtain a fused feature map.
In this embodiment, the feature values in the feature maps of multiple scales may be weighted and summed according to the weight values of the feature maps in the target dimension, to obtain the fused feature map. The fused feature map reflects the high-level semantic information obtained after the high-level semantic information of the image to be segmented at different scales is sufficiently mined and fused in the target dimension. Because the fused feature map can fully reflect the high-level semantics of the image to be segmented, a more accurate semantic segmentation result can be obtained, and the accuracy of semantic segmentation is improved.
Continuing the above example, after the weight values of feature maps X1, X2, X3 and X4 in the target dimension are obtained, the feature maps X1, X2, X3 and X4 may be weighted and summed according to these weight values to obtain the fused feature map.
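As a hedged illustration of this weighted summation (the weights below are random placeholders; in the method they come from the attention processing of step S103):

    import torch

    X = [torch.randn(1, 256, 64, 64) for _ in range(4)]   # X1..X4
    w = torch.softmax(torch.randn(4), dim=0)              # placeholder weights
    fused = sum(wi * xi for wi, xi in zip(w, X))          # weighted sum of scales
    print(fused.shape)                                    # (1, 256, 64, 64)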
Step S105: segmenting the image to be segmented according to the fused feature map to obtain the category to which each pixel point included in the image to be segmented belongs.
The semantic segmentation task is a pixel-level classification task that predicts the category to which each pixel of the input image belongs. After the fused feature map is obtained, because it sufficiently reflects the high-level semantic information of the image to be segmented, segmenting the image according to the fused feature map yields a more accurate image segmentation result; in particular, the category to which each pixel included in the image to be segmented belongs can be obtained.
In a specific implementation, different categories can be marked with different colors, so that in the output classification result, pixel points belonging to different categories in the image to be segmented have different color values, while pixel points belonging to the same category have the same color value.
With the technical solution of the embodiments of the present application, the feature maps of multiple scales are processed according to the target dimension to obtain their weight values in that dimension, and the feature maps are fused according to these weight values. Fusion thus follows the importance of each scale in the target dimension, high-level semantic information of different scales is sufficiently mined and fused, the fused feature map can fully reflect the high-level semantic information of the image to be segmented, a more accurate semantic segmentation result can be obtained, and the accuracy of semantic segmentation is improved.
Referring to fig. 2, a general framework diagram of an image segmentation model for performing image segmentation according to an embodiment of the present application is shown.
In the related art, image segmentation is generally performed using a CNN (convolutional neural network) as the image segmentation model; the CNN is a convolutional network model for extracting image features and may be, for example, ResNet-50 or ResNet-101. To implement the image segmentation method of the present application, unlike the related art, an attention module is added to a conventional image segmentation model, for example at the last feature transformation module. The image segmentation model with the added attention module is shown in fig. 2: the attention module shown by the dashed box in fig. 2 is the newly added module, and the remaining modules can be regarded as original modules of the model.
As shown in fig. 2, the image segmentation model includes a feature extraction module, an attention module and a global average pooling layer connected to the output of the feature extraction module, a fusion module connected to both the output of the attention module and the output of the global average pooling layer, and a convolution module connected to the output of the fusion module. The feature map output by the feature extraction module is fed to the attention module, the global average pooling layer and the fusion module respectively.
The attention module can be used to perform feature extraction at multiple scales on the feature map of the image to be segmented to obtain feature maps of multiple scales, process the feature maps of multiple scales according to the target dimension, determine their weight values in the target dimension, and fuse the feature maps of multiple scales according to these weight values to obtain a fused feature map.
The global average pooling layer may be configured to perform global pooling on the feature map output by the feature extraction module.
The fusion module is used to fuse the fused feature map output by the attention module, the global feature map output by the global average pooling layer, and the feature map output by the feature extraction module, so that the image to be segmented can be segmented based on the result of this fusion.
Specifically, the fused result may be input to a convolution layer, which may be a 1×1 convolution; the 1×1 convolution performs convolution processing on the fused result and outputs the category to which each pixel included in the image to be segmented belongs.
The image segmentation model may be obtained as follows: a preset model, whose structure is the same as that of the image segmentation model, is trained with a labeled training data set. The training data set consists of image pairs of the same size; each pair includes a three-channel color image and a single-channel image carrying labels, each category is represented by a label of a different color, and the labels in the single-channel image represent the true categories to which the pixel points belong. The image segmentation model is trained iteratively by gradient descent, and its loss function is generally the cross-entropy loss function.
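A minimal training-loop sketch consistent with this description (pixel-wise cross-entropy and gradient descent); the stand-in model, class count and hyper-parameters below are assumptions for illustration only:

    import torch
    import torch.nn as nn

    num_classes = 21
    # Stand-in for the full segmentation model described above.
    model = nn.Sequential(nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(),
                          nn.Conv2d(16, num_classes, 1))
    criterion = nn.CrossEntropyLoss()     # pixel-wise cross entropy
    optimizer = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9)

    # One iteration on a dummy image pair: a 3-channel color image and a
    # single-channel label map giving the true class of every pixel.
    image = torch.randn(2, 3, 128, 128)
    label = torch.randint(0, num_classes, (2, 128, 128))

    logits = model(image)                 # (N, num_classes, H, W)
    loss = criterion(logits, label)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()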
Next, an image segmentation method in an embodiment of the present application will be described with reference to an overall framework of an image segmentation model shown in fig. 2.
In this embodiment, when feature extraction at multiple scales is performed on the feature map of the image to be segmented to obtain feature maps of multiple scales, the feature map of the image to be segmented may be input into several dilated convolution layers with different dilation rates to obtain the feature maps of multiple scales.
A dilated (hole) convolution layer is formed by injecting holes into a standard convolution kernel so as to enlarge the receptive field. Compared with an ordinary convolution layer, a dilated convolution layer has one more hyper-parameter, called the dilation rate, which refers to the spacing between kernel elements. The receptive field can be expanded arbitrarily through dilated convolution, so feature maps of various scales can be obtained according to whatever high-level semantics are required.
Because the high-level semantic information is the feature information obtained after dilated convolution layers with different dilation rates, some detail information can be lost in practice, which is not conducive to accurate segmentation. To solve this problem and improve the accuracy of semantic segmentation, in this embodiment global feature extraction may be performed on the feature map of the image to be segmented to obtain a global feature map, which retains the detail information of the image to be segmented.
In a specific implementation, when global feature extraction is performed, the feature map of the image to be segmented may be input to a global average pooling layer for processing, where the global average pooling layer may be a global average pooling unit. In one example, the model may further include a 1×1 convolution unit connected in series to the output of the global average pooling layer to transform the dimension of the pooled feature map; the feature map of the image to be segmented then undergoes the global average pooling and the 1×1 convolution to give the global feature map. The 1×1 convolution unit can make the channel dimension of the final global feature map consistent with that of the fused feature map; of course, in some other embodiments, the 1×1 convolution unit can also make the channel dimension of the final global feature map match whatever dimension is actually required.
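A sketch of this global branch under stated assumptions: global average pooling followed by a 1×1 convolution that aligns the channel dimension with the fused feature map; broadcasting the pooled descriptor back to H × W (as in DeepLab-style image pooling) is an assumption here, since the description does not spell out this step:

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class GlobalFeature(nn.Module):
        def __init__(self, in_ch, out_ch):
            super().__init__()
            self.pool = nn.AdaptiveAvgPool2d(1)       # global average pooling
            self.proj = nn.Conv2d(in_ch, out_ch, 1)   # 1x1 conv: align channels

        def forward(self, x):
            g = self.proj(self.pool(x))               # (N, out_ch, 1, 1)
            # Broadcast the global descriptor back to the spatial size so it
            # can later be spliced with the other feature maps (assumption).
            return F.interpolate(g, size=x.shape[-2:], mode='nearest')

    g = GlobalFeature(2048, 256)(torch.randn(1, 2048, 64, 64))
    print(g.shape)    # (1, 256, 64, 64)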
Correspondingly, in step S105, that is, when the image to be segmented is segmented according to the fused feature map to obtain the category to which each pixel point included in the image to be segmented belongs, the category of each pixel point may be determined as follows:
the image to be segmented is segmented according to the fused feature map, the global feature map and the feature map of the image to be segmented, to obtain the category to which each pixel point included in the image to be segmented belongs.
In a specific implementation, the fused feature map can fully reflect the high-level semantics of the image to be segmented; the global feature map, obtained by global feature extraction from the feature map of the image to be segmented, carries the global information of the image; and the feature map of the image to be segmented, obtained by feature extraction from the image itself, can serve as a residual feature. Segmenting the image according to the fused feature map, the global feature map and the feature map of the image to be segmented can therefore be regarded as image segmentation that fuses the high-level semantic information, the global information and the original feature information of the image, which improves the accuracy of segmentation.
In a specific implementation, when the image to be segmented is segmented according to the fused feature map, the global feature map and the feature map of the image to be segmented to obtain the category to which each pixel included in the image to be segmented belongs, the following process may be performed:
First, the fused feature map, the global feature map and the feature map of the image to be segmented are spliced along the channel dimension to obtain a spliced feature map.
In this embodiment, the channel dimension generally refers to the number of channels of a feature map, and splicing feature maps along the channel dimension may refer to: connecting the fused feature map, the global feature map and the feature map of the image to be segmented end to end along the channel axis to obtain a spliced feature map, whose number of channels is the sum of the channel counts of the three maps. For example, if the fused feature map has 20 channels, the global feature map has 10 channels, and the feature map of the image to be segmented has 20 channels, the spliced feature map has 50 channels and may include all the features of the fused feature map, the global feature map and the feature map of the image to be segmented.
Second, convolution processing is performed on the spliced feature map to obtain the category to which each pixel point included in the image to be segmented belongs.
In this embodiment, since the spliced feature map includes all the features of the fused feature map, the global feature map and the feature map of the image to be segmented, each position on the spliced feature map can be understood as containing the features of all three maps at that position. The spliced feature map can therefore represent more detailed and more accurate high-level semantic information, and when convolution processing is performed on it, the spliced feature map can be input into one or two convolution layers to obtain the category to which each pixel point included in the image to be segmented belongs.
In this embodiment, the convolution processing is used to output the probability that each pixel point belongs to each category.
With this embodiment, the global feature map retains the detail information of the image to be segmented, the feature map of the image to be segmented also retains detail information under a small receptive field, and the spliced feature map includes all the features of the fused feature map, the global feature map and the feature map of the image to be segmented. The spliced feature map thus contains both the abstract high-level semantic information and the detail information of the image to be segmented, which is conducive to accurate segmentation and achieves higher segmentation accuracy.
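A minimal sketch of the splicing and convolution described above; the channel counts and the class count of 21 are illustrative assumptions:

    import torch
    import torch.nn as nn

    fused = torch.randn(1, 256, 64, 64)     # fused feature map
    global_ = torch.randn(1, 256, 64, 64)   # global feature map
    orig = torch.randn(1, 2048, 64, 64)     # feature map of the image

    # Splice along the channel dimension: channel counts add up (here 2560).
    spliced = torch.cat([fused, global_, orig], dim=1)
    head = nn.Conv2d(spliced.shape[1], 21, kernel_size=1)   # 1x1 convolution
    logits = head(spliced)                                  # (1, 21, 64, 64)
    classes = logits.argmax(dim=1)                          # per-pixel category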
In this embodiment, the target dimension may be the spatial dimension and/or the channel dimension; that is, the feature maps of multiple scales may be processed according to the spatial dimension or the channel dimension, or according to both.
Next, how to process the feature maps of multiple scales according to the target dimension and determine the weight values of the feature maps of multiple scales in the target dimension will be described.
First, when the feature maps of multiple scales are processed according to the target dimension to determine their weight values, the feature values of the same target dimension in the feature maps of multiple scales may be fused to obtain a three-dimensional tensor of the target dimension; the weight values of the feature maps of multiple scales in the target dimension are then obtained from this three-dimensional tensor.
In this embodiment, fusing the feature values of the same target dimension in the feature maps of multiple scales may refer to transforming the features of the dimensions other than the target dimension onto the target dimension; in a specific implementation, the feature values belonging to the same target dimension in the feature maps of multiple scales are added or spliced to obtain a three-dimensional tensor of the feature maps of multiple scales in the target dimension, and this tensor can reflect the feature distribution of the feature maps of multiple scales in the target dimension.
Then, the three-dimensional tensor of the target dimension can be processed to obtain the weight values of the feature maps of different scales in the target dimension. This is illustrated below for the channel dimension and the spatial dimension respectively:
the first mode is as follows: and processing the characteristic graphs of various scales according to the channel dimension.
Correspondingly, fig. 3 shows an overall flowchart of processing feature maps of multiple scales according to the channel dimension; as shown in fig. 3, the method specifically includes the following steps:
when feature values of the same target dimension in feature maps of multiple scales are fused to obtain a three-dimensional tensor of the target dimension, the following step S301 may be performed:
step S301: and adding the eigenvalues of the same channel dimension in the characteristic diagrams of multiple scales to obtain a three-dimensional tensor of the channel dimension.
In this embodiment, the channel dimension refers to the channel axis of a feature map; for example, each feature map slice corresponds to one channel. In practice, a channel can also be seen as the detection of a certain feature of the image: detecting different features uses different channels, e.g., detecting 3 features of an image gives feature maps of 3 channels. The number of channels may differ according to actual requirements; for example, if one scale has 20 feature map slices, the number of channels at that scale is 20, though in practice it could also be 40, 30, and so on. Adding the feature values of the same channel dimension in the feature maps of multiple scales may refer to: adding together the slices belonging to the same channel across the feature maps of the different scales, obtaining a three-dimensional tensor of the channel dimension whose three dimensions are H × W × C, i.e., height, width and channels.
As shown in fig. 3, the feature maps of multiple scales obtained by applying dilated convolutions with different rates to the feature map X are feature map X1, feature map X2, feature map X3 and feature map X4, where X1 = Conv(X, 1), X2 = Conv(X, 12), X3 = Conv(X, 24) and X4 = Conv(X, 36). The 4 feature maps are then stacked to obtain Xc = Concat(X1, X2, X3, X4), where Xc is a four-dimensional tensor of 4 × H × W × C containing the 4 three-dimensional tensors of H × W × C: X1, X2, X3 and X4. The 4 H × W × C tensors in Xc are then added along the channel dimension, i.e., the feature values of the same channel dimension in the 4 tensors are added, giving the H × W × C three-dimensional tensor Xs.
Fig. 4 is a schematic diagram of the principle of processing feature maps of multiple scales according to the channel dimension. As shown in fig. 4, taking 3 channels as an example, adding the features of feature map X1 on channel 1 to the features of feature map X2 on channel 1 actually means adding the pixel values of the two maps at each spatial position of channel 1; for instance, the feature values of X1 and X2 at the same spatial position S on channel 1 are added to give the summed feature value 5 at position S. Proceeding in this way yields the three-dimensional tensor Xs, which transforms the features in the spatial dimensions onto the channel dimension and realizes the fusion of spatial information on each channel.
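A tiny numeric illustration of this channel-wise addition (values chosen to match the 2 + 3 = 5 example at position S in fig. 4; only one channel is shown):

    import torch

    x1 = torch.tensor([[2., 1.], [0., 4.]])   # channel 1 of X1 (position S holds 2)
    x2 = torch.tensor([[3., 0.], [1., 1.]])   # channel 1 of X2 (position S holds 3)
    print(x1 + x2)   # position S becomes 5; the shape stays the same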
In one example, the feature maps of multiple scales may first be stacked, specifically end to end, and the stacked feature maps are then added along the channel dimension to obtain the three-dimensional tensor.
Adding the feature maps of multiple scales along the channel dimension transforms the features in the spatial dimensions onto the channel dimension, and the three-dimensional tensor Xs obtained after the addition still has the same number of channels. As shown in fig. 4, after the features of feature map X1 are added to those of feature map X2 channel by channel, the resulting three-dimensional tensor of the channel dimension is shown as feature map Xs in the figure. It can be seen that the features belonging to the same channel in feature maps of different scales are fused; the resulting three-dimensional tensor can be understood as uniformly transforming the features in the spatial dimensions of the feature maps of different scales onto the channel dimension, i.e., the features of one channel of the tensor fuse the features of every spatial position of the multi-scale feature maps on that channel.
When obtaining the weight values of the feature maps of the multiple scales in the target dimension according to the three-dimensional tensor of the target dimension, the following steps S302 to S304 may be performed:
step S302: and sequentially inputting the three-dimensional tensor of the channel dimension into the global average pooling layer and the first full-connection layer to obtain the one-dimensional tensor of the channel dimension.
In this embodiment, the three-dimensional tensor of the channel dimension fuses the spatial information of the feature maps of different scales but still contains features at different spatial positions. Therefore, to obtain the weights of the different channels, the three-dimensional tensor can be input sequentially into the global average pooling layer and the first fully connected layer to transform its dimensions, giving a one-dimensional tensor of the channel dimension.
As shown in fig. 3, Xs can be globally average-pooled to obtain a one-dimensional tensor F1 of length C, and F1 is then processed by the first fully connected layer FC to obtain a one-dimensional tensor F2 of length C/8.
The global average pooling layer averages each feature map slice into a single value, i.e., an H × W × C tensor becomes a 1 × 1 × C tensor, so the features of each whole slice in the spatial dimensions are integrated, and the resulting one-dimensional tensor can reflect the differences between channels.
As shown in fig. 4, after the three-dimensional tensor Xs passes through the global average pooling layer, the resulting one-dimensional tensor has length C, which can be understood as integrating the features of all spatial positions on each channel of Xs into one value; its values therefore represent the feature differences between channels, and its length equals the number of feature map slices, i.e., the number of channels. The first fully connected layer then compresses this tensor into F2.
Step S303: inputting the one-dimensional tensor of the channel dimension into the fully connected layer corresponding to each of the multiple scales to obtain a one-dimensional tensor for each of the multiple scales in the channel dimension.
In this embodiment, a corresponding fully connected layer may be preset for each scale of feature extraction; for example, a fully connected layer A is set for feature extraction at scale A, and a fully connected layer B for feature extraction at scale B. After the one-dimensional tensor of the channel dimension is obtained, it can be input into the fully connected layer corresponding to each scale, giving a one-dimensional tensor for each scale in the channel dimension; that is, the number of these one-dimensional tensors equals the number of feature maps of multiple scales, with a one-to-one correspondence between them.
As shown in fig. 3, after the one-dimensional tensor F2 of length C/8 is obtained, F2 may be processed by 4 fully connected layers with different parameters (FC1, FC2, FC3 and FC4), and the processing results are concatenated to obtain a two-dimensional tensor Fa of 4 × C; that is, Fa contains 4 one-dimensional tensors of length C: FC1(F2), FC2(F2), FC3(F2) and FC4(F2).
The one-dimensional tensor of the channel dimension reflects the feature differences between channels, but these are differences of the multi-scale feature maps as a whole and are not enough to reflect the differences of the feature map of each individual scale in the channel dimension; in other words, the one-dimensional tensor needs to be dispersed to obtain, for each scale, the differences of its feature map in the channel dimension. In a specific implementation, inputting the one-dimensional tensor of the channel dimension into the fully connected layer corresponding to each scale yields the one-dimensional tensors of the multiple scales in the channel dimension, and thus the differences of the feature map of each scale in the channel dimension.
Step S304: normalizing the one-dimensional tensors of the multiple scales in the channel dimension to obtain the weight values of the feature maps of the multiple scales in the channel dimension.
In this embodiment, normalizing the one-dimensional tensors of the multiple scales in the channel dimension may mean normalizing each value in the one-dimensional tensor of each scale to a value between 0 and 1, thereby obtaining the weight value of the feature map of each scale in the channel dimension, i.e., the weight of each channel at the different scales. Specifically, a sigmoid function may be used. As shown in fig. 3, this gives, for example, the weight of channel 1 at dilation rate 1, its weight at rate 12, its weight at rate 24, and its weight at rate 36.
The one-dimensional tensor of the feature map of each scale in the channel dimension is a tensor of 1 × C; normalizing this 1 × C tensor maps its values to between 0 and 1, which gives the weights of that scale's feature map on the different channels. For example, taking feature map X1: X1 is a three-dimensional tensor of H × W × C, and after its corresponding 1 × C one-dimensional tensor is normalized, the weights of feature map X1 on the different channels are obtained.
With this embodiment, when the feature maps of multiple scales are fused, each feature map is multiplied by its corresponding one-dimensional tensor in the channel dimension, and the resulting products are then added along the channel dimension to obtain the fused feature map.
As shown in fig. 3, the fused feature map Y is obtained by multiplying feature maps X1, X2, X3 and X4 by Sigmoid(FC1(F2)), Sigmoid(FC2(F2)), Sigmoid(FC3(F2)) and Sigmoid(FC4(F2)) respectively and adding the 4 products.
In this way, the feature differences of the feature map of each scale in the channel dimension can be obtained, i.e., the differences between feature maps of different scales are reflected from the channel dimension, giving the weights of the channels corresponding to each scale's feature map. The high-level semantic information of different scales is thereby sufficiently mined from the channel dimension, so the fused feature map can fully reflect the high-level semantics of the image to be segmented, a more accurate semantic segmentation result can be obtained, and the accuracy of semantic segmentation is improved.
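Putting steps S301 to S304 together, the following is a minimal sketch of the channel-dimension attention as described (class and variable names are illustrative; the reduction factor 8 follows the C/8 length of F2):

    import torch
    import torch.nn as nn

    class ChannelScaleAttention(nn.Module):
        """Fuse 4 same-shaped scale maps by per-channel attention weights."""
        def __init__(self, channels, num_scales=4, reduction=8):
            super().__init__()
            self.gap = nn.AdaptiveAvgPool2d(1)                     # Xs -> F1
            self.fc1 = nn.Linear(channels, channels // reduction)  # F1 -> F2 (C/8)
            self.fcs = nn.ModuleList(
                nn.Linear(channels // reduction, channels)         # FC1..FC4
                for _ in range(num_scales)
            )

        def forward(self, xs):                 # xs: list of 4 (N, C, H, W) maps
            s = torch.stack(xs, dim=0).sum(0)  # Xs: add feature values per channel
            f1 = self.gap(s).flatten(1)        # (N, C)
            f2 = self.fc1(f1)                  # (N, C/8)
            y = 0
            for x, fc in zip(xs, self.fcs):
                w = torch.sigmoid(fc(f2))[..., None, None]   # (N, C, 1, 1) weights
                y = y + w * x                  # weighted sum over scales
            return y

    xs = [torch.randn(2, 256, 64, 64) for _ in range(4)]
    print(ChannelScaleAttention(256)(xs).shape)   # (2, 256, 64, 64)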
The second mode: processing the feature maps of multiple scales according to the spatial dimension.
Correspondingly, fig. 5 shows a schematic flow chart of processing feature maps of multiple scales according to the spatial dimension. When the feature values of the same target dimension in the feature maps of multiple scales are fused to obtain a three-dimensional tensor of the target dimension, step S501 below may be performed:
step S501: and splicing the eigenvalues of the same space dimension in the characteristic diagrams of multiple scales to obtain a three-dimensional tensor of the space dimension.
In this embodiment, the spatial dimension generally refers to the features at different spatial positions, i.e., H × W is the spatial dimension. Since the feature maps of multiple scales are to be processed according to the spatial dimension, all the feature values at the same spatial position in the feature maps of multiple scales can be spliced end to end, so that the features of the channel dimension of the feature maps of different scales are transformed onto the spatial dimension, giving the three-dimensional tensor of the spatial dimension. In this tensor, all the feature values at one spatial position include the feature values of the different channels at that position.
As shown in fig. 5, the obtained feature maps are feature map X1, feature map X2, feature map X3 and feature map X4, which may be stacked to obtain Xc = Concat(X1, X2, X3, X4); Xc is a four-dimensional tensor of 4 × H × W × C. The four-dimensional tensor Xc may then be converted according to the spatial dimension, i.e., the feature values of the same spatial position in the 4 H × W × C tensors are spliced, giving the three-dimensional tensor Xs of H × W × 4C.
Fig. 6 is a schematic diagram of processing according to the spatial dimension. As shown in fig. 6, taking feature maps of two scales as an example, splicing the features of feature map X1 and feature map X2 in the spatial dimension means splicing all the features belonging to the same spatial position in X1 and X2. For example, all the features belonging to position S in X1 and X2 are spliced; these include the features of the several channels at S, e.g., with 3 channels, the three feature values (2,1,3) of channels 1, 2 and 3 at S in feature map X1 and the three feature values (3,2,1) of the three channels at S in feature map X2. Of course, fig. 6 is only an illustration; the number of channels may be 10, 20, and so on.
In an example, the feature maps of the multiple scales may first be spliced end to end, and the spliced result is then rearranged according to the spatial dimension to obtain the three-dimensional tensor of the spatial dimension.
Splicing the feature maps of the multiple scales according to the spatial dimension combines the channel-dimension features of the feature maps of different scales, so that the features at the same spatial position include the features of the different channels of each scale's feature map. As shown in fig. 6, after the features of feature map X1 in the spatial dimension are spliced with those of feature map X2, the obtained three-dimensional tensor of the spatial dimension is shown as feature map Xs in the figure; it can be seen that feature-value splicing of the feature maps of different scales in the spatial dimension is realized.
When obtaining the weight values of the feature maps of the multiple scales in the target dimension according to the three-dimensional tensor of the target dimension, the following steps S502 to S503 may be performed:
Step S502: inputting the three-dimensional tensor of the spatial dimension into a convolution layer to obtain two-dimensional tensors of the multiple scales in the spatial dimension.
In this embodiment, since all the feature values at the same spatial position in the three-dimensional tensor of the spatial dimension include the feature values of the multiple channels of each scale's feature map at that position, the tensor may be input into a convolution layer to obtain the two-dimensional tensors of the multiple scales in the spatial dimension; that is, the feature values of the multiple channels at the same spatial position are combined into one value. The convolution layer may be a 1 × 1 convolution; through the convolution, the feature values of each scale's feature map at the same position across its channels are fused, so that each two-dimensional tensor reflects the feature differences between spatial positions.
As shown in fig. 5, a 1 × 1 convolution may be performed on the three-dimensional tensor Xs of H × W × 4C to obtain a tensor Xa of 4 × H × W, that is, four two-dimensional tensors of H × W.
As shown in fig. 6, the three-dimensional tensor Xs is input into the convolution layer, and the two-dimensional tensors Xa1 and Xa2 of the two scales in the spatial dimension are obtained.
Step S503: normalizing the two-dimensional tensors of the multiple scales in the spatial dimension respectively, to obtain the weight values of the feature maps of the multiple scales in the spatial dimension.
In this example, normalizing the two-dimensional tensors of the multiple scales in the spatial dimension may be: normalizing the values in each scale's two-dimensional tensor to values between 0 and 1, so as to obtain the weight value of each scale's feature map in the spatial dimension, that is, the weight of the feature at each position under the different scales. As shown in fig. 5, for example, the weights of the position S under step size 1, step size 12, step size 24 and step size 36 are obtained.
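A rough sketch of steps S502 and S503 under the same assumptions (the sigmoid used for the 0-to-1 normalization is one illustrative choice; a softmax across the four scales would be another):

```python
import torch
import torch.nn as nn

N, C, H, W = 1, 256, 64, 64
xs = torch.randn(N, 4 * C, H, W)   # three-dimensional tensor Xs of H x W x 4C

# Step S502: a 1 x 1 convolution fuses the 4C channel values at each
# position into one value per scale, giving four H x W maps (Xa).
conv1x1 = nn.Conv2d(4 * C, 4, kernel_size=1)
xa = conv1x1(xs)                    # N x 4 x H x W

# Step S503: normalize each map to values between 0 and 1.
weights = torch.sigmoid(xa)         # per-position weight of each scale
```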
By adopting this embodiment, when fusing the feature maps of the multiple scales, each scale's feature map may be multiplied by its corresponding two-dimensional tensor in the spatial dimension to obtain the multiplied feature maps, and the multiplied feature maps of the multiple scales are then added feature-wise according to the spatial dimension to obtain the fusion feature map.
Multiplying a feature map by its corresponding two-dimensional tensor in the spatial dimension may be: multiplying the feature value of each channel belonging to the same spatial position in the feature map of that scale by the weight value at the same spatial position in the two-dimensional tensor.
As shown in fig. 5, the 4 three-dimensional tensors included in the feature Xc, namely X1, X2, X3 and X4, are respectively multiplied by the 4 two-dimensional tensors of H × W included in Xa, and the 4 multiplied results are added to obtain a three-dimensional tensor of H × W × C, which is the fusion feature map.
As shown in fig. 6, after the three-dimensional tensor Xs is input into the convolution layer, the two-dimensional tensors Xa1 and Xa2 of the two scales in the spatial dimension are obtained, wherein Xa1 corresponds to feature map X1 and Xa2 corresponds to feature map X2. Xa1 and Xa2 are normalized, Xa1 is multiplied by feature map X1, Xa2 is multiplied by feature map X2, and the multiplied results are added to obtain the fusion feature map.
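The multiply-and-add fusion itself might look as follows (a sketch; `feats` and `weights` stand in for the four scale feature maps and their normalized two-dimensional tensors):

```python
import torch

N, C, H, W = 1, 256, 64, 64
feats = [torch.randn(N, C, H, W) for _ in range(4)]   # X1..X4
weights = torch.rand(N, 4, H, W)                       # normalized spatial weights

# Scale each map position-wise by its weight map (broadcast over the
# C channels), then sum the four weighted maps into one H x W x C map.
fused = sum(weights[:, i:i + 1] * feats[i] for i in range(4))  # N x C x H x W
```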
By adopting this method, the feature differences of each scale's feature map at different spatial positions can be obtained, yielding weights for the different spatial positions. The high-level semantic information of the different scales is thereby fully mined from the spatial dimension, so that the fused feature map fully reflects the high-level semantics of the image to be segmented, a more accurate semantic segmentation result can be obtained, and the accuracy of semantic segmentation is improved.
The third mode is as follows: the feature maps of the multiple scales are processed according to both the spatial dimension and the channel dimension.
When processing the feature maps of the multiple scales according to the spatial dimension and the channel dimension, the feature maps may first be processed according to the channel dimension with reference to steps S301 to S304 to obtain the weight values of the feature maps of the multiple scales in the channel dimension, and then processed according to the spatial dimension with reference to steps S501 to S503 to obtain their weight values in the spatial dimension. The feature maps of the multiple scales are then fused according to their weight values in the spatial dimension and their weight values in the channel dimension.
Correspondingly, referring to fig. 7, which shows a flowchart of the steps of fusing the feature maps of multiple scales from the spatial dimension and the channel dimension, the method specifically includes the following steps:
Step S701: fusing the feature maps of the multiple scales according to their weight values in the channel dimension to obtain a fusion feature map of the channel dimension.
Specifically, each scale's feature map may be multiplied by its one-dimensional tensor in the channel dimension, that is, the feature values at the different spatial positions on the same channel are multiplied by that channel's weight value, to obtain the multiplied feature maps; the multiplied feature maps of the multiple scales are then added feature-wise according to the channel dimension to obtain the fusion feature map of the channel dimension.
Step S702: fusing the feature maps of the multiple scales according to their weight values in the spatial dimension to obtain a fusion feature map of the spatial dimension.
Specifically, each scale's feature map may be multiplied by its corresponding two-dimensional tensor in the spatial dimension, that is, the feature values belonging to the same spatial position on each channel of the feature map are multiplied by the weight value of that spatial position in the two-dimensional tensor, to obtain the multiplied feature maps; the multiplied feature maps of the multiple scales are then added feature-wise according to the spatial dimension to obtain the fusion feature map of the spatial dimension.
Step S703: adding or splicing the feature values at corresponding positions of the fusion feature map of the channel dimension and the fusion feature map of the spatial dimension to obtain the fusion feature map.
In this embodiment, the feature values belonging to the same spatial position on the same channel in the fusion feature map of the channel dimension and the fusion feature map of the spatial dimension may be added, or the two maps may be spliced, so as to obtain the fusion feature map.
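A sketch of step S703 showing both variants (shapes illustrative; the two fused maps are stand-ins for the outputs of steps S701 and S702):

```python
import torch

N, C, H, W = 1, 256, 64, 64
fused_channel = torch.randn(N, C, H, W)  # fusion feature map of the channel dimension
fused_spatial = torch.randn(N, C, H, W)  # fusion feature map of the spatial dimension

fused_by_add = fused_channel + fused_spatial                     # position-wise addition
fused_by_cat = torch.cat([fused_channel, fused_spatial], dim=1)  # splicing: N x 2C x H x W
```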
By adopting the technical solution of this embodiment, the feature differences of the feature maps of different scales at different spatial positions are obtained from the spatial dimension, and their feature differences on different channels are obtained from the channel dimension, so the importance of the feature maps of different scales is embodied comprehensively in both the spatial and the channel dimension. High-level semantic information of the different scales can thus be fully mined and fused from the spatial dimension and the channel dimension, a more accurate semantic segmentation result is obtained, and the accuracy of semantic segmentation is improved.
It should be noted that, for simplicity of description, the method embodiments are described as a series of acts or combination of acts, but those skilled in the art will recognize that the present invention is not limited by the illustrated order of acts, as some steps may occur in other orders or concurrently in accordance with the embodiments of the present invention. Further, those skilled in the art will appreciate that the embodiments described in the specification are presently preferred and that no particular act is required to implement the invention.
Referring to fig. 8, a block diagram of an image segmentation apparatus according to an embodiment of the present invention is shown, and as shown in fig. 8, the apparatus may specifically include the following modules:
a feature map obtaining module 801, configured to obtain a feature map of an image to be segmented;
a multi-scale feature extraction module 802, configured to perform feature extraction of multiple scales on the feature map of the image to be segmented to obtain feature maps of multiple scales;
the attention module 803 is configured to process the feature maps of the multiple scales according to a target dimension, and determine weight values of the feature maps of the multiple scales in the target dimension;
the fusion module 804 is configured to fuse the feature maps of multiple scales according to the weight values of the feature maps of multiple scales in the target dimension, so as to obtain a fusion feature map;
and a segmentation module 805, configured to segment the image to be segmented according to the fusion feature map, so as to obtain categories to which each pixel included in the image to be segmented belongs.
Optionally, the apparatus may further include the following modules:
the global pooling module is used for carrying out global feature extraction on the feature map of the image to be segmented to obtain a global feature map;
the segmentation module 805 may be specifically configured to segment the image to be segmented according to the fusion feature map, the global feature map, and the feature map of the image to be segmented, so as to obtain categories to which each pixel included in the image to be segmented belongs.
Optionally, the segmentation module 805 includes the following units:
the splicing unit is used for splicing the fusion feature map, the global feature map and the feature map of the image to be segmented according to the channel dimension to obtain a spliced feature map;
and the convolution unit is used for performing convolution processing on the splicing characteristic graph to obtain the category to which each pixel point included in the image to be segmented belongs.
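As a rough sketch of these two units (assuming the global feature map was pooled down to 1 × 1 and that a 1 × 1 convolution acts as the per-pixel classifier; `num_classes` and all sizes are illustrative):

```python
import torch
import torch.nn as nn

N, C, H, W, num_classes = 1, 256, 64, 64, 21
fused = torch.randn(N, C, H, W)        # fusion feature map
global_feat = torch.randn(N, C, 1, 1)  # global feature map (pooled)
base = torch.randn(N, C, H, W)         # feature map of the image to be segmented

# Splicing unit: broadcast the global map to H x W and concatenate on channels.
spliced = torch.cat([fused, global_feat.expand(-1, -1, H, W), base], dim=1)

# Convolution unit: per-pixel class scores, then the category of each pixel.
classifier = nn.Conv2d(3 * C, num_classes, kernel_size=1)
pred = classifier(spliced).argmax(dim=1)  # N x H x W category map
```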
Optionally, the attention module 803 may specifically include the following sub-modules:
the fusion submodule is used for fusing the feature values of the same target dimension in the feature maps of multiple scales to obtain a three-dimensional tensor of the target dimension;
and the determining submodule is used for obtaining the weight values of the feature maps of various scales in the target dimension according to the three-dimensional tensor of the target dimension.
Optionally, the target dimension is a channel dimension; the fusion submodule is specifically configured to add the feature values of the same channel dimension in the feature maps of multiple scales to obtain a three-dimensional tensor of the channel dimension; the determination submodule may specifically include the following units:
the first conversion unit is used for sequentially inputting the three-dimensional tensor of the channel dimension into the global average pooling layer and the first full-connection layer to obtain the one-dimensional tensor of the channel dimension;
the full-connection unit is used for inputting the one-dimensional tensor of the channel dimension into the full-connection layer corresponding to each of the multiple scales to obtain the one-dimensional tensor of each of the multiple scales in the channel dimension;
and the first normalization processing unit is used for performing normalization processing on the one-dimensional tensors of the multiple scales in the channel dimension respectively to obtain the weight values of the feature maps of the multiple scales in the channel dimension respectively.
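A sketch of this channel-dimension path (the hidden width, the ReLU after the first fully connected layer, and the sigmoid normalization are illustrative choices not fixed by the text):

```python
import torch
import torch.nn as nn

N, C, H, W, num_scales, hidden = 1, 256, 64, 64, 4, 64
feats = [torch.randn(N, C, H, W) for _ in range(num_scales)]

summed = torch.stack(feats, dim=0).sum(dim=0)  # add feature values per channel
squeezed = summed.mean(dim=(2, 3))             # global average pooling: N x C
fc1 = nn.Linear(C, hidden)                     # first fully connected layer
shared = torch.relu(fc1(squeezed))             # one-dimensional tensor of the channel dimension

# One fully connected layer per scale, normalized to per-channel weights.
fcs = nn.ModuleList(nn.Linear(hidden, C) for _ in range(num_scales))
channel_weights = [torch.sigmoid(fc(shared)) for fc in fcs]  # each: N x C
```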
Optionally, the target dimension is a spatial dimension; the fusion sub-module is specifically configured to splice the feature values of the same spatial dimension in the feature maps of multiple scales to obtain a three-dimensional tensor of the spatial dimension;
the determination submodule may specifically include the following units:
the second conversion unit is used for inputting the three-dimensional tensor of the spatial dimension into the convolution layer to obtain the two-dimensional tensors of the multiple scales in the spatial dimension;
and the second normalization processing unit is used for performing normalization processing on the two-dimensional tensors of the multiple scales in the space dimension respectively to obtain the weight values of the feature maps of the multiple scales in the space dimension respectively.
Optionally, the target dimensions include spatial dimensions and channel dimensions; the fusion module 804 may specifically include the following units:
the first fusion unit is used for fusing the feature maps of the multiple scales according to the weight values of the feature maps of the multiple scales in the channel dimension to obtain a fusion feature map of the channel dimension;
the second fusion unit is used for fusing the feature maps of the multiple scales according to the weight values of the feature maps of the multiple scales in the space dimension to obtain a fusion feature map of the space dimension;
and the fusion unit is used for adding or splicing the characteristic values corresponding to the positions on the fusion characteristic diagram of the channel dimension and the fusion characteristic diagram of the space dimension to obtain the fusion characteristic diagram.
Optionally, the multi-scale feature extraction module 802 may be specifically configured to input the feature map of the image to be segmented into a plurality of dilated (hole) convolution layers with different step lengths, so as to obtain the feature maps of multiple scales.
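A minimal sketch of such a multi-scale extractor (the dilation rates are illustrative; the text only requires that the step lengths of the dilated convolutions differ):

```python
import torch
import torch.nn as nn

C = 256
rates = [1, 2, 4, 8]  # illustrative "step lengths" (dilation rates)
branches = nn.ModuleList(
    nn.Conv2d(C, C, kernel_size=3, padding=r, dilation=r) for r in rates
)

x = torch.randn(1, C, 64, 64)               # feature map of the image to be segmented
feats = [branch(x) for branch in branches]  # one same-sized feature map per scale
```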
It should be noted that the device embodiments are similar to the method embodiments, so that the description is simple, and reference may be made to the method embodiments for relevant points.
Embodiments of the present invention further provide an electronic device, which may be configured to execute the image segmentation method and may include a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor implements the image segmentation method when executing the computer program.
Embodiments of the present invention further provide a computer-readable storage medium storing a computer program for causing a processor to execute the image segmentation method according to the embodiments of the present invention.
The embodiments in the present specification are described in a progressive manner, each embodiment focuses on differences from other embodiments, and the same and similar parts among the embodiments are referred to each other.
As will be appreciated by one skilled in the art, embodiments of the present invention may be provided as a method, apparatus, or computer program product. Accordingly, embodiments of the present invention may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, embodiments of the present invention may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
Embodiments of the present invention are described with reference to flowchart illustrations and/or block diagrams of methods, terminal devices (systems), and computer program products according to embodiments of the invention. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing terminal to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing terminal, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing terminal to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing terminal to cause a series of operational steps to be performed on the computer or other programmable terminal to produce a computer implemented process such that the instructions which execute on the computer or other programmable terminal provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
While preferred embodiments of the present invention have been described, additional variations and modifications of these embodiments may occur to those skilled in the art once they learn of the basic inventive concepts. Therefore, it is intended that the appended claims be interpreted as including preferred embodiments and all such alterations and modifications as fall within the scope of the embodiments of the invention.
Finally, it should also be noted that, herein, relational terms such as first and second, and the like may be used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or terminal that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or terminal. Without further limitation, an element defined by the phrase "comprising a ..." does not exclude the presence of other like elements in a process, method, article, or terminal that comprises the element.
The image segmentation method, apparatus, device and storage medium provided by the present invention have been described in detail above. Specific examples are used herein to explain the principles and embodiments of the present invention, and the description of these examples is only intended to help in understanding the method and its core idea. Meanwhile, for a person skilled in the art, there may be variations in the specific embodiments and the application scope according to the idea of the present invention. In summary, the content of this specification should not be construed as limiting the present invention.

Claims (11)

1. A method of image segmentation, the method comprising:
obtaining a feature map of an image to be segmented;
extracting the features of the image to be segmented in multiple scales to obtain feature maps of multiple scales;
processing the characteristic graphs of multiple scales according to a target dimension, and determining the weight values of the characteristic graphs of multiple scales in the target dimension;
fusing the feature graphs of the multiple scales according to the weight values of the feature graphs of the multiple scales in the target dimension to obtain a fused feature graph;
and segmenting the image to be segmented according to the fusion feature map to obtain the category to which each pixel point included in the image to be segmented belongs.
2. The method of claim 1, further comprising:
carrying out global feature extraction on the feature map of the image to be segmented to obtain a global feature map;
according to the fusion feature map, segmenting the image to be segmented to obtain the category to which each pixel point included in the image to be segmented belongs, including:
and segmenting the image to be segmented according to the fusion feature map, the global feature map and the feature map of the image to be segmented to obtain the category to which each pixel point included in the image to be segmented belongs.
3. The method according to claim 2, wherein the step of segmenting the image to be segmented according to the fusion feature map, the global feature map and the feature map of the image to be segmented to obtain the category to which each pixel included in the image to be segmented belongs comprises:
splicing the fusion feature map, the global feature map and the feature map of the image to be segmented according to the channel dimension to obtain a spliced feature map;
and performing convolution processing on the splicing characteristic graph to obtain the category to which each pixel point included in the image to be segmented belongs.
4. The method according to any one of claims 1 to 3, wherein the processing the feature maps of the plurality of scales according to the target dimension, and determining the weight values of the feature maps of the plurality of scales in the target dimension respectively comprises:
fusing the feature values of the same target dimension in the feature maps of multiple scales to obtain a three-dimensional tensor of the target dimension;
and obtaining the weight values of the feature maps of the multiple scales in the target dimension according to the three-dimensional tensor of the target dimension.
5. The method of claim 4, wherein the target dimension is a channel dimension; and fusing the feature values of the same target dimension in the feature maps of multiple scales to obtain a three-dimensional tensor of the target dimension comprises:
adding the feature values of the same channel dimension in the feature maps of multiple scales to obtain a three-dimensional tensor of the channel dimension;
and obtaining the weight values of the feature maps of the multiple scales in the target dimension according to the three-dimensional tensor of the target dimension comprises:
sequentially inputting the three-dimensional tensor of the channel dimension into a global average pooling layer and a first full-connection layer to obtain a one-dimensional tensor of the channel dimension;
inputting the one-dimensional tensor of the channel dimension into the full connection layer corresponding to each of the multiple scales to obtain the one-dimensional tensor of each of the multiple scales in the channel dimension;
and normalizing the one-dimensional tensors of the multiple scales in the channel dimension respectively to obtain the weight values of the feature maps of the multiple scales in the channel dimension respectively.
6. The method of claim 4, wherein the target dimension is a spatial dimension; and fusing the feature values of the same target dimension in the feature maps of multiple scales to obtain a three-dimensional tensor of the target dimension comprises:
splicing the feature values of the same spatial dimension in the feature maps of multiple scales to obtain a three-dimensional tensor of the spatial dimension;
and obtaining the weight values of the feature maps of the multiple scales in the target dimension according to the three-dimensional tensor of the target dimension comprises:
inputting the three-dimensional tensor of the spatial dimension into the convolution layer to obtain two-dimensional tensors of the multiple scales in the spatial dimension;
and normalizing the two-dimensional tensors of the multiple scales in the space dimension respectively to obtain the weight values of the feature maps of the multiple scales in the space dimension respectively.
7. The method of any of claims 1-6, wherein the target dimensions include a spatial dimension and a channel dimension; and fusing the feature maps of the multiple scales according to the weight values of the feature maps of the multiple scales in the target dimension to obtain a fused feature map comprises:
fusing the feature graphs of the multiple scales according to the weight values of the feature graphs of the multiple scales in the channel dimension to obtain a fused feature graph of the channel dimension;
fusing the feature maps of the multiple scales according to the weight values of the feature maps of the multiple scales in the space dimension to obtain a fused feature map of the space dimension;
and adding or splicing the characteristic values corresponding to the positions on the fusion characteristic diagram of the channel dimension and the fusion characteristic diagram of the space dimension to obtain the fusion characteristic diagram.
8. The method according to any one of claims 1 to 7, wherein the extracting of the features of the image to be segmented in multiple scales to obtain feature maps of multiple scales comprises:
inputting the feature map of the image to be segmented into a plurality of dilated (hole) convolution layers with different step lengths to obtain the feature maps of multiple scales.
9. An image segmentation apparatus, characterized in that the apparatus comprises:
the characteristic image obtaining module is used for obtaining a characteristic image of the image to be segmented;
the multi-scale feature extraction module is used for extracting features of multiple scales from the feature map of the image to be segmented to obtain the feature map of the multiple scales;
the attention module is used for processing the feature maps of the multiple scales according to a target dimension and determining the weight values of the feature maps of the multiple scales in the target dimension;
the fusion module is used for fusing the feature maps of the multiple scales according to the weight values of the feature maps of the multiple scales in the target dimension to obtain a fusion feature map;
and the segmentation module is used for segmenting the image to be segmented according to the fusion feature map to obtain the category to which each pixel point included in the image to be segmented belongs.
10. An electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, the processor when executing implementing the image segmentation method according to any one of claims 1 to 8.
11. A computer-readable storage medium storing a computer program for causing a processor to execute the image segmentation method according to any one of claims 1 to 8.