CN114913182A - Image segmentation method, device, equipment and storage medium - Google Patents


Info

Publication number
CN114913182A
CN114913182A CN202210710965.4A
Authority
CN
China
Prior art keywords
image
feature map
image block
target
convolution
Prior art date
Legal status
Pending
Application number
CN202210710965.4A
Other languages
Chinese (zh)
Inventor
纪德益
赵一儒
陶明渊
黄建强
华先胜
Current Assignee
Hangzhou Alibaba Cloud Feitian Information Technology Co ltd
Original Assignee
Alibaba Damo Institute Hangzhou Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by Alibaba Damo Institute Hangzhou Technology Co Ltd filed Critical Alibaba Damo Institute Hangzhou Technology Co Ltd
Priority to CN202210710965.4A priority Critical patent/CN114913182A/en
Publication of CN114913182A publication Critical patent/CN114913182A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00Image analysis
    • G06T7/10Segmentation; Edge detection
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/20Special algorithmic details
    • G06T2207/20021Dividing image into blocks, subimages or windows
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/20Special algorithmic details
    • G06T2207/20081Training; Learning
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/20Special algorithmic details
    • G06T2207/20084Artificial neural networks [ANN]

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Apparatus For Radiation Diagnosis (AREA)
  • Image Processing (AREA)

Abstract

The application provides an image segmentation method, apparatus, device and storage medium. The method includes: performing feature extraction on a first image block and a second image block in a target image to obtain a first feature map and a second feature map; determining, according to the positional relationship between the first image block and the second image block and a set expansion rate, the convolution range corresponding to a target convolution kernel in the first feature map and the second feature map; performing hole convolution within the convolution range using the target convolution kernel to obtain a third feature map and a fourth feature map; determining the image segmentation results of the first image block and the second image block from the third and fourth feature maps; and determining the image segmentation result of the target image from the image segmentation results of the plurality of image blocks. The cross-block hole convolution operation lets different image blocks perceive each other's context information, so a more accurate segmentation result can be obtained. The method can also be used in the field of virtual reality, where the image segmentation result is rendered to the display of the hardware device.

Description

Image segmentation method, device, equipment and storage medium
Technical Field
The present invention relates to the field of artificial intelligence technologies, and in particular, to an image segmentation method, apparatus, device, and storage medium.
Background
Compared with a normal-resolution image, an Ultra-high Resolution (UHR) image has a resolution of at least 2K, and may reach 4K or 6K. UHR images typically come from remote-sensing satellites, unmanned-aerial-vehicle scene shooting and the like; their content spans a large range and covers a wide area.
In some practical application scenarios, image segmentation (also referred to as semantic segmentation) needs to be performed on an image to predict the class of each pixel in the input image, that is, to perform classification at the pixel level. Semantic segmentation of an input image is generally achieved by training an image segmentation model. For UHR images, the model training process is limited by the training duration and the video memory occupation; in the usual training procedure the UHR image is therefore down-sampled to a smaller size before being fed into the network model, and training is completed with the supervision information of the image. Although this training scheme reduces video memory occupation by shrinking the image, the down-sampling operation loses many local detail features of the image, so the accuracy of the final segmentation result is poor.
Disclosure of Invention
The embodiment of the invention provides an image segmentation method, an image segmentation device, image segmentation equipment and a storage medium, which are used for improving the accuracy of an image segmentation result.
In a first aspect, an embodiment of the present invention provides an image segmentation method, where the method includes:
partitioning a target image to obtain a plurality of image blocks;
respectively extracting features of a first image block and a second image block in the plurality of image blocks to obtain a first feature map and a second feature map;
determining a convolution range corresponding to a target convolution kernel in the first feature map and the second feature map according to the position relation of the first image block and the second image block and a set expansion rate, wherein the target convolution kernel corresponds to the expansion rate;
performing hole convolution processing in the convolution range by using the target convolution kernel to obtain a third feature map after updating the first feature map and a fourth feature map after updating the second feature map;
determining image segmentation results of the first image block and the second image block according to the third feature map and the fourth feature map;
and determining the image segmentation result of the target image according to the image segmentation results of the image blocks.
In a second aspect, an embodiment of the present invention provides an image segmentation apparatus, including:
the segmentation module is used for segmenting the target image to obtain a plurality of image blocks;
the extraction module is used for respectively extracting features of a first image block and a second image block in the plurality of image blocks to obtain a first feature map and a second feature map;
a convolution module, configured to determine, according to a position relationship between the first image block and the second image block and a set expansion rate, a convolution range corresponding to a target convolution kernel in the first feature map and the second feature map, where the target convolution kernel corresponds to the expansion rate; performing hole convolution processing in the convolution range by using the target convolution kernel to obtain a third feature map after updating the first feature map and a fourth feature map after updating the second feature map;
the segmentation module is used for determining image segmentation results of the first image block and the second image block according to the third feature map and the fourth feature map; and determining the image segmentation result of the target image according to the image segmentation results of the image blocks.
In a third aspect, an embodiment of the present invention provides an electronic device, including: a memory, a processor, a communication interface; wherein the memory has stored thereon executable code which, when executed by the processor, causes the processor to perform the image segmentation method according to the first aspect.
In a fourth aspect, embodiments of the present invention provide a non-transitory machine-readable storage medium having stored thereon executable code, which when executed by a processor of an electronic device, causes the processor to implement at least the image segmentation method according to the first aspect.
In a fifth aspect, an embodiment of the present invention provides an image segmentation method, where the method includes:
receiving a request triggered by user equipment by calling an image segmentation service, wherein the request comprises a target image;
executing the following steps by utilizing the processing resource corresponding to the image segmentation service:
partitioning a target image to obtain a plurality of image blocks;
respectively extracting features of a first image block and a second image block in the plurality of image blocks to obtain a first feature map and a second feature map;
determining convolution ranges corresponding to target convolution kernels in the first feature map and the second feature map according to the position relation of the first image block and the second image block and a set expansion rate, wherein the target convolution kernels correspond to the expansion rate;
performing hole convolution processing in the convolution range by using the target convolution kernel to obtain a third feature map after updating the first feature map and a fourth feature map after updating the second feature map;
determining image segmentation results of the first image block and the second image block according to the third feature map and the fourth feature map;
determining an image segmentation result of the target image according to the image segmentation results of the image blocks;
and feeding back the target image marked with the image segmentation result to the user equipment.
In a sixth aspect, an embodiment of the present invention provides an image segmentation method, which is applied to an augmented reality device, and the method includes:
partitioning the obtained target image to obtain a plurality of image blocks;
respectively extracting features of a first image block and a second image block in the plurality of image blocks to obtain a first feature map and a second feature map;
determining a convolution range corresponding to a target convolution kernel in the first feature map and the second feature map according to the position relation of the first image block and the second image block and a set expansion rate, wherein the target convolution kernel corresponds to the expansion rate;
performing hole convolution processing in the convolution range by using the target convolution kernel to obtain a third feature map after updating the first feature map and a fourth feature map after updating the second feature map;
determining image segmentation results of the first image block and the second image block according to the third feature map and the fourth feature map;
determining an image segmentation result of the target image according to the image segmentation results of the image blocks;
and displaying the target image marked with the image segmentation result.
In the embodiment of the present invention, when image segmentation needs to be performed on an ultra-high-resolution target image, the target image is first partitioned into a plurality of image blocks, and feature extraction is performed on a first image block and a second image block among the plurality of image blocks to obtain a first feature map and a second feature map. Then, according to the position relation between the first image block and the second image block and the set expansion rate, a convolution range corresponding to a target convolution kernel is determined in the first feature map and the second feature map, where the target convolution kernel is a hole convolution kernel with that expansion rate. Hole convolution processing is then performed within the convolution range using the target convolution kernel to obtain a third feature map (the updated first feature map) and a fourth feature map (the updated second feature map), and the image segmentation results of the first image block and the second image block are determined from the third and fourth feature maps. Since the first image block and the second image block may be any two image blocks of the target image, performing the above processing on every image block yields an image segmentation result for each image block, and thus the image segmentation result of the whole target image.
In the above scheme, a cross-block hole convolution mode is provided: the convolution range covers a partial region of the first feature map corresponding to the first image block and a partial region of the second feature map corresponding to the second image block. Through this cross-block hole convolution, one image block can perceive information in other image blocks of the full image, thereby obtaining full-image context information.
Drawings
In order to more clearly illustrate the technical solutions in the embodiments of the present invention, the drawings needed in the description of the embodiments are briefly introduced below. It is apparent that the drawings described below illustrate only some embodiments of the present invention, and that those skilled in the art can obtain other drawings from them without creative effort.
FIG. 1 is a schematic diagram of a standard convolution kernel and a hole convolution kernel;
FIG. 2 is a flowchart of an image segmentation method according to an embodiment of the present invention;
FIG. 3 is a schematic diagram of a distance between a pair of image blocks according to an embodiment of the present invention;
FIG. 4 is a schematic diagram illustrating a convolution range of a pair of feature maps provided by an embodiment of the present invention;
FIG. 5 is a schematic diagram of cross-block hole convolution according to an embodiment of the present invention;
FIG. 6 is a schematic diagram illustrating an application of an image segmentation method according to an embodiment of the present invention;
FIG. 7 is a flowchart of an image segmentation method according to an embodiment of the present invention;
FIG. 8 is a schematic diagram of an implementation of the embodiment shown in FIG. 7;
FIG. 9 is a flowchart of a method for training an image segmentation model according to an embodiment of the present invention;
FIG. 10 is a flowchart of a method for training an image segmentation model according to an embodiment of the present invention;
FIG. 11 is a schematic diagram illustrating an application of an image segmentation method according to an embodiment of the present invention;
FIG. 12 is a schematic structural diagram of an image segmentation apparatus according to an embodiment of the present invention;
FIG. 13 is a schematic structural diagram of an electronic device according to an embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some, but not all, embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
In addition, the sequence of steps in each method embodiment described below is only an example and is not strictly limited.
The following concepts involved in the embodiments of the present invention will be explained.
Image segmentation, also known as semantic segmentation, is used to predict the class of each pixel of the input image.
Convolution kernel: when an image is processed, each pixel of the output image is formed by weighted averaging of the pixels in a small region of the input image, where the weights are defined by a function; this function is called the convolution kernel.
Hole Convolution (Dilated Convolution), also known as dilation convolution, is a convolution operator in neural networks. Unlike standard convolution, hole convolution inserts holes into the convolution kernel, which enlarges the receptive field. Hole convolution has one hyper-parameter, the dilation rate (also rendered as expansion rate herein), which represents the spacing between the weights of the convolution kernel. Compared with standard convolution, hole convolution can obtain a larger receptive field for the same amount of computation by setting different dilation rates. Standard convolution has a dilation rate of 1; FIG. 1 illustrates a standard 3 x 3 convolution kernel and the corresponding hole convolution kernel with a dilation rate of 2.
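As an illustrative sketch of the dilation rate's effect (assuming a PyTorch environment; the tensor and channel sizes are arbitrary examples rather than part of this embodiment):

```python
import torch
import torch.nn as nn

x = torch.randn(1, 16, 64, 64)          # dummy feature map: N, C, H, W

# standard 3x3 convolution (dilation rate 1): receptive field 3x3
std_conv = nn.Conv2d(16, 16, kernel_size=3, padding=1, dilation=1)

# hole (dilated) convolution with dilation rate 2: the same 9 weights,
# but the taps are spaced 2 pixels apart, so the receptive field grows to 5x5
dil_conv = nn.Conv2d(16, 16, kernel_size=3, padding=2, dilation=2)

print(std_conv(x).shape, dil_conv(x).shape)  # both keep the 64x64 spatial size
```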
Cross-block hole convolution refers to hole convolution that spans different image blocks, and is a new convolution mode proposed in the embodiments of the present invention. Assume that the image to be convolved is image x and take a standard 3 x 3 convolution kernel as an example. During convolution, one pixel of image x is visited at a time; that pixel and its eight surrounding pixels form a 3 x 3 grid, which serves as the current convolution window, and the 9 pixels are multiplied by the 9 weights of the kernel and summed to produce the feature value of that pixel in the output image. In the next convolution calculation another pixel is visited, that is, the window slides to the next position, the convolution is computed, and so on. Based on this principle, in cross-block hole convolution one convolution window covers two image blocks, i.e. one convolution window contains pixels from two image blocks. In the embodiments of the present invention, a pixel refers to a pixel position rather than a pixel value.
Block-cutting training: when an image segmentation model is trained on ultra-high-resolution images, the original large image is uniformly cut into small image blocks that are fed into the network model separately, so that the video memory occupation remains reasonable despite the large image size.
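A minimal sketch of such block cutting (assuming PyTorch; the 2048-pixel image size, the 512-pixel block size and the helper name split_into_blocks are hypothetical choices for illustration):

```python
import torch

def split_into_blocks(image: torch.Tensor, block: int):
    """Uniformly cut a (C, H, W) image into non-overlapping block x block tiles.
    Assumes H and W are divisible by `block` (pad beforehand otherwise)."""
    c, h, w = image.shape
    tiles = image.unfold(1, block, block).unfold(2, block, block)   # C, H/b, W/b, b, b
    tiles = tiles.permute(1, 2, 0, 3, 4).contiguous()               # rows, cols, C, b, b
    return tiles  # tiles[i, j] is the block at grid position (i, j)

uhr = torch.randn(3, 2048, 2048)       # a mock ultra-high-resolution image
blocks = split_into_blocks(uhr, 512)   # 4 x 4 grid of 512x512 blocks
print(blocks.shape)                    # torch.Size([4, 4, 3, 512, 512])
```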
In practical applications, image segmentation is usually accomplished by training an image segmentation model, and for ultra-high-resolution images a block-based training method can be adopted: the original large image is uniformly cut into small image blocks, which are fed into the network model separately for training. In one alternative training mode, any cut image block is input into the model, the model predicts the image segmentation result of that block (i.e. the class of each pixel), and the loss function is then computed from the prediction and the image segmentation supervision information of the block, so that the model parameters are adjusted according to the loss. Although this block-cutting approach reduces video memory occupation, each image block is processed independently: the model can only learn the information within a single block, the global context of the original image is lost, and the accuracy of the final segmentation of the original image suffers.
The embodiment of the invention provides a new image segmentation scheme, and a more accurate segmentation result is obtained under the condition of ensuring reasonable video memory occupation.
Fig. 2 is a flowchart of an image segmentation method according to an embodiment of the present invention, as shown in fig. 2, the method includes the following steps:
201. and partitioning the target image to obtain a plurality of image blocks.
202. And respectively performing feature extraction on a first image block and a second image block in the plurality of image blocks to obtain a first feature map and a second feature map.
203. And determining a convolution range corresponding to a target convolution kernel in the first feature map and the second feature map according to the position relation of the first image block and the second image block and the set expansion rate, wherein the target convolution kernel corresponds to the expansion rate.
204. And performing hole convolution processing in the convolution range by adopting a target convolution kernel to obtain a third feature map after the first feature map is updated and a fourth feature map after the second feature map is updated.
205. Determining image segmentation results of the first image block and the second image block according to the third feature map and the fourth feature map; and determining an image segmentation result of the target image according to the image segmentation results of the plurality of image blocks.
The image segmentation method provided by the embodiment of the present invention is suitable for performing image segmentation on ultra-high-resolution images. Following the block-based training idea, the target image to be segmented is first cut into a plurality of image blocks. For ease of processing, the image blocks have the same size.
An image segmentation model may be trained in advance, and the image segmentation process on the target image may be implemented using the image segmentation model. However, the training mode of the image segmentation model is different from the traditional mode of performing model training based on image blocks, and will be described in detail below. Since the model training process is similar to the process of image segmentation processing on the target image using the model, only the process using the model will be described first.
After the target image is subjected to the segmentation processing of a plurality of image blocks, a pair of image blocks, namely a first image block and a second image block, can be randomly selected from the target image, and then the two image blocks are respectively input into an image segmentation model for feature extraction, so that a first feature map and a second feature map are obtained.
In practical applications, the image segmentation model may include a plurality of convolutional layers for feature extraction, each outputting a feature map at one scale. The first feature map and the second feature map are the feature maps output by a target convolutional layer of the model. The target convolutional layer can be chosen experimentally and is generally one of the later convolutional layers, such as the second-to-last or third-to-last convolutional layer; at that depth, the feature maps obtained through the preceding convolutional layers already contain rich features.
In other words, in the present embodiment, it is not required that the cross-block hole convolution processing be performed on both of the pair of feature maps output by each convolutional layer, and the cross-block hole convolution processing may be performed only on the pair of feature maps output by the target convolutional layer, for example, the pair of feature maps corresponding to the pair of input image blocks, such as the first feature map and the second feature map.
As can be seen from the above explanation of cross-block hole convolution, in one convolution calculation the pixels contained in the convolution window come from two feature maps. In the feature map output by the convolution (the updated feature map), the feature vector of a given pixel is therefore obtained by a weighted combination, through the convolution kernel, of the feature vectors of the other pixels in the two feature maps that fall into the same convolution window as that pixel. This allows the pixels of one image block to perceive the information contained in another image block, so that image segmentation is ultimately performed not on feature maps that are independent per image block, but on feature maps that fuse the context information of different image blocks, and a more accurate segmentation result can be obtained.
Specifically, in order to perform cross-block hole convolution on the first feature map and the second feature map, it is first necessary to determine, according to the positional relationship between the first image block and the second image block and the set expansion rate, the convolution range corresponding to the target convolution kernel, i.e. the convolution kernel having that expansion rate, in the first feature map and the second feature map.
In practical applications, the expansion ratio can be set according to practical requirements, such as values of 100, 150, 200, and the like.
It should be noted that if the first image block and the second image block are a pair of image blocks randomly selected from the plurality of image blocks of the target image, they may lie far apart in the target image. In that case, if the distance corresponding to the given expansion rate is smaller than the distance between the two image blocks, no convolution range computable by the target convolution kernel can be determined in the first feature map and the second feature map.
Based on this, optionally, on the one hand, a plurality of different expansion ratios may be set, i.e., convolution kernels corresponding to the plurality of different expansion ratios are set, so that even in the case where the corresponding convolution range cannot be determined based on a smaller expansion ratio, the corresponding convolution range may be able to be determined based on a larger expansion ratio; moreover, even in the case where the corresponding convolution range can be determined based on a smaller expansion ratio, the setting of the convolution kernel corresponding to a larger expansion ratio can increase the perception range, and more context information can be obtained.
On the other hand, in order to ensure that as many image block pairs as possible can be subjected to cross-block hole convolution, taking the first image block and the second image block as an example, the first image block and the second image block may be determined according to a set expansion ratio, where a first distance between the first image block and the second image block is smaller than a second distance corresponding to the expansion ratio. That is, a pair of image blocks is selected from the plurality of image blocks under the constraint of the set expansion ratio.
Of course, in the case where a plurality of different expansion ratios are set, the selection of the image block may alternatively be made based on the smaller expansion ratio among them.
The distance between image blocks may be measured as the distance between the two closest vertices of the two image blocks, as shown in FIG. 3. Regarding the distance corresponding to the expansion rate: a dilation rate of 2 means that one hole is inserted between two adjacent weights of the convolution kernel, i.e. the distance between two weights is 2; in the feature map this is equivalent to weighting, for a given pixel i, the pixel j that is 2 pixels away, so a dilation rate of 2 corresponds to a distance of 2 pixels. Both the first distance between image blocks and the second distance corresponding to the expansion rate are therefore pixel distances, i.e. a number of pixels.
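The pair-selection constraint described above can be sketched as follows (a simplified illustration: the closest-vertex distance is approximated on the block grid, and the function names and the example block size of 512 pixels are hypothetical):

```python
def block_distance(pos_a, pos_b, block_size):
    """Approximate pixel distance between the closest vertices of two blocks.
    pos_a / pos_b are (row, col) indices of the blocks in the tiling grid."""
    dr = max(abs(pos_a[0] - pos_b[0]) - 1, 0) * block_size
    dc = max(abs(pos_a[1] - pos_b[1]) - 1, 0) * block_size
    return (dr ** 2 + dc ** 2) ** 0.5

def can_pair(pos_a, pos_b, block_size, dilation_rate):
    """A pair is usable for cross-block hole convolution when the first distance
    (between the blocks) is smaller than the second distance (the reach of the
    dilation rate, in pixels)."""
    return block_distance(pos_a, pos_b, block_size) < dilation_rate

print(can_pair((0, 0), (0, 1), block_size=512, dilation_rate=100))  # adjacent blocks -> True
print(can_pair((0, 0), (0, 3), block_size=512, dilation_rate=100))  # far apart -> False
```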
Whether the first image block and the second image block are selected randomly or based on a set expansion ratio, determining a convolution range corresponding to the target convolution kernel in the first feature map and the second feature map according to a position relationship between the first image block and the second image block and the set expansion ratio may be implemented as:
and determining a first area in the first feature map and a second area in the second feature map according to a first distance (which can also be combined with the relative position) between the first image block and the second image block and a second distance corresponding to the expansion rate, wherein the first distance is smaller than the second distance, and at least one pair of pixels in the first area and the second area falls into a convolution window of the target convolution kernel. Thus, the first region in the first feature map and the second region in the second feature map are determined to be the convolution range corresponding to the target convolution kernel.
For ease of understanding, this is illustrated in connection with FIG. 4.
In FIG. 4, assume that the first image block and the second image block are an image block P and an image block N determined from the target image, where image block N lies to the upper left of image block P, and that their feature maps are denoted FP and FN respectively. Assume further that convolution kernels with three different expansion rates are set, with d1 > d2 > d3, and that the second distance corresponding to d3 is larger than the first distance between image block P and image block N. Based on the expansion rate d3 and the distance between the two image blocks, a pair of regions W3p and W3n can then be determined in feature maps FP and FN. These regions must satisfy the following condition: at least one pair of pixels falls within a sliding convolution window of the kernel with dilation rate d3; that is, there is at least one pixel i in region W3p for which a pixel j at a distance of at least d3 can be found in region W3n, i.e. the distance between pixel i and pixel j is greater than or equal to d3. Under this condition, the pair of regions W3p and W3n determined in FP and FN constitutes the convolution range of the kernel corresponding to expansion rate d3. Similarly, in FIG. 4 it is assumed that the region pairs W1p/W1n and W2p/W2n are determined based on expansion rates d1 and d2 respectively.
Taking the expansion ratio d3 as an example, after obtaining the region W3p in the feature map FP and the region W3n in the feature map FN, the convolution processing may be performed as follows:
determining a convolution window corresponding to a first pixel within region W3 p;
determining a second pixel within region W3n that falls within the convolution window;
determining a first weight corresponding to the first pixel and a second weight corresponding to the second pixel in the target convolution kernel according to the relative position between the first image block and the second image block and/or the distance between the first pixel and the second pixel;
performing convolution processing on the first pixel and the second pixel according to the first weight and the second weight to obtain an updated feature vector corresponding to the first pixel;
a third feature map is generated from the updated feature vector corresponding to each pixel in the region W3 p.
In this embodiment, it is assumed that the first feature map FP is taken as the reference and its features are updated by cross-block hole convolution. In practice, the pixels contained in region W3p of feature map FP may be traversed one by one. Assume that the currently visited pixel is the first pixel marked by a circle in FIG. 4; the corresponding second pixel is then sought in region W3n. Specifically, assume that the standard kernel corresponding to expansion rate d3 is 3 x 3, i.e. the target convolution kernel is obtained by inserting d3-1 holes (zero padding) between adjacent weights of the standard kernel. Under the assumption that the convolution window of the first pixel is centred on that pixel, a convolution window can be determined from the expansion rate d3, i.e. the positions of the 9 pixels weighted by the original 9 kernel weights are determined (the inserted zero weights and their pixels are ignored, since their products are zero). These are the first pixel and eight surrounding positions, at least one of which lies in region W3n; that at least one pixel of region W3n falling within the convolution window is determined to be the second pixel.
Then, according to the relative position between image block N and image block P and/or the distance between the first pixel and the second pixel, the first weight corresponding to the first pixel and the second weight corresponding to the second pixel are determined in the target convolution kernel with expansion rate d3. Specifically, in FIG. 4 it is assumed that image block N lies to the upper left of image block P and that there is a single second pixel in region W3n as illustrated. Because image block N is to the upper left of image block P, the second pixel can serve as the upper-left vertex of the convolution window and the first pixel as its centre. In that case only the weights at the centre position and the upper-left vertex position of the target convolution kernel need to be activated (used), namely the first weight for the first pixel and the second weight for the second pixel, while the other 7 weights (and the inserted zeros) are not used. The feature vector of the first pixel in FP is multiplied by the first weight, the feature vector of the second pixel in FN is multiplied by the second weight, and the two products are summed to give the feature vector of the first pixel in the updated feature map FP' (i.e. the third feature map). In this example, as shown in FIG. 4, every pixel contained in region W3p of feature map FP corresponding to image block P is traversed in turn and obtains an updated feature vector after the convolution, while the feature vectors of the pixels outside region W3p remain unchanged; the updated feature map FP' corresponding to FP can thus be generated, while feature map FN is not actually changed. It can be understood that if instead the pixels contained in W3n of feature map FN are traversed, i.e. the convolution windows are determined from the pixels of that region, then the updated feature map corresponding to FN is obtained and FP remains unchanged; if the pixels of both regions in both feature maps are traversed, both feature maps are updated. In practice, considering the amount of computation, it is not necessary to traverse both regions of the two feature maps.
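The single-window update described above can be sketched as follows (a simplification for illustration: per-channel weights stand in for the full convolution kernel, and the function name and tensor sizes are hypothetical):

```python
import torch

def cross_block_update(fp, fn, i, j, w_center, w_corner):
    """Update the feature vector of pixel `i` (row, col) in feature map FP using
    pixel `j` in feature map FN of a neighbouring block. Only two weights of the
    dilated kernel are activated: the centre weight for the first pixel and the
    corner weight matching the blocks' relative position for the second pixel.
    fp, fn: (C, H, W) feature maps; w_center, w_corner: (C,) per-channel weights."""
    v_first = fp[:, i[0], i[1]]           # feature vector of the first pixel
    v_second = fn[:, j[0], j[1]]          # feature vector of the second pixel
    return w_center * v_first + w_corner * v_second

fp = torch.randn(8, 32, 32)
fn = torch.randn(8, 32, 32)
w_center = torch.randn(8)
w_corner = torch.randn(8)
updated = cross_block_update(fp, fn, (30, 30), (2, 2), w_center, w_corner)
print(updated.shape)                      # torch.Size([8]) -> new vector of pixel i in FP'
```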
For ease of understanding, the above process of convolution across block holes is illustrated in conjunction with FIG. 5.
Fig. 5 illustrates a situation where two image blocks P and an image block N exhibit several different positional relationships. And in fig. 5, a hole convolution kernel obtained by setting a certain expansion ratio on a standard 3 x 3 convolution kernel is assumed, wherein 9 weights in the standard convolution kernel are w1-w9 schematically shown in the figure, and black squares represent inserted holes.
In the first case, assume that image block P lies to the lower right of image block N; the two circles in the figure represent the first pixel and the second pixel determined in the first area and the second area. Since the convolution window shown in the figure is centred on the first pixel and the second pixel lies to the upper left of the first pixel, combining this with the positions of the 9 weights in the figure, only weights w5 and w1 are used, determined solely from the relative position of the first and second pixels (i.e. of the two image blocks): weight w5 corresponds to the first pixel and weight w1 to the second pixel. The other 7 weights are not used in this calculation.
In the second case, assume that image block P lies directly below image block N; the two circles in the figure again represent the first pixel and the second pixel determined in the two areas. Since the convolution window is centred on the first pixel and the second pixel is directly above it, only weights w5 and w2 are used, again determined solely from the relative position of the two pixels (i.e. of the two image blocks): weight w5 corresponds to the first pixel and weight w2 to the second pixel.
In a third case (the two image blocks are not shown in the figure), when two second pixels exist in the convolution window centred on the first pixel, the weights corresponding to the two second pixels can be determined from their respective distances and positional relationships to the first pixel; in the figure these are assumed to be weights w1 and w2.
The above only takes the target convolution kernel corresponding to one expansion rate as an example to illustrate how to perform the cross-block hole convolution. It can be known from the above example that, in one convolution calculation process, only a small amount of weights in the target convolution kernel are used, that is, the calculation amount in each convolution processing process is relatively small, so that the occupation of the video memory is reduced, because the video memory only needs to store a small amount of convolution calculation results.
In addition, as described above, several convolution kernels with different expansion rates may be provided in practice. In that case, for the same first pixel, the updated feature vectors computed by each kernel are superimposed to update the feature vector of that pixel. For example, based on the kernel corresponding to d3, the feature vectors of the first and second pixels in feature maps FP and FN are weighted and summed to obtain feature vector T3; based on the kernel corresponding to d2, they are weighted and summed to obtain feature vector T2; and based on the kernel corresponding to d1, to obtain feature vector T1. Feature vectors T1, T2 and T3 are then added to obtain feature vector T4, which becomes the feature vector of the first pixel in the updated feature map FP'.
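A sketch of this superposition over several expansion rates (the dilation-rate values and the helper name are illustrative assumptions):

```python
import torch

dilation_rates = [100, 150, 200]     # example expansion rates only

def multi_rate_update(per_rate_updates):
    """Superimpose the updated feature vectors produced by the convolution
    kernels of every dilation rate (T1, T2, T3 in the text) into one vector T4."""
    return torch.stack(per_rate_updates, dim=0).sum(dim=0)

updates = [torch.randn(8) for _ in dilation_rates]   # one update per dilation rate
t4 = multi_rate_update(updates)
print(t4.shape)   # torch.Size([8])
```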
Based on the above description of the embodiment, after obtaining the third feature map after the first feature map is updated and the fourth feature map after the second feature map is updated (in the above embodiment, it is assumed that the fourth feature map is the same as the second feature map), the image segmentation results of the first image block and the second image block, that is, the classes corresponding to the pixels in the first image block and the second image block, may be determined according to the third feature map and the fourth feature map.
Because the first image block and the second image block are an arbitrary pair of image blocks cut from the target image, processing every image block pair contained in the target image as above yields the segmentation result of each image block, and from those results the image segmentation result of the target image can be obtained: according to the class of each pixel in each image block, at least one frame (bounded region) per class can be marked in the target image, where a set of contiguous pixels of the same class determines one frame.
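A sketch of assembling the per-block results into the full-image segmentation map (assuming non-overlapping blocks on a regular grid; names and sizes are illustrative):

```python
import torch

def stitch_blocks(block_preds, grid_rows, grid_cols, block):
    """Assemble per-block pixel-class predictions into a full segmentation map.
    block_preds[(i, j)] is a (block, block) tensor of class indices for the
    image block at grid position (i, j)."""
    full = torch.zeros(grid_rows * block, grid_cols * block, dtype=torch.long)
    for (i, j), pred in block_preds.items():
        full[i * block:(i + 1) * block, j * block:(j + 1) * block] = pred
    return full

preds = {(i, j): torch.randint(0, 5, (512, 512)) for i in range(4) for j in range(4)}
seg_map = stitch_blocks(preds, 4, 4, 512)
print(seg_map.shape)   # torch.Size([2048, 2048]), one class index per pixel
```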
In summary, according to the cross-block hole convolution scheme provided by the embodiment of the present invention, the model can sense context information between different image blocks, so that image segmentation processing on the target image can be more accurately implemented. Moreover, when the cross-block hole convolution operation is executed each time, only the weights of a small number of positions in the convolution kernel are activated and used, the calculation amount is small, and the occupation of the video memory is reduced.
The practical implementation of the embodiment shown in fig. 2 can refer to fig. 6, wherein it is assumed that the target image is a landscape image captured by an unmanned aerial vehicle.
Fig. 7 is a flowchart of an image segmentation method according to an embodiment of the present invention, and as shown in fig. 7, the method includes the following steps:
701. and partitioning the target image to obtain a plurality of image blocks.
702. Respectively extracting features of a first image block and a second image block in the plurality of image blocks to obtain a first feature map and a second feature map; and performing down-sampling processing on the target image, and performing feature extraction on the down-sampled target image to obtain a fifth feature map.
703. And determining a convolution range corresponding to a target convolution kernel in the first feature map and the second feature map according to the position relation of the first image block and the second image block and the set expansion rate, wherein the target convolution kernel corresponds to the expansion rate.
704. And performing hole convolution processing in the convolution range by adopting a target convolution kernel to obtain a third feature map after the first feature map is updated and a fourth feature map after the second feature map is updated.
705. And according to the positions of the first image block and the second image block in the target image, a sixth feature map corresponding to the first image block and a seventh feature map corresponding to the second image block are cut out from the fifth feature map.
706. Fusing the third feature map and the sixth feature map, fusing the fourth feature map and the seventh feature map, and determining image segmentation results of the first image block and the second image block according to the fused feature maps; and determining an image segmentation result of the target image according to the image segmentation results of the plurality of image blocks.
The image segmentation scheme provided by this embodiment is based on the above cross-block hole convolution, and is assisted by global feature information to complete semantic segmentation of the target image. The timing relationship of the above steps in this embodiment is not strictly limited.
For ease of understanding, the description is exemplified in conjunction with fig. 8.
To complete the image segmentation of the target image, an image segmentation model is still used. As shown in FIG. 8, the model includes two branches, one containing the first feature extractor and the other containing the second feature extractor. In addition to the feature extractors, each branch may be followed by an "output layer" (not shown in the figure), i.e. a network layer that performs the classification prediction.
In practice, each feature extractor may comprise a plurality of convolutional layers. The two feature extractors may be identical in structure, except for the parameters, which depend on the model training results.
As shown in FIG. 8, the first image block and the second image block of the target image (image block P and image block N) are each input into the second feature extractor for feature extraction, yielding the first feature map and the second feature map. For the first feature extractor, the target image is first down-sampled, for example to the same size as the first and second image blocks, and the down-sampled target image is then input into the first feature extractor for feature extraction to obtain a fifth feature map. Since the fifth feature map is obtained by global feature extraction of the down-sampled target image, coarse full-image context information can be acquired from it.
And then, according to the positions of the first image block and the second image block in the target image, a sixth feature map corresponding to the first image block and a seventh feature map corresponding to the second image block are cut out from the fifth feature map. It can be understood that, according to the mapping relationship between the fifth feature map output by the first feature extractor and the pixels in the target image, image areas corresponding to the first image block and the second image block respectively can be located in the fifth feature map as the sixth feature map and the seventh feature map.
In an optional embodiment, as described above, cross-block hole convolution processing may be performed on the first feature map and the second feature map (refer to the related description above, and are not described herein) to obtain a third feature map and a fourth feature map (the sizes of which are the same as those of the first feature map and the second feature map) after updating, and then the third feature map and the sixth feature map corresponding to the first image block are fused to obtain a fused feature map a, and the fourth feature map and the seventh feature map corresponding to the second image block are fused to obtain a fused feature map B.
However, in order to fuse feature maps, their sizes must first be unified. Because the target image is down-sampled to the same size as the first and second image blocks, the sixth and seventh feature maps are generally smaller than the third and fourth feature maps; the sixth and seventh feature maps are therefore up-sampled before fusion so that their sizes match the third and fourth feature maps. Finally, the feature vectors of the same pixel in the third feature map and the sixth feature map are added to obtain the feature vector of that pixel in the fused feature map A; feature map B is obtained in the same way.
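A sketch of this crop / up-sample / add fusion (assuming PyTorch; the grid layout, sizes and helper name are illustrative assumptions rather than the exact structure of FIG. 8):

```python
import torch
import torch.nn.functional as F

def fuse_local_global(local_feat, global_feat, block_pos, grid, out_size):
    """Crop the region of the global (down-sampled) feature map that corresponds
    to one image block, up-sample it to the local feature map's size, and add the
    two element-wise (the fusion that yields feature map A / B in the text).
    local_feat:  (C, h, w) feature map of one block (third/fourth feature map)
    global_feat: (C, H, W) fifth feature map of the down-sampled full image
    block_pos:   (row, col) of the block in the tiling grid
    grid:        (rows, cols) of the tiling grid"""
    C, H, W = global_feat.shape
    bh, bw = H // grid[0], W // grid[1]
    r, c = block_pos
    crop = global_feat[:, r * bh:(r + 1) * bh, c * bw:(c + 1) * bw]        # sixth/seventh map
    crop = F.interpolate(crop.unsqueeze(0), size=out_size,
                         mode="bilinear", align_corners=False).squeeze(0)  # up-sample
    return local_feat + crop                                               # fused map

local = torch.randn(64, 128, 128)          # third feature map of one block
global_map = torch.randn(64, 128, 128)     # fifth feature map (coarse, full image)
fused = fuse_local_global(local, global_map, (1, 2), (4, 4), (128, 128))
print(fused.shape)   # torch.Size([64, 128, 128])
```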
In another alternative embodiment, after the sixth feature map and the seventh feature map are obtained, the above up-sampling is performed first, and the sixth and seventh feature maps are then fused with the first and second feature maps respectively to obtain feature map A' and feature map B'; cross-block hole convolution is subsequently performed on feature maps A' and B' to obtain the updated feature map A and feature map B.
After the feature map a and the feature map B are obtained, the image segmentation result of the first image block can be obtained according to the feature map a, and the image segmentation result of the second image block can be obtained according to the feature map B.
The above describes the use process of the image segmentation model, and the following describes the training process of the image segmentation model.
Fig. 9 is a flowchart of an image segmentation model training method according to an embodiment of the present invention, as shown in fig. 9, the method may include the following steps:
901. and partitioning the target image to obtain a plurality of image blocks, wherein the target image is a training sample image of the image segmentation model.
902. And respectively extracting features of a first image block and a second image block in the plurality of image blocks through an image segmentation model to obtain a first feature map and a second feature map, wherein the first image block and the second image block are a pair of image blocks input in the training process.
903. And determining a convolution range corresponding to a target convolution kernel in the first feature map and the second feature map according to the position relation of the first image block and the second image block and the set expansion rate, wherein the target convolution kernel corresponds to the expansion rate.
904. And performing hole convolution processing in the convolution range by adopting a target convolution kernel to obtain a third feature map after the first feature map is updated and a fourth feature map after the second feature map is updated.
905. And determining the category corresponding to each pixel in the third feature map and the fourth feature map according to the pixel category supervision information corresponding to the first image block and the second image block, and determining the sum result of the feature vectors corresponding to each pixel belonging to the target category in the third feature map and the fourth feature map as the category center feature vector corresponding to the target category in the training process.
906. And adding the class central feature vector corresponding to the target class in the training process and the class central feature vector corresponding to the target class in the previous training processes, and updating the feature vectors of the pixels corresponding to the target class in the third feature map and the fourth feature map respectively according to the added class central feature vector corresponding to the target class to obtain an eighth feature map and a ninth feature map.
907. And determining the image segmentation result of the first image block and the second image block according to the eighth feature map and the ninth feature map.
908. Determining a first loss corresponding to the first image block and a second loss corresponding to the second image block according to the image segmentation results of the first image block and the second image block and the pixel class supervision information corresponding to the first image block and the second image block respectively, and training an image segmentation model according to the sum result of the first loss and the second loss.
Wherein the model is trained on the basis of the sum of losses, i.e. the model parameters are adjusted by back propagation.
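A sketch of one such training iteration (assuming PyTorch; the pair-wise forward interface model(block_p, block_n) and the stand-in backbone are assumptions for illustration, while the real model also performs the cross-block hole convolution internally):

```python
import torch
import torch.nn as nn

class TinyPairModel(nn.Module):
    """Stand-in for the real segmentation model: one shared conv backbone applied
    to each block of the pair, producing per-pixel class logits."""
    def __init__(self, num_classes=5):
        super().__init__()
        self.backbone = nn.Conv2d(3, num_classes, kernel_size=3, padding=1)

    def forward(self, block_p, block_n):
        return self.backbone(block_p), self.backbone(block_n)

model = TinyPairModel()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
criterion = nn.CrossEntropyLoss()

block_p = torch.randn(1, 3, 64, 64)
block_n = torch.randn(1, 3, 64, 64)
label_p = torch.randint(0, 5, (1, 64, 64))   # pixel class supervision of block P
label_n = torch.randint(0, 5, (1, 64, 64))   # pixel class supervision of block N

logits_p, logits_n = model(block_p, block_n)
loss = criterion(logits_p, label_p) + criterion(logits_n, label_n)  # first loss + second loss
optimizer.zero_grad()
loss.backward()                               # back propagation on the summed loss
optimizer.step()
print(loss.item())
```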
The image segmentation model trained in this embodiment can be used to perform the image segmentation scheme provided in the embodiment shown in FIG. 2. In this embodiment, for ease of description, the target image is assumed to be one training sample image of the model training sample set. For the execution of steps 901 to 904, reference may be made to the related descriptions in the foregoing embodiments, which are not repeated here.
In the model training process, auxiliary information of a class center feature vector is introduced.
Specifically, in step 905, after the third feature map and the fourth feature map are obtained, the categories corresponding to the pixels in the third feature map and the fourth feature map are determined according to the pixel category supervision information corresponding to the corresponding first image block and the corresponding second image block.
The supervision information of the target image is referred to as pixel class supervision information; it labels, for each pixel of the target image, the corresponding class among a plurality of set classes. Based on this labelling of the target image, the supervision information of each image block of the target image can be obtained.
In addition, it is understood that each pixel in the third and fourth feature maps has a mapping relationship with the pixels of the corresponding first and second image blocks, and this mapping can be determined from the convolution operations of the convolutional layers of the model. Therefore, based on this mapping and the pixel class supervision information of each pixel in the first and second image blocks, the class of each pixel in the third and fourth feature maps can be known.
And adding the feature vectors corresponding to the pixels belonging to the target class in the third feature map and the fourth feature map, and taking the sum as a class central feature vector corresponding to the target class in the training process. The target category refers to each of a plurality of categories for identifying the model, so that category center feature vectors respectively corresponding to the categories in the local training process are obtained.
In practical applications, the class center feature vector of each category may be computed once for each pair of image blocks input into the model. Assume that the model has already been trained many times before the first image block and the second image block are input. For the target category, the class center feature vector obtained in the current training process may be added to the class center feature vectors obtained for the target category in a number of previous training processes, and the summed class center feature vector is then used to update the feature vectors of the pixels corresponding to the target category in the third feature map and the fourth feature map respectively, so as to obtain the eighth feature map and the ninth feature map. That is, for any pixel in the third feature map corresponding to the target category, the feature vector of that pixel in the third feature map is added to the summed class center feature vector of the target category, and the result is taken as the feature vector of that pixel in the eighth feature map.
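As a rough illustration of how the class center feature vectors can be computed and used to update the feature maps, the following PyTorch-style sketch is given; the function name, the nearest-neighbour downsampling of the supervision to feature-map resolution, and the moving-average accumulation of the centers are illustrative assumptions rather than the exact procedure of this embodiment.

```python
import torch
import torch.nn.functional as F

def update_with_class_centers(feat_a, feat_b, labels_a, labels_b,
                              center_bank, num_classes):
    """Hypothetical sketch: compute per-class center vectors from two feature
    maps of shape (C, H, W), accumulate them into a running bank, and add the
    accumulated center back to the pixels of each class."""
    c, h, w = feat_a.shape
    # Downsample the pixel-level supervision to feature-map resolution
    # (nearest neighbour keeps hard class labels intact).
    labels_a = F.interpolate(labels_a[None, None].float(), size=(h, w),
                             mode="nearest").long()[0, 0]
    labels_b = F.interpolate(labels_b[None, None].float(), size=(h, w),
                             mode="nearest").long()[0, 0]

    out_a, out_b = feat_a.clone(), feat_b.clone()
    for cls in range(num_classes):
        mask_a, mask_b = labels_a == cls, labels_b == cls
        if not (mask_a.any() or mask_b.any()):
            continue
        # Sum of the feature vectors of all pixels of this class in both maps:
        # the class center feature vector of the current training process.
        center = feat_a[:, mask_a].sum(dim=1) + feat_b[:, mask_b].sum(dim=1)
        # Accumulate with the centers from previous processes (a simple
        # exponential moving average stands in for the weighted sum here).
        center_bank[cls] = 0.9 * center_bank[cls] + 0.1 * center
        # Update the pixels of this class with the accumulated center.
        out_a[:, mask_a] += center_bank[cls][:, None]
        out_b[:, mask_b] += center_bank[cls][:, None]
    return out_a, out_b
```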
Considering both the amount of computation and the occupation of storage resources, the number of class center feature vectors of the target category to be superimposed can be set reasonably, for example, by adding up the results obtained in the most recent 100 training processes.
In addition, it should be noted that in the early stage of model training the learning ability of the model is weak and the model parameters are not yet stable, so the accuracy of the learned class center feature vectors is poor. Therefore, taking the class center feature vector of the target category obtained in the current training process as an example, and assuming that the class center feature vectors of the target category obtained in the previous 99 training processes need to be superimposed, different weighting coefficients may be assigned to these 100 class center feature vectors, where a class center feature vector obtained in a later training process receives a higher weighting coefficient. In this way, the influence of the less accurate class center feature vectors learned in the early stage can be reduced.
In addition, in practical applications, two training-process counts may be set, denoted N1 and N2, with N1 much larger than N2, for example N1=100 and N2=5. For the class center feature vector of the target category obtained in the current training process, on the one hand, a weighted sum of this vector and the class center feature vectors of the target category obtained in the previous 99 training processes (N1 processes in total, including the current one) may be calculated; on the other hand, a sum (which may also be a weighted sum) of this vector and the class center feature vectors of the target category obtained in the previous 4 training processes (N2 processes in total, including the current one) may be calculated. Both summation results can be updated into the third feature map and the fourth feature map, and specifically they may be updated with different weights. For example, assume that the sum of the class center feature vectors over the N1=100 processes is denoted R1, the sum over the N2=5 processes is denoted R2, and the weighted sum of R1 and R2 according to the set weights is denoted R3. Then, for any pixel corresponding to the target category in the third feature map, the feature vector of that pixel in the third feature map can be added to R3, and the result is taken as the feature vector of that pixel in the eighth feature map.
Under the above parameter setting with N1 training processes, it can be understood that during model training only the class center feature vectors of each category obtained in the most recent N1 training processes need to be stored, and earlier results can be deleted.
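The dual-window accumulation with N1 and N2 could be organized, for example, as in the following sketch; the class name CenterHistory, the linearly increasing weighting coefficients, and the equal weights applied to R1 and R2 are assumptions made only for illustration.

```python
from collections import deque

import torch

class CenterHistory:
    """Hypothetical sketch of the dual-window accumulation: keep the class
    center vectors of the last N1 training processes, combine a long window
    (N1) and a short window (N2) with set weights, and return the vector R3
    that is added to the pixels of the class."""

    def __init__(self, n1=100, n2=5, w_long=0.5, w_short=0.5):
        self.n1, self.n2 = n1, n2
        self.w_long, self.w_short = w_long, w_short
        self.history = deque(maxlen=n1)  # only the most recent N1 results are stored

    def update(self, center_vec: torch.Tensor) -> torch.Tensor:
        self.history.append(center_vec)
        # Later processes get higher weights to damp the inaccurate early centers.
        weights = torch.linspace(0.1, 1.0, steps=len(self.history))
        stacked = torch.stack(list(self.history))      # (k, C)
        r1 = (weights[:, None] * stacked).sum(dim=0)   # long-window weighted sum
        r2 = stacked[-self.n2:].sum(dim=0)             # short-window plain sum
        return self.w_long * r1 + self.w_short * r2    # R3
```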
Fig. 10 is a flowchart of an image segmentation model training method according to an embodiment of the present invention. As shown in fig. 10, the method may include the following steps:
1001. and partitioning the target image to obtain a plurality of image blocks, wherein the target image is a training sample image of an image segmentation model, and the image segmentation model comprises a first sub-model and a second sub-model.
1002. Respectively extracting features of a first image block and a second image block in the plurality of image blocks through a second sub-model to obtain a first feature map and a second feature map, wherein the first image block and the second image block are a pair of image blocks input in the training process; and performing down-sampling processing on the target image, and performing feature extraction on the down-sampled target image through the first sub-model to obtain a fifth feature map.
1003. And determining a convolution range corresponding to a target convolution kernel in the first feature map and the second feature map according to the position relation of the first image block and the second image block and the set expansion rate, wherein the target convolution kernel corresponds to the expansion rate.
1004. And performing hole convolution processing in the convolution range by adopting a target convolution kernel to obtain a third feature map obtained after the first feature map is updated and a fourth feature map obtained after the second feature map is updated.
1005. And according to the positions of the first image block and the second image block in the target image, a sixth feature map corresponding to the first image block and a seventh feature map corresponding to the second image block are intercepted from the fifth feature map.
1006. And fusing the third feature map and the sixth feature map to obtain a tenth feature map, and fusing the fourth feature map and the seventh feature map to obtain an eleventh feature map.
1007. Determining the category corresponding to each pixel in the tenth feature map and the eleventh feature map according to the pixel category supervision information corresponding to the first image block and the second image block, and determining the sum result of the feature vectors corresponding to each pixel belonging to the target category in the tenth feature map and the eleventh feature map as the category center feature vector corresponding to the target category in the training process.
1008. And adding the class central feature vector corresponding to the target class in the training process and the class central feature vector corresponding to the target class in the previous training processes, updating the feature vectors of the pixels corresponding to the target class in the tenth feature map and the eleventh feature map respectively according to the added class central feature vector corresponding to the target class, and obtaining an eighth feature map and a ninth feature map.
1009. And determining the image segmentation result of the first image block and the second image block according to the eighth feature map and the ninth feature map.
1010. Determining a first loss corresponding to the first image block and a second loss corresponding to the second image block according to image segmentation results of the first image block and the second image block and pixel category supervision information corresponding to the first image block and the second image block respectively, determining a third loss corresponding to the first sub-model according to the pixel category supervision information corresponding to the target image, training the second sub-model according to a sum result of the first loss and the second loss, and training the first sub-model according to the third loss.
The image segmentation model trained by the embodiment can be used to execute the image segmentation scheme provided by the embodiment shown in fig. 7. The image segmentation model includes two branches, which are respectively referred to as the first sub-model and the second sub-model. The first feature extractor and the second feature extractor described in the foregoing embodiments are included in the corresponding submodels.
In this embodiment, the execution of steps 1007 to 1009 may refer to the related description in the embodiment shown in fig. 9, and is not repeated here.
In this embodiment, the image segmentation model further includes, in addition to the second sub-model for processing the image blocks, a first sub-model for processing the target image. For the first sub-model, the third loss may be calculated based on the pixel category supervision information of the target image and the image segmentation result output by the first sub-model, and the parameters of the first sub-model are then adjusted according to the third loss. The first loss and the second loss are used to adjust the model parameters of the second sub-model.
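A minimal sketch of how the three losses could be split between the two sub-models is given below; the function signature, the downsampling factor of 0.25, and the assumption that each sub-model outputs logits at the resolution of its supervision are illustrative only.

```python
import torch
import torch.nn.functional as F

def training_step(sub_model_1, sub_model_2, target_image, block_a, block_b,
                  labels_image, labels_a, labels_b, optim_1, optim_2):
    """Hypothetical sketch of one training process: the second sub-model is
    trained on the summed losses of the two image blocks, while the first
    sub-model is trained on its own loss over the downsampled target image."""
    # Second sub-model: segment the pair of image blocks.
    logits_a, logits_b = sub_model_2(block_a, block_b)
    loss_1 = F.cross_entropy(logits_a, labels_a)   # first loss (first image block)
    loss_2 = F.cross_entropy(logits_b, labels_b)   # second loss (second image block)
    (loss_1 + loss_2).backward()                   # summed loss drives back propagation
    optim_2.step()
    optim_2.zero_grad()

    # First sub-model: segment the downsampled target image.
    small = F.interpolate(target_image, scale_factor=0.25, mode="bilinear",
                          align_corners=False)
    small_labels = F.interpolate(labels_image[:, None].float(),
                                 scale_factor=0.25, mode="nearest").long()[:, 0]
    logits_full = sub_model_1(small)
    loss_3 = F.cross_entropy(logits_full, small_labels)  # third loss
    loss_3.backward()
    optim_1.step()
    optim_1.zero_grad()
    return loss_1.item(), loss_2.item(), loss_3.item()
```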
In conclusion, in the model training process, the cross-block hole convolution, the global features of the downsampled target image, and the class center features are used in combination, which ensures that an image segmentation model with good performance is obtained while reducing the amount of computation and the video memory occupation in the training process.
The image segmentation method provided by the embodiment of the invention can be executed in the cloud, where a plurality of computing nodes (cloud servers) can be deployed, each having processing resources for computation, storage, and the like. In the cloud, a plurality of computing nodes may be organized to provide a service, and of course one computing node may also provide one or more services. The cloud can provide a service interface to the outside, and a user calls the service interface to use the corresponding service. The service interface may take the form of a Software Development Kit (SDK), an Application Programming Interface (API), or the like.
According to the scheme provided by the embodiment of the invention, the cloud can be provided with a service interface of the image segmentation service, and a user calls the service interface through user equipment to trigger an image segmentation request to the cloud, where the request contains a target image. The cloud determines the computing node that responds to the request and, using the processing resources of that computing node, performs the following steps (a rough sketch of this flow is given after the list of steps):
partitioning a target image to obtain a plurality of image blocks;
respectively extracting features of a first image block and a second image block in the plurality of image blocks to obtain a first feature map and a second feature map;
determining a convolution range corresponding to a target convolution kernel in the first feature map and the second feature map according to the position relation of the first image block and the second image block and a set expansion rate, wherein the target convolution kernel corresponds to the expansion rate;
performing hole convolution processing in the convolution range by using the target convolution kernel to obtain a third feature map after updating the first feature map and a fourth feature map after updating the second feature map;
determining image segmentation results of the first image block and the second image block according to the third feature map and the fourth feature map;
determining an image segmentation result of the target image according to the image segmentation results of the image blocks;
and feeding back the target image marked with the image segmentation result to the user equipment.
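At the top level, the cloud-side flow listed above could be organized roughly as in the following sketch; the block pairing rule, the function names, and the stitching of per-block results are assumptions for illustration rather than the exact implementation.

```python
import torch

def segment_large_image(image, model, block_size):
    """Hypothetical top-level sketch of the cloud-side flow: partition the target
    image into blocks, run the (assumed) block-pair segmentation model, and stitch
    the per-block results back into a full-resolution segmentation map."""
    _, _, h, w = image.shape
    blocks, boxes = [], []
    for y in range(0, h, block_size):
        for x in range(0, w, block_size):
            blocks.append(image[:, :, y:y + block_size, x:x + block_size])
            boxes.append((y, x))

    result = torch.zeros(1, h, w, dtype=torch.long)
    # Pairs of blocks are processed together so that the cross-block hole
    # convolution can share context between them (pairing rule assumed here;
    # an odd trailing block would need separate handling, omitted in this sketch).
    for i in range(0, len(blocks) - 1, 2):
        seg_a, seg_b = model(blocks[i], blocks[i + 1])   # per-block class maps
        for seg, (y, x) in ((seg_a, boxes[i]), (seg_b, boxes[i + 1])):
            result[:, y:y + seg.shape[-2], x:x + seg.shape[-1]] = seg
    return result
```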
The above implementation process may refer to the related descriptions in the foregoing other embodiments, which are not described herein.
For ease of understanding, the description is exemplified in conjunction with fig. 11. The user may invoke the image segmentation service via user device E1 illustrated in fig. 11 to upload a service request containing the target image. In the cloud, as shown in the figure, it is assumed that the image segmentation service is provided by a service cluster E2, and at least one computing node is included in the service cluster E2. After receiving the request, the service cluster E2 executes the steps described in the foregoing embodiments to obtain the target image labeled with the image segmentation result, and sends the target image labeled with the image segmentation result to the user device E1. The user device E1 displays the target image, on the basis of which the user can perform further editing and the like.
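From the user-equipment side, invoking the service interface could look roughly like the following sketch; the endpoint URL, the field name, and the REST-style transport are purely hypothetical and are not defined by this embodiment.

```python
import requests  # hypothetical REST-style invocation of the image segmentation service

def request_segmentation(image_path: str, endpoint: str) -> bytes:
    """Hypothetical sketch: upload a target image to the cloud service interface
    and receive the target image labeled with the image segmentation result."""
    with open(image_path, "rb") as f:
        resp = requests.post(endpoint, files={"target_image": f}, timeout=60)
    resp.raise_for_status()
    return resp.content  # segmented image returned by the computing node

# Example (assumed endpoint, not part of the patent):
# segmented = request_segmentation("scene.png", "https://example.com/api/segment")
```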
An image segmentation apparatus according to one or more embodiments of the present invention will be described in detail below. Those skilled in the art will appreciate that these means can each be constructed using commercially available hardware components and by performing the steps taught in this disclosure.
Fig. 12 is a schematic structural diagram of an image segmentation apparatus according to an embodiment of the present invention, and as shown in fig. 12, the apparatus includes: a partitioning module 11, an extraction module 12, a convolution module 13, and a segmentation module 14.
The partitioning module 11 is configured to partition the target image to obtain a plurality of image blocks.
The extraction module 12 is configured to perform feature extraction on a first image block and a second image block in the plurality of image blocks, respectively, to obtain a first feature map and a second feature map.
A convolution module 13, configured to determine, according to a position relationship between the first image block and the second image block and a set expansion rate, a convolution range corresponding to a target convolution kernel in the first feature map and the second feature map, where the target convolution kernel corresponds to the expansion rate; and performing hole convolution processing in the convolution range by using the target convolution kernel to obtain a third feature map after the first feature map is updated and a fourth feature map after the second feature map is updated.
A segmentation module 14, configured to determine image segmentation results of the first image block and the second image block according to the third feature map and the fourth feature map; and determining the image segmentation result of the target image according to the image segmentation results of the image blocks.
Optionally, the convolution module 13 may specifically be configured to: determining a first area in the first feature map and a second area in the second feature map according to a first distance between the first image block and the second image block and a second distance corresponding to the expansion rate, wherein the first distance is smaller than the second distance, and at least one pair of pixels in the first area and the second area falls into a convolution window of the target convolution kernel; and determining the first area and the second area as convolution ranges corresponding to the target convolution kernels.
Optionally, the convolution module 13 may be specifically configured to: determining a convolution window corresponding to a first pixel within the first region; determining a second pixel within the second region that falls within the convolution window; determining a first weight corresponding to the first pixel and a second weight corresponding to the second pixel in the target convolution kernel according to the relative position between the first image block and the second image block and/or the distance between the first pixel and the second pixel; performing convolution processing on the first pixel and the second pixel according to the first weight and the second weight to obtain an updated feature vector corresponding to the first pixel; and generating the third feature map according to the updated feature vector corresponding to each pixel in the first region.
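To make the cross-block hole convolution concrete, a naive loop-based sketch is given below; the horizontal adjacency of the two blocks, the concatenation along the width axis, and the distance-based down-weighting of samples taken from the second feature map are simplifying assumptions, not the exact formulation of the convolution module.

```python
import torch

def cross_block_hole_conv(feat_a, feat_b, kernel, dilation):
    """Hypothetical sketch of the cross-block hole (dilated) convolution for two
    horizontally adjacent feature maps feat_a (left) and feat_b (right), each of
    shape (C, H, W). For pixels of feat_a whose dilated window crosses the right
    border, the missing samples are read from feat_b instead of being zero-padded."""
    c, h, w = feat_a.shape
    k = kernel.shape[-1]                        # kernel has shape (C_out, C, k, k)
    # Concatenate along the width so the convolution window can span both blocks.
    joint = torch.cat([feat_a, feat_b], dim=2)  # (C, H, 2W)
    out = torch.zeros(kernel.shape[0], h, w)
    for y in range(h):
        for x in range(w):
            acc = torch.zeros(kernel.shape[0])
            for i in range(k):
                for j in range(k):
                    yy = y + (i - k // 2) * dilation
                    xx = x + (j - k // 2) * dilation
                    if 0 <= yy < h and 0 <= xx < 2 * w:
                        wgt = kernel[:, :, i, j]
                        if xx >= w:
                            # Sample comes from the second image block: down-weight
                            # by the normalized distance between the pixel pair,
                            # as one possible weighting rule.
                            dist = ((xx - x) ** 2 + (yy - y) ** 2) ** 0.5
                            wgt = wgt / (1.0 + dist / dilation)
                        acc += wgt @ joint[:, yy, xx]
                    # Positions outside both blocks contribute nothing.
            out[:, y, x] = acc
    return out
```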
Optionally, the extracting module 12 is further configured to: performing downsampling processing on the target image; performing feature extraction on the downsampled target image to obtain a fifth feature map; according to the positions of the first image block and the second image block in the target image, a sixth feature map corresponding to the first image block and a seventh feature map corresponding to the second image block are obtained from the fifth feature map; and fusing the third feature map and the sixth feature map, and fusing the fourth feature map and the seventh feature map. The segmentation module 14 is specifically configured to: and determining the image segmentation result of the first image block and the second image block according to the fused feature map.
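One possible realization of the crop-and-fuse step is sketched below; the channel concatenation is only one assumed fusion operation, and the block coordinates are taken as already mapped into the resolution of the global feature map.

```python
import torch
import torch.nn.functional as F

def fuse_local_and_global(local_feat, global_feat, block_box):
    """Hypothetical sketch: crop from the global (downsampled-image) feature map
    the region corresponding to one image block, resize it to the local feature
    map's resolution, and fuse the two by channel concatenation."""
    x0, y0, x1, y1 = block_box                 # block position mapped into the global map
    crop = global_feat[:, :, y0:y1, x0:x1]     # sixth / seventh feature map
    crop = F.interpolate(crop, size=local_feat.shape[-2:], mode="bilinear",
                         align_corners=False)
    return torch.cat([local_feat, crop], dim=1)  # fused feature map (channel concat)
```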
Optionally, the set expansion ratio comprises a plurality of different expansion ratios, and the target convolution kernel comprises a convolution kernel corresponding to each of the plurality of different expansion ratios.
Optionally, the extracting module 12 is specifically configured to: and determining the first image block and the second image block according to the expansion ratio, wherein a first distance between the first image block and the second image block is smaller than a second distance corresponding to the expansion ratio.
Optionally, the first feature map and the second feature map are extracted by an image segmentation model, the target image is a training sample image of the image segmentation model, and the first image block and the second image block are a pair of image blocks input in the current training process. The device further comprises: the training module is used for determining the corresponding category of each pixel in the third feature map and the fourth feature map according to the pixel category supervision information corresponding to the first image block and the second image block respectively; determining a sum result of feature vectors corresponding to pixels belonging to a target category in the third feature map and the fourth feature map, and taking the sum result as a category central feature vector corresponding to the target category in the training process; adding the class central feature vector corresponding to the target class in the training process and the class central feature vector corresponding to the target class in the previous training processes; respectively updating the feature vectors of the pixels corresponding to the target category in the third feature map and the fourth feature map by using the added category center feature vector corresponding to the target category to obtain an eighth feature map and a ninth feature map; determining image segmentation results of the first image block and the second image block according to the eighth feature map and the ninth feature map; determining a first loss corresponding to the first image block and a second loss corresponding to the second image block according to image segmentation results of the first image block and the second image block and pixel class supervision information corresponding to the first image block and the second image block respectively; and training the image segmentation model according to the sum result of the first loss and the second loss.
The apparatus shown in fig. 12 can perform the steps in the foregoing embodiments, and the detailed performing process and technical effects refer to the descriptions in the foregoing embodiments, which are not described herein again.
In one possible design, the structure of the image segmentation apparatus shown in fig. 12 can be implemented as an electronic device. As shown in fig. 13, the electronic device may include: a processor 21, a memory 22, and a communication interface 23. Wherein the memory 22 has stored thereon executable code which, when executed by the processor 21, makes the processor 21 at least capable of implementing the image segmentation method as provided in the previous embodiments.
In addition, embodiments of the present invention provide a non-transitory machine-readable storage medium having stored thereon executable code, which when executed by a processor of an electronic device, causes the processor to at least implement an image segmentation method as provided in the foregoing embodiments.
In an alternative embodiment, the electronic device for performing the image segmentation method provided by the embodiment of the present invention may be an Extended Reality (XR) device. XR is a generic term covering virtual reality, augmented reality, and other related forms.
Optionally, the XR device may, for example, be deployed on an unmanned aerial vehicle (drone), so that after a camera on the drone captures an ultra-high-resolution target image, the image can be transmitted to the XR device, which performs the image segmentation processing on it. The target image labeled with the segmentation result can then be transmitted to the control device on the ground, so that timely semantic segmentation of the target image is achieved. Based on this, the ground control device can promptly send a corresponding control instruction to the drone according to the segmentation result.
The above-described apparatus embodiments are merely illustrative, wherein the units described as separate components may or may not be physically separate. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of the present embodiment. One of ordinary skill in the art can understand and implement it without inventive effort.
Through the above description of the embodiments, those skilled in the art will clearly understand that each embodiment can be implemented by adding a necessary general hardware platform, and of course can also be implemented by a combination of hardware and software. Based on this understanding, the part of the above technical solutions that in essence contributes to the prior art may be embodied in the form of a computer program product, which may be embodied on one or more computer-usable storage media (including, without limitation, disk storage, CD-ROM, and optical storage) having computer-usable program code embodied therein.
Finally, it should be noted that: the above examples are only intended to illustrate the technical solution of the present invention, but not to limit it; although the present invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; and such modifications or substitutions do not depart from the spirit and scope of the corresponding technical solutions of the embodiments of the present invention.

Claims (13)

1. An image segmentation method, comprising:
partitioning a target image to obtain a plurality of image blocks;
respectively extracting features of a first image block and a second image block in the plurality of image blocks to obtain a first feature map and a second feature map;
determining a convolution range corresponding to a target convolution kernel in the first feature map and the second feature map according to the position relation of the first image block and the second image block and a set expansion rate, wherein the target convolution kernel corresponds to the expansion rate;
performing hole convolution processing in the convolution range by using the target convolution kernel to obtain a third feature map after updating the first feature map and a fourth feature map after updating the second feature map;
determining image segmentation results of the first image block and the second image block according to the third feature map and the fourth feature map;
and determining the image segmentation result of the target image according to the image segmentation results of the image blocks.
2. The method according to claim 1, wherein the determining a convolution range corresponding to a target convolution kernel in the first feature map and the second feature map according to the position relationship of the first image block and the second image block and a set expansion rate comprises:
determining a first area in the first feature map and a second area in the second feature map according to a first distance between the first image block and the second image block and a second distance corresponding to the expansion rate, wherein the first distance is smaller than the second distance, and at least one pair of pixels in the first area and the second area falls into a convolution window of the target convolution kernel;
and determining the first area and the second area as convolution ranges corresponding to the target convolution kernels.
3. The method according to claim 2, wherein the performing, with the target convolution kernel, a hole convolution process in the convolution range to obtain the updated third feature map of the first feature map comprises:
determining a convolution window corresponding to a first pixel within the first region;
determining a second pixel within the second region that falls within the convolution window;
determining a first weight corresponding to the first pixel and a second weight corresponding to the second pixel in the target convolution kernel according to the relative position between the first image block and the second image block and/or the distance between the first pixel and the second pixel;
performing convolution processing on the first pixel and the second pixel according to the first weight and the second weight to obtain an updated feature vector corresponding to the first pixel;
and generating the third feature map according to the updated feature vector corresponding to each pixel in the first region.
4. The method of claim 1, further comprising:
performing downsampling processing on the target image;
performing feature extraction on the downsampled target image to obtain a fifth feature map;
according to the positions of the first image block and the second image block in the target image, a sixth feature map corresponding to the first image block and a seventh feature map corresponding to the second image block are obtained from the fifth feature map;
fusing the third feature map with the sixth feature map, and fusing the fourth feature map with the seventh feature map;
determining an image segmentation result of the first image block and the second image block according to the third feature map and the fourth feature map, including:
and determining the image segmentation result of the first image block and the second image block according to the fused feature map.
5. The method of any one of claims 1 to 4, wherein the set expansion rate comprises a plurality of different expansion rates, and wherein the target convolution kernel comprises a convolution kernel corresponding to each of the plurality of different expansion rates.
6. The method according to any one of claims 1 to 4, further comprising:
and determining the first image block and the second image block according to the expansion rate, wherein a first distance between the first image block and the second image block is smaller than a second distance corresponding to the expansion rate.
7. The method according to any one of claims 1 to 4, wherein the first feature map and the second feature map are extracted by an image segmentation model, the target image is a training sample image of the image segmentation model, and the first image block and the second image block are a pair of image blocks input in the current training process;
the method further comprises the following steps:
determining the category corresponding to each pixel in the third feature map and the fourth feature map according to the pixel category supervision information corresponding to the first image block and the second image block respectively;
determining a summation result of feature vectors corresponding to each pixel belonging to a target category in the third feature map and the fourth feature map, and taking the summation result as a category central feature vector corresponding to the target category in the training process;
adding the class central feature vector corresponding to the target class in the training process and the class central feature vector corresponding to the target class in the previous training processes;
respectively updating the feature vectors of the pixels corresponding to the target category in the third feature map and the fourth feature map by using the added category center feature vector corresponding to the target category to obtain an eighth feature map and a ninth feature map;
determining an image segmentation result of the first image block and the second image block according to the third feature map and the fourth feature map, including:
and determining the image segmentation result of the first image block and the second image block according to the eighth feature map and the ninth feature map.
8. The method of claim 7, further comprising:
determining a first loss corresponding to the first image block and a second loss corresponding to the second image block according to image segmentation results of the first image block and the second image block and pixel class supervision information corresponding to the first image block and the second image block respectively;
and training the image segmentation model according to the sum result of the first loss and the second loss.
9. An image segmentation apparatus, comprising:
the partitioning module is used for partitioning the target image to obtain a plurality of image blocks;
the extraction module is used for respectively extracting features of a first image block and a second image block in the plurality of image blocks to obtain a first feature map and a second feature map;
a convolution module, configured to determine, according to a position relationship between the first image block and the second image block and a set expansion rate, a convolution range corresponding to a target convolution kernel in the first feature map and the second feature map, where the target convolution kernel corresponds to the expansion rate; performing hole convolution processing in the convolution range by using the target convolution kernel to obtain a third feature map after updating the first feature map and a fourth feature map after updating the second feature map;
the segmentation module is used for determining image segmentation results of the first image block and the second image block according to the third feature map and the fourth feature map; and determining the image segmentation result of the target image according to the image segmentation results of the image blocks.
10. An electronic device, comprising: a memory, a processor, a communication interface; wherein the memory has stored thereon executable code which, when executed by the processor, causes the processor to perform the image segmentation method according to any one of claims 1 to 8.
11. A non-transitory machine-readable storage medium having stored thereon executable code, which when executed by a processor of an electronic device, causes the processor to perform the image segmentation method of any one of claims 1 to 8.
12. An image segmentation method, comprising:
receiving a request triggered by user equipment by calling an image segmentation service, wherein the request comprises a target image;
executing the following steps by utilizing the processing resource corresponding to the image segmentation service:
partitioning a target image to obtain a plurality of image blocks;
respectively extracting features of a first image block and a second image block in the plurality of image blocks to obtain a first feature map and a second feature map;
determining convolution ranges corresponding to target convolution kernels in the first feature map and the second feature map according to the position relation of the first image block and the second image block and a set expansion rate, wherein the target convolution kernels correspond to the expansion rate;
performing hole convolution processing in the convolution range by using the target convolution kernel to obtain a third feature map after updating the first feature map and a fourth feature map after updating the second feature map;
determining image segmentation results of the first image block and the second image block according to the third feature map and the fourth feature map;
determining an image segmentation result of the target image according to the image segmentation results of the image blocks;
and feeding back the target image marked with the image segmentation result to the user equipment.
13. An image segmentation method applied to an augmented reality device includes:
partitioning the obtained target image to obtain a plurality of image blocks;
respectively extracting features of a first image block and a second image block in the plurality of image blocks to obtain a first feature map and a second feature map;
determining a convolution range corresponding to a target convolution kernel in the first feature map and the second feature map according to the position relation of the first image block and the second image block and a set expansion rate, wherein the target convolution kernel corresponds to the expansion rate;
performing hole convolution processing in the convolution range by using the target convolution kernel to obtain a third feature map after updating the first feature map and a fourth feature map after updating the second feature map;
determining image segmentation results of the first image block and the second image block according to the third feature map and the fourth feature map;
determining an image segmentation result of the target image according to the image segmentation results of the image blocks;
displaying the target image marked with the image segmentation result.
CN202210710965.4A 2022-06-22 2022-06-22 Image segmentation method, device, equipment and storage medium Pending CN114913182A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210710965.4A CN114913182A (en) 2022-06-22 2022-06-22 Image segmentation method, device, equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210710965.4A CN114913182A (en) 2022-06-22 2022-06-22 Image segmentation method, device, equipment and storage medium

Publications (1)

Publication Number Publication Date
CN114913182A true CN114913182A (en) 2022-08-16

Family

ID=82772125

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210710965.4A Pending CN114913182A (en) 2022-06-22 2022-06-22 Image segmentation method, device, equipment and storage medium

Country Status (1)

Country Link
CN (1) CN114913182A (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116342884A (en) * 2023-03-28 2023-06-27 阿里云计算有限公司 Image segmentation and model training method and server
CN116342884B (en) * 2023-03-28 2024-02-06 阿里云计算有限公司 Image segmentation and model training method and server
CN116630629A (en) * 2023-07-21 2023-08-22 深圳思谋信息科技有限公司 Domain adaptation-based semantic segmentation method, device, equipment and storage medium
CN116630629B (en) * 2023-07-21 2023-11-03 深圳思谋信息科技有限公司 Domain adaptation-based semantic segmentation method, device, equipment and storage medium

Similar Documents

Publication Publication Date Title
CN114913182A (en) Image segmentation method, device, equipment and storage medium
US11488346B2 (en) Picture rendering method and apparatus, storage medium, and electronic apparatus
JP2022518322A (en) Semantic segmentation with soft cross entropy loss
CN110956646B (en) Target tracking method, device, equipment and storage medium
CN110942484B (en) Camera self-motion estimation method based on occlusion perception and feature pyramid matching
CN111054080A (en) Method, device and equipment for intelligently detecting perspective plug-in and storage medium thereof
CN114724021B (en) Data identification method and device, storage medium and electronic device
CN111652181B (en) Target tracking method and device and electronic equipment
US20240037898A1 (en) Method for predicting reconstructabilit, computer device and storage medium
CN114170290A (en) Image processing method and related equipment
CN112686317A (en) Neural network training method and device, electronic equipment and storage medium
CN113052907A (en) Positioning method of mobile robot in dynamic environment
JP2023131117A (en) Joint perception model training, joint perception method, device, and medium
CN115375986A (en) Model distillation method and device
CN108010052A (en) Method for tracking target and system, storage medium and electric terminal in complex scene
CN112013820B (en) Real-time target detection method and device for deployment of airborne platform of unmanned aerial vehicle
CN115965736A (en) Image processing method, device, equipment and storage medium
CN113469930B (en) Image processing method and device and computer equipment
CN115937822A (en) Positioning and mapping method and device, electronic equipment and readable storage medium
US20210150723A1 (en) Image processing device, image processing method, and program
AU2017300877B2 (en) Method and device for aiding the navigation of a vehicle
CN114170421B (en) Image detection method, device, equipment and storage medium
CN115841151B (en) Model training method, device, electronic equipment and computer readable medium
CN115661585B (en) Image recognition method and related device
CN112419176B (en) Single-loop transmission channel lead positive shooting image point cloud enhancement method and device

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
TA01 Transfer of patent application right

Effective date of registration: 20240131

Address after: Room 553, 5th Floor, Building 3, No. 969 Wenyi West Road, Wuchang Street, Yuhang District, Hangzhou City, Zhejiang Province, 311121

Applicant after: Hangzhou Alibaba Cloud Feitian Information Technology Co.,Ltd.

Country or region after: China

Address before: 310023 Room 516, floor 5, building 3, No. 969, Wenyi West Road, Wuchang Street, Yuhang District, Hangzhou City, Zhejiang Province

Applicant before: Alibaba Dharma Institute (Hangzhou) Technology Co.,Ltd.

Country or region before: China

TA01 Transfer of patent application right