CN111179283A - Image semantic segmentation method and device and storage medium - Google Patents

Image semantic segmentation method and device and storage medium

Info

Publication number
CN111179283A
CN111179283A (application CN201911397645.2A)
Authority
CN
China
Prior art keywords
image
feature
characteristic
images
target
Prior art date
Legal status
Pending
Application number
CN201911397645.2A
Other languages
Chinese (zh)
Inventor
张展鹏
成慧
张凯鹏
Current Assignee
Shenzhen Sensetime Technology Co Ltd
Original Assignee
Shenzhen Sensetime Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by Shenzhen Sensetime Technology Co Ltd
Priority to CN201911397645.2A (published as CN111179283A)
Priority to JP2021521414A (published as JP2022518647A)
Priority to KR1020217011555A (published as KR20210088546A)
Priority to PCT/CN2020/084050 (published as WO2021134970A1)
Priority to TW109114127A (published as TWI728791B)
Publication of CN111179283A

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 7/00 Image analysis
    • G06T 7/10 Segmentation; Edge detection
    • G06T 7/11 Region-based segmentation
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 7/00 Image analysis
    • G06T 7/10 Segmentation; Edge detection
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00 Arrangements for image or video recognition or understanding
    • G06V 10/40 Extraction of image or video features
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 2207/00 Indexing scheme for image analysis or image enhancement
    • G06T 2207/10 Image acquisition modality
    • G06T 2207/10004 Still image; Photographic image
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 2207/00 Indexing scheme for image analysis or image enhancement
    • G06T 2207/10 Image acquisition modality
    • G06T 2207/10024 Color image
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 2207/00 Indexing scheme for image analysis or image enhancement
    • G06T 2207/20 Special algorithmic details
    • G06T 2207/20081 Training; Learning
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 2207/00 Indexing scheme for image analysis or image enhancement
    • G06T 2207/20 Special algorithmic details
    • G06T 2207/20084 Artificial neural networks [ANN]
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 2207/00 Indexing scheme for image analysis or image enhancement
    • G06T 2207/20 Special algorithmic details
    • G06T 2207/20112 Image segmentation details

Abstract

The present disclosure provides an image semantic segmentation method and apparatus, and a storage medium. The method includes: performing feature extraction on an acquired image to be processed to obtain a first feature image; synchronously extracting a plurality of context features with different ranges from the first feature image to obtain a plurality of second feature images; determining a target image according to at least the plurality of second feature images, and taking the target image as a new first feature image from which a plurality of context features with different ranges are again synchronously extracted; and, in response to the number of times that context features with different ranges have been synchronously extracted from the first feature image reaching a target number of times, generating a semantic image corresponding to the image to be processed based on the most recently obtained target image. By synchronously extracting context features with different ranges from the feature image corresponding to the image to be processed multiple times, the method fully fuses context information at different scales and improves semantic segmentation precision.

Description

Image semantic segmentation method and device and storage medium
Technical Field
The present disclosure relates to the field of deep learning, and in particular, to an image semantic segmentation method and apparatus, and a storage medium.
Background
For mobile machines, semantic segmentation can be performed on images captured by an on-board camera to obtain a semantic understanding of the scene, enabling functions such as obstacle avoidance and navigation.
Semantic segmentation techniques based on deep learning have made major breakthroughs, but they typically rely on relatively complex deep neural networks and consume considerable computing resources. On the one hand, the computing resources of mobile machines are often limited for cost and mobility reasons. On the other hand, mobile machines need to interact with the real-world environment in real time. How to perform real-time semantic segmentation under limited computing resources is therefore a challenging technical problem.
Disclosure of Invention
The disclosure provides an image semantic segmentation method and device and a storage medium.
According to a first aspect of the embodiments of the present disclosure, there is provided an image semantic segmentation method, including: performing feature extraction on an acquired image to be processed to obtain a first feature image; synchronously extracting a plurality of context features with different ranges from the first feature image to obtain a plurality of second feature images; determining a target image according to at least the plurality of second feature images, and taking the target image as a new first feature image from which a plurality of context features with different ranges are again synchronously extracted; and, in response to the number of times that context features with different ranges have been synchronously extracted from the first feature image reaching a target number of times, generating a semantic image corresponding to the image to be processed based on the most recently obtained target image.
In some optional embodiments, synchronously extracting a plurality of context features with different ranges from the first feature image to obtain a plurality of second feature images includes: synchronously performing dimension reduction on the first feature image over a plurality of channels to obtain a plurality of third feature images; and extracting context features with different ranges from at least two of the third feature images to obtain the plurality of second feature images.
In some optional embodiments, extracting context features with different ranges from at least two of the third feature images to obtain the second feature images includes: extracting context features with different ranges from at least two of the third feature images by using depthwise-separable dilated (atrous) convolutions whose kernels correspond to different dilation rates, to obtain the plurality of second feature images.
In some optional embodiments, determining the target image based on at least the plurality of second feature images includes: fusing at least the plurality of second feature images to obtain a fourth feature image; and determining the target image at least according to the fourth feature image.
In some optional embodiments, fusing at least the plurality of second feature images to obtain the fourth feature image includes: superimposing the plurality of second feature images to obtain the fourth feature image; or superimposing the plurality of second feature images with at least one of the plurality of third feature images to obtain the fourth feature image.
In some optional embodiments, determining the target image based on at least the fourth feature image includes: up-sampling the fourth feature image to obtain the target image; or performing sub-pixel convolution on the fourth feature image to obtain the target image.
In some optional embodiments, the method further includes: performing feature extraction and dimension reduction on the image to be processed to obtain a fifth feature image, where the number of feature-extraction layers corresponding to the fifth feature image is smaller than the number corresponding to the first feature image. Determining the target image at least according to the fourth feature image then includes: when the number of times is smaller than the target number of times, superimposing the fourth feature image and the fifth feature image and up-sampling the result to obtain the target image; or, when the number of times is smaller than the target number of times, superimposing an image obtained by performing sub-pixel convolution on the fourth feature image with the fifth feature image to obtain the target image.
In some optional embodiments, the dimension corresponding to the most recently obtained target image is a target dimension, where the target dimension is determined according to the total number of object classes included in the preset semantic image.
In some optional embodiments, after generating the semantic image corresponding to the image to be processed, the method further includes: navigating a machine device according to the semantic image.
According to a second aspect of the embodiments of the present disclosure, there is provided an image semantic segmentation apparatus, including: a feature extraction module, configured to perform feature extraction on an acquired image to be processed to obtain a first feature image; a context feature extraction module, configured to synchronously extract a plurality of context features with different ranges from the first feature image to obtain a plurality of second feature images; a determining module, configured to determine a target image at least according to the plurality of second feature images, and take the target image as a new first feature image from which a plurality of context features with different ranges are again synchronously extracted; and a semantic image generating module, configured to generate, in response to the number of times that context features with different ranges have been synchronously extracted from the first feature image reaching a target number of times, a semantic image corresponding to the image to be processed based on the most recently obtained target image.
In some optional embodiments, the context feature extraction module includes: a first processing submodule, configured to synchronously perform dimension reduction on the first feature image over a plurality of channels to obtain a plurality of third feature images; and a second processing submodule, configured to extract context features with different ranges from at least two of the third feature images to obtain the plurality of second feature images.
In some optional embodiments, the second processing submodule is configured to extract context features with different ranges from at least two of the third feature images by using depthwise-separable dilated convolutions whose kernels correspond to different dilation rates, to obtain the plurality of second feature images.
In some optional embodiments, the determining module includes: a first determining submodule, configured to fuse at least the plurality of second feature images to obtain a fourth feature image; and a second determining submodule, configured to determine the target image at least according to the fourth feature image.
In some optional embodiments, the first determining submodule is configured to: superimpose the plurality of second feature images to obtain the fourth feature image; or superimpose the plurality of second feature images with at least one of the plurality of third feature images to obtain the fourth feature image.
In some optional embodiments, the second determining submodule is configured to: up-sample the fourth feature image to obtain the target image; or perform sub-pixel convolution on the fourth feature image to obtain the target image.
In some optional embodiments, the apparatus further includes: a processing module, configured to perform feature extraction and dimension reduction on the image to be processed to obtain a fifth feature image, where the number of feature-extraction layers corresponding to the fifth feature image is smaller than the number corresponding to the first feature image. The second determining submodule is configured to: when the number of times is smaller than the target number of times, superimpose the fourth feature image and the fifth feature image and up-sample the result to obtain the target image; or, when the number of times is smaller than the target number of times, superimpose an image obtained by performing sub-pixel convolution on the fourth feature image with the fifth feature image to obtain the target image.
In some optional embodiments, the dimension corresponding to the most recently obtained target image is a target dimension, where the target dimension is determined according to the total number of object classes included in the preset semantic image.
In some optional embodiments, the apparatus further includes: a navigation module, configured to navigate a machine device according to the semantic image.
According to a third aspect of the embodiments of the present disclosure, a computer-readable storage medium is provided, where the storage medium stores a computer program for executing the image semantic segmentation method according to any one of the first aspect.
According to a fourth aspect of the embodiments of the present disclosure, there is provided an image semantic segmentation apparatus, including: a processor; a memory for storing the processor-executable instructions; wherein the processor is configured to invoke executable instructions stored in the memory to implement the image semantic segmentation method of any one of the first aspect.
The technical scheme provided by the embodiment of the disclosure can have the following beneficial effects:
in the embodiments of the present disclosure, feature extraction may be performed on an acquired image to be processed to obtain a first feature image, and a plurality of context features with different ranges may then be synchronously extracted from the first feature image to obtain a plurality of second feature images. A target image is determined according to at least the plurality of second feature images, and the target image is taken as a new first feature image from which a plurality of context features with different ranges are again synchronously extracted. When the number of times that context features with different ranges have been synchronously extracted from the first feature image reaches the target number of times, a semantic image corresponding to the image to be processed can be generated by semantic segmentation based on the most recently obtained target image. By synchronously extracting context features with different ranges from the feature image corresponding to the image to be processed multiple times, the embodiments of the present disclosure fully fuse context information at different scales and improve semantic segmentation precision.
In the embodiments of the present disclosure, the first feature image may be subjected to dimension reduction over multiple channels to obtain multiple third feature images, and context features with different ranges may then be extracted from at least two of the third feature images to obtain the corresponding second feature images. This achieves synchronous extraction of context features with different ranges from the first feature image, which helps improve segmentation accuracy while reducing the amount of computation in the segmentation process.
In the embodiments of the present disclosure, depthwise-separable dilated convolutions whose kernels correspond to different dilation rates may be used to extract context features with different ranges from at least two of the third feature images, achieving synchronous extraction of differently ranged context features from the first feature image and improving segmentation accuracy.
In the embodiments of the present disclosure, the plurality of second feature images may be directly superimposed to obtain the fourth feature image, or the plurality of second feature images may further be superimposed with at least one of the third feature images to obtain the fourth feature image. This is highly usable, fuses information at more scales, and improves segmentation accuracy.
In the embodiments of the present disclosure, to maintain the dimensionality of the target image, the fourth feature image may be up-sampled to obtain the target image. Alternatively, sub-pixel convolution may be performed on the fourth feature image, which improves the segmentation effect and makes the segmentation result more accurate.
In the embodiments of the present disclosure, a fifth feature image may be acquired before the target image is determined. The fifth feature image is obtained by extracting low-dimensional image features from the image to be processed; the number of feature-extraction layers corresponding to the fifth feature image is smaller than the number corresponding to the first feature image. The fourth feature image and the fifth feature image are superimposed and then up-sampled to obtain the target image; when the number of times that context features with different ranges have been synchronously extracted from the first feature image reaches the target number of times, only the fourth feature image is up-sampled to obtain the target image. This reduces the risk that important features of the image to be processed are lost after dimension reduction, and improves segmentation accuracy.
In the embodiments of the present disclosure, the dimension of the most recently obtained target image is a target dimension, where the target dimension is determined according to the total number of object classes included in the preset semantic image. This ensures that the dimension of the final semantic image is consistent with that of the image to be processed.
In the embodiments of the present disclosure, machine-device navigation can be performed according to the generated semantic image corresponding to the image to be processed, with high usability.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the disclosure.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the present disclosure and together with the description, serve to explain the principles of the disclosure.
FIG. 1A is a color image according to an exemplary embodiment of the present disclosure;
FIG. 1B is a semantic image according to an exemplary embodiment of the present disclosure;
FIG. 2 is a flow chart illustrating an image semantic segmentation method according to an exemplary embodiment of the present disclosure;
FIG. 3 is a flow chart illustrating another image semantic segmentation method according to an exemplary embodiment of the present disclosure;
FIG. 4 is a diagram illustrating a scenario in which context features with different ranges are extracted, according to an exemplary embodiment of the present disclosure;
FIG. 5 is a flow chart illustrating another image semantic segmentation method according to an exemplary embodiment of the present disclosure;
FIG. 6 is a flow chart illustrating another image semantic segmentation method according to an exemplary embodiment of the present disclosure;
FIG. 7 is a schematic diagram of a neural network architecture for obtaining semantic images according to an exemplary embodiment of the present disclosure;
FIG. 8A is an architectural schematic diagram of a back-end network according to an exemplary embodiment of the present disclosure;
FIG. 8B is an architectural schematic diagram of another back-end network according to an exemplary embodiment of the present disclosure;
FIG. 8C is an architectural schematic diagram of another back-end network according to an exemplary embodiment of the present disclosure;
FIG. 8D is an architectural schematic diagram of another back-end network according to an exemplary embodiment of the present disclosure;
FIG. 9 is a flow chart illustrating another image semantic segmentation method according to an exemplary embodiment of the present disclosure;
FIG. 10 is a block diagram of an image semantic segmentation apparatus according to an exemplary embodiment of the present disclosure;
FIG. 11 is a schematic structural diagram of an image semantic segmentation device according to an exemplary embodiment of the present disclosure.
Detailed Description
Reference will now be made in detail to the exemplary embodiments, examples of which are illustrated in the accompanying drawings. When the following description refers to the accompanying drawings, like numbers in different drawings represent the same or similar elements unless otherwise indicated. The implementations described in the exemplary embodiments below are not intended to represent all implementations consistent with the present disclosure. Rather, they are merely examples of apparatus and methods consistent with certain aspects of the present disclosure, as detailed in the appended claims.
The terminology used in the present disclosure is for the purpose of describing particular embodiments only and is not intended to limit the disclosure. As used in this disclosure and the appended claims, the singular forms "a," "an," and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It should also be understood that the term "and/or" as used herein refers to and encompasses any and all possible combinations of one or more of the associated listed items.
It is to be understood that although the terms first, second, third, etc. may be used herein to describe various information, such information should not be limited by these terms. These terms are only used to distinguish one type of information from another. For example, first information may also be referred to as second information, and similarly, second information may also be referred to as first information, without departing from the scope of the present disclosure. The word "if," as used herein, may be interpreted as "when," "upon," or "in response to determining," depending on the context.
The embodiments of the present disclosure provide an image semantic segmentation method that can be used on machine devices, such as mobile machines including robots, driverless vehicles, and drones. Alternatively, the method provided by the embodiments of the present disclosure can be implemented by a processor running computer-executable code.
Image semantic segmentation estimates the object class to which each pixel in an input RGB (Red Green Blue) image belongs; the object classes may include, but are not limited to, grass, people, cars, buildings, sky, and the like. The result is a semantic map of the same size as the RGB image, in which each pixel carries a label of the object class it belongs to. For example, fig. 1A is an RGB image and fig. 1B is the corresponding semantic image.
In the embodiments of the present disclosure, feature extraction is performed on the image to be processed acquired by a machine device to obtain a first feature image; a plurality of context features with different ranges are then synchronously extracted from the first feature image, multiple times, to obtain a plurality of second feature images, and a target image is determined at least according to the plurality of second feature images. Finally, a semantic image can be generated based on the most recently obtained target image. Through multiple rounds of context feature extraction and fusion, context information at different scales can be fully fused and the precision of semantic segmentation improved. The machine device can then avoid obstacles ahead of it according to the semantic image corresponding to the image to be processed and reasonably plan a driving route, offering high usability.
The above is only an exemplary application scenario of the present disclosure, and other scenarios that can be used in the image semantic segmentation method of the present disclosure all belong to the protection scope of the present disclosure.
As shown in fig. 2, an image semantic segmentation method according to an exemplary embodiment includes the following steps:
in step 101, feature extraction is performed on the acquired image to be processed to obtain a first feature image.
In the embodiments of the present disclosure, the image to be processed may be a real-time image acquired by a camera preset on the machine device, containing the various objects located ahead on the machine device's moving route. The image to be processed may also be an image previously acquired by the machine device, or an image requiring semantic segmentation that another device has sent to the machine device.
The original image information included in the image to be processed is converted into a group of features with clear physical or statistical significance to obtain a first feature image; alternatively, high-dimensional image features can be extracted from the image to be processed through a convolutional network, such as ResNet (residual network) or VGG (Visual Geometry Group), to obtain the first feature image.
When extracting features from the image to be processed, features such as Haar (Haar-like features), LBP (Local Binary Pattern), and HOG (Histogram of Oriented Gradients) may be extracted. Haar describes local pixel-value shading information of the image, LBP describes local texture information, and HOG describes local shape edge gradient information. Alternatively, high-dimensional visual features of the image to be processed may be extracted.
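As a concrete illustration, the sketch below extracts a first feature image with a truncated convolutional backbone. This is a minimal sketch under assumptions: the text only names ResNet and VGG as candidate networks, so PyTorch/torchvision, ResNet-18, the input size, and the truncation point are all illustrative choices.

```python
import torch
import torchvision.models as models

# Step 101 sketch: keep everything up to the last residual stage of a
# ResNet-18 and drop the pooling/classifier head, leaving a feature extractor.
backbone = models.resnet18(weights=None)
feature_extractor = torch.nn.Sequential(*list(backbone.children())[:-2])

image = torch.randn(1, 3, 512, 512)             # RGB image to be processed
first_feature_image = feature_extractor(image)  # shape (1, 512, 16, 16)
```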
In step 102, a plurality of context features with different ranges are synchronously extracted from the first feature image, and a plurality of second feature images are obtained.
In the embodiments of the present disclosure, context feature extraction collects statistics on the distribution of the other pixels in the neighborhood of each pixel in the first feature image.
Context feature extraction with different ranges refers to extraction performed at different pixel spacings. For example, context features may be extracted from the pixels of the first feature image at intervals of 3, 7, and 12 pixels, respectively, to obtain a plurality of second feature images.
In step 103, a target image is determined at least according to the plurality of second feature images, and the target image is taken as a new first feature image to synchronously extract a plurality of context features with different ranges again.
In the embodiments of the present disclosure, the target image is the image obtained in each round from at least the plurality of second feature images. After the target image is determined, it may be used as a new first feature image and step 102 executed again.
In step 104, in response to that the number of times of synchronously extracting the plurality of context features with different ranges from the first feature image reaches a target number of times, generating a semantic image corresponding to the image to be processed based on the target image obtained last time.
In the disclosed embodiment, the target number of times may be a positive integer greater than or equal to 2.
In the above embodiment, feature extraction may be performed on the acquired image to be processed to obtain a first feature image, and a plurality of context features with different ranges may then be synchronously extracted from the first feature image to obtain a plurality of second feature images. A target image is determined according to at least the plurality of second feature images, and the target image is taken as a new first feature image from which a plurality of context features with different ranges are again synchronously extracted. When the number of times that context features with different ranges have been synchronously extracted from the first feature image reaches the target number of times, a semantic image corresponding to the image to be processed can be generated by semantic segmentation based on the most recently obtained target image. By synchronously extracting context features with different ranges from the feature image corresponding to the image to be processed multiple times, the embodiments of the present disclosure fully fuse context information at different scales and improve semantic segmentation precision.
In some optional embodiments, for step 101, a feature extraction network may be used: the acquired image to be processed is input into the feature extraction network, which outputs the first feature image. The feature extraction network may be any neural network capable of feature extraction, such as ResNet or VGG.
In some alternative embodiments, such as shown in FIG. 3, step 102 may include:
in step 102-1, dimension reduction is performed on the first feature image synchronously over a plurality of channels, obtaining a plurality of third feature images.
In the embodiments of the present disclosure, dimension reduction is applied to the first feature image so that context features can be better extracted afterwards, which helps reduce the amount of subsequent computation. After the first feature image has been reduced over a plurality of channels, context features with different ranges can be extracted from the reduced images corresponding to the respective channels.
In the embodiments of the present disclosure, dimension reduction to the same dimension may be performed on the first feature image over a plurality of channels. For example, as shown in fig. 4, after multi-channel dimension reduction using convolution layers with 1 × 1 kernels, each of the resulting third feature images may have dimension 1 × 1 × 256.
In step 102-2, context features with different ranges are extracted from at least two of the third feature images, and a plurality of second feature images are obtained.
In the embodiments of the present disclosure, context features with different ranges may be extracted from at least two of the third feature images by using depthwise-separable dilated (atrous) convolutions whose kernels correspond to different dilation rates. For the dilated convolution, a 3 × 3 kernel may be selected, or a 5 × 5 or 7 × 7 kernel may be used. The dilation rate r can be set to different values according to the scene being segmented; for example, r can be set to 6, 12, 18, 32, and so on, and context features are extracted at pixel spacings determined by the value of r.
For example, as shown in fig. 4, after dimension reduction over 4 channels, 4 third feature images are obtained. Context feature extraction may be skipped for third feature image 1, while third feature images 2, 3, and 4 use dilation rates r of 6, 12, and 18 respectively; that is, context features are extracted at intervals of 6, 12, and 18 pixels, yielding three second feature images.
In the above embodiment, the first feature image may be subjected to dimension reduction processing in multiple channels to obtain multiple third feature images, and then the corresponding multiple second feature images are obtained by extracting context features with different ranges from at least two of the multiple third feature images. The method achieves the purpose of synchronously extracting the context features with different ranges from the first feature image, is beneficial to improving the accuracy of semantic segmentation and reduces the calculation amount in the semantic segmentation process.
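The following minimal PyTorch sketch combines steps 102-1 and 102-2 under stated assumptions: per-branch 1 × 1 dimension reduction, followed by depthwise-separable dilated convolutions with the rates 6, 12, and 18 from the example above; one branch skips context extraction, as in fig. 4. All channel counts are assumptions for illustration.

```python
import torch
import torch.nn as nn

class ContextBranches(nn.Module):
    """Sketch of steps 102-1/102-2: multi-channel 1x1 reduction, then
    depthwise-separable dilated convolutions at different dilation rates."""
    def __init__(self, in_ch=512, mid_ch=256, rates=(6, 12, 18)):
        super().__init__()
        # One 1x1 reduction per branch (produces the third feature images).
        self.reduce = nn.ModuleList(
            [nn.Conv2d(in_ch, mid_ch, kernel_size=1)
             for _ in range(len(rates) + 1)])
        # Dilated depthwise conv + 1x1 pointwise conv = depthwise-separable.
        self.context = nn.ModuleList([
            nn.Sequential(
                nn.Conv2d(mid_ch, mid_ch, kernel_size=3, padding=r,
                          dilation=r, groups=mid_ch),    # depthwise, dilated
                nn.Conv2d(mid_ch, mid_ch, kernel_size=1) # pointwise
            ) for r in rates])

    def forward(self, x):
        thirds = [reduce(x) for reduce in self.reduce]   # third feature images
        # Branch 0 skips context extraction; the others become second images.
        seconds = [conv(t) for conv, t in zip(self.context, thirds[1:])]
        return thirds, seconds
```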
In some alternative embodiments, such as shown in fig. 5, the step 103 of determining the target image according to at least the plurality of second feature images may include:
in step 103-1, at least the plurality of second feature images are fused to obtain a fourth feature image.
In the embodiments of the present disclosure, at least the plurality of second feature images obtained in the preceding steps may be superimposed to obtain the fourth feature image.
For example, the plurality of second feature images are stacked and a convolution operation then fuses the multi-scale context features, yielding the fourth feature image. Alternatively, the plurality of second feature images are concatenated to obtain the fourth feature image.
In step 103-2, the target image is determined based on at least the fourth feature image.
In one possible implementation, the fourth feature image may be directly used as the target image. In another possible implementation manner, processing that can improve the semantic segmentation effect may be performed on the fourth feature image, so as to obtain the target image. In another possible implementation manner, the target image may also be determined according to the fourth feature image and other feature images associated with the image to be processed.
In the above embodiment, the target image may be determined based on at least the plurality of second feature images, and usability is high.
In some optional embodiments, for step 103-1, in one possible implementation, the plurality of second feature images may be superimposed to obtain the fourth feature image. To better retain the feature information of the image to be processed and improve segmentation accuracy, in another possible implementation, the plurality of second feature images may be superimposed with at least one of the plurality of third feature images, and the resulting image used as the fourth feature image.
In the embodiments of the present disclosure, the plurality of second feature images and at least one of the third feature images may be stacked together, after which a convolution operation fuses the multi-scale context features to obtain the fourth feature image. Alternatively, the plurality of second feature images may be concatenated with at least one of the third feature images to obtain the fourth feature image.
In the above embodiment, the plurality of second feature images may be directly superimposed to obtain the fourth feature image, or they may further be superimposed with at least one third feature image that did not undergo context feature extraction. This is highly usable, fuses information at more scales, and improves segmentation accuracy.
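A possible sketch of the fusion in step 103-1, under the same assumptions as the sketches above: stack feature maps along the channel dimension, then fuse the multi-scale context with a 1 × 1 convolution. The kernel size of the fusing convolution is an assumption.

```python
import torch
import torch.nn as nn

class Fuse(nn.Module):
    """Sketch of step 103-1: superimpose (concatenate) the second feature
    images, optionally with a third feature image, and fuse by convolution.
    in_ch must equal the summed channel count of the stacked maps."""
    def __init__(self, in_ch: int, out_ch: int = 256):
        super().__init__()
        self.conv = nn.Conv2d(in_ch, out_ch, kernel_size=1)

    def forward(self, maps):
        stacked = torch.cat(maps, dim=1)  # stack along the channel dimension
        return self.conv(stacked)         # fourth feature image
```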
In some alternative embodiments, the target image may be determined in any of the following ways for step 103-2.
In a possible implementation manner, the fourth feature image is up-sampled to obtain the target image.
In the embodiments of the present disclosure, since the target image subsequently needs to undergo dimension reduction or semantic image generation, the fourth feature image needs to be up-sampled in order to maintain the dimension of the target image. After the fourth feature image is determined, it is directly up-sampled (e.g., by linear interpolation) to obtain the target image. The target image is then taken as a new first feature image and step 102 is executed.
When the fourth feature image is up-sampled, the up-sampling factor t may be 2, 4, 8, and so on, and the same or different factors may be used in each round of up-sampling. When enlarging the original image by the up-sampling factor, a suitable interpolation algorithm inserts new pixels between the existing ones; for example, when t is 2, linear interpolation can be used to insert a new pixel between each pair of adjacent pixels.
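As a sketch of this variant (the tensor shape, factor t = 2, and bilinear mode are illustrative assumptions consistent with the examples above):

```python
import torch
import torch.nn.functional as F

# Step 103-2, first variant: plainly up-sample the fourth feature image.
fourth = torch.randn(1, 256, 16, 16)  # fourth feature image (assumed shape)
target = F.interpolate(fourth, scale_factor=2, mode="bilinear",
                       align_corners=False)  # target image, (1, 256, 32, 32)
```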
In another possible implementation manner, the fourth feature image is subjected to sub-pixel convolution to obtain the target image.
Sub-pixel convolution increases the spatial resolution of the feature map by rearranging pixels from the channel (depth) dimension of the output feature map into its spatial dimensions, so that the depth of the feature map becomes smaller while the spatial scale of the two-dimensional plane becomes larger.
Performing sub-pixel convolution on the fourth feature image can improve the segmentation effect and make the segmentation result more accurate. Further, up-sampling may be performed after the sub-pixel convolution to obtain the target image; the target image is then used as a new first feature image, and step 102 is executed again.
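A sketch of the sub-pixel variant using PyTorch's PixelShuffle, which performs exactly the channel-to-space rearrangement described above; the channel counts and factor are assumptions.

```python
import torch
import torch.nn as nn

# Step 103-2, second variant: nn.PixelShuffle(t) turns (B, C*t*t, H, W)
# into (B, C, H*t, W*t), i.e. depth shrinks while spatial scale grows.
t = 2
subpixel = nn.Sequential(
    nn.Conv2d(256, 64 * t * t, kernel_size=3, padding=1),  # expand channels
    nn.PixelShuffle(t),                                    # channels -> space
)
fourth = torch.randn(1, 256, 16, 16)
target = subpixel(fourth)  # target image, (1, 64, 32, 32)
```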
In another possible implementation, note that the first feature image was previously subjected to dimension reduction, so the subsequent images are all derived from the reduced third feature images, whereas the final semantic image is a high-dimensional image like the image to be processed. To reduce the possibility that important features of the image to be processed are lost after dimension reduction, a fifth feature image may be acquired before the target image is determined.
The fifth feature image is obtained by extracting low-dimensional image features from the image to be processed; the number of feature-extraction layers corresponding to the fifth feature image is smaller than the number corresponding to the first feature image. For example, if 10 layers of feature extraction produce the first feature image, the output of the first 4 layers may be used as the fifth feature image.
Accordingly, for example, as shown in fig. 6, the method may further include:
in step 105, a fifth feature image is obtained after feature extraction and dimension reduction processing are performed on the image to be processed.
In this implementation, the fourth feature image and the fifth feature image may be superimposed and then up-sampled to obtain the target image.
When the number of times that the plurality of context features with different ranges have been synchronously extracted from the first feature image is less than the target number of times, step 103-2 may superimpose the fourth and fifth feature images and then up-sample the result to obtain the target image; when that number of times reaches the target number of times, only the fourth feature image is up-sampled to obtain the target image.
The target image is taken as a new first feature image and step 102 is executed again. The up-sampling factor t may be the same or different in each round.
In another possible implementation, the target image may also be determined from the fourth feature image and the fifth feature image.
When the number of times that the plurality of context features with different ranges have been synchronously extracted from the first feature image is less than the target number of times, an image obtained by performing sub-pixel convolution on the fourth feature image is superimposed with the fifth feature image to obtain the target image. When that number of times reaches the target number of times, sub-pixel convolution is performed directly on the fourth feature image to obtain the target image.
In the embodiments of the present disclosure, to ensure the segmentation effect, when the number of extraction rounds is less than the target number, the fourth feature image may first undergo sub-pixel convolution, and the resulting image is superimposed with the fifth feature image to obtain the target image. Once the target number is reached, sub-pixel convolution can be performed directly on the fourth feature image to obtain the target image; up-sampling may further be performed after the sub-pixel convolution.
The target image is then taken as a new first feature image, and step 102 is executed again.
It should be noted that, each time the target image is taken as a new first feature image for another round of synchronous context feature extraction, the dimensions of the new third feature images obtained when the new first feature image is reduced over multiple channels may be the same as, or different from, the dimensions of the third feature images from the previous round. For example, the previous round may yield third feature images of dimension 1 × 1 × 256, while the new round yields third feature images of dimension 1 × 1 × 128.
Further, the dilation rates used when applying dilated convolution to the third feature images may also be the same or different across rounds. For example, the previous round may apply dilated convolutions with rates 6, 12, and 18 to at least two of the third feature images, while the new round uses rates 6 and 12.
In the above embodiments, the target image may be determined from the fourth feature image in any of these ways, ensuring the precision and accuracy of semantic segmentation with high usability.
In some optional embodiments, to ensure that the dimension of the final semantic image is consistent with that of the image to be processed, dimension reduction and/or dimension raising may be performed before the target image is output, ensuring that the dimension of the target image is the target dimension. The target dimension is determined according to the total number of object classes included in the preset semantic image.
For example, the target dimension may be 1 × 1 × 16N, where N is the preset total number of object classes in the semantic image. If 4 object classes need to be distinguished in the semantic image, the target dimension may be 1 × 1 × 64.
In the above embodiment, the most recently obtained target image may undergo dimension reduction and/or dimension raising before being output (for example, a convolution with a preset number of output channels), ensuring that its dimension matches the target dimension and improving segmentation accuracy and precision.
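One hedged reading of the 1 × 1 × 16N target dimension, not stated explicitly in the text, is that it pairs with a sub-pixel factor of 4: since 16 = 4 × 4, PixelShuffle(4) turns 16N channels into N channels at 4× the spatial resolution.

```python
import torch
import torch.nn as nn

# Assumed pairing of 16N channels with a sub-pixel factor of 4.
N = 4                               # total number of object classes
x = torch.randn(1, 16 * N, 32, 32)  # feature map at the target dimension
out = nn.PixelShuffle(4)(x)         # (1, N, 128, 128): one channel per class
```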
In some optional embodiments, for step 104, after the target image is obtained for the last time, the semantic image may be generated using an interpolation algorithm, which may include, but is not limited to, a bilinear interpolation algorithm.
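Putting step 104 into a sketch under assumed shapes: bilinearly up-sample the last target image to the input resolution, then take the per-pixel argmax over the class channels to obtain the semantic image.

```python
import torch
import torch.nn.functional as F

logits = torch.randn(1, 4, 128, 128)  # last target image, N = 4 classes
logits = F.interpolate(logits, size=(512, 512), mode="bilinear",
                       align_corners=False)
semantic_image = logits.argmax(dim=1)  # (1, 512, 512) per-pixel class labels
```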
To further illustrate the above embodiment, for example as shown in fig. 7, the acquired image to be processed may be input into a fully convolutional neural network, which outputs the corresponding semantic image.
The fully convolutional neural network may include a front-end network and a back-end network.
The front-end network may be a feature extraction network; neural networks such as ResNet and VGG may be used.
When training the front-end network, a manually labeled image classification dataset, such as ImageNet, may be used. ImageNet includes images and corresponding image labels; the network parameters of the front-end network are adjusted until its output matches the label content in the ImageNet sample set or falls within a fault-tolerance range.
The first feature image corresponding to the image to be processed is obtained through the front-end network, and is then input into the back-end network to obtain the semantic image output by the back-end network.
When training the back-end network, a manually labeled semantic segmentation dataset, such as Cityscapes, can be used; the network parameters of the whole neural network, including those of the front-end and back-end networks, are trained through a back-propagation algorithm until the output of the back-end network matches the label content in the Cityscapes sample set or falls within a fault-tolerance range.
For convenience in describing the network architecture of the back-end network, the embodiments of the present disclosure illustrate only the case where the target number of times is 2; it should be noted that cases where the target number is any other positive integer greater than 2 also fall within the protection scope of the present disclosure.
In one possible implementation, the network architecture of the back-end network may be as shown in fig. 8A.
Through sub-network 1, dimension reduction is performed on the first feature image synchronously over a plurality of channels, obtaining a plurality of third feature images. Context features with different ranges are then extracted from at least two of the third feature images through depthwise-separable dilated convolutions whose kernels correspond to different dilation rates, obtaining a plurality of second feature images.
Further, the plurality of second feature images may be superimposed and up-sampled (the up-sampling is not shown in fig. 8A) to obtain the target image; or the plurality of second feature images may be superimposed with at least one third feature image that did not undergo context feature extraction, and the result up-sampled to obtain the target image.
The target image is directly taken as a new first feature image; through sub-network 2, the new first feature image is reduced synchronously over a plurality of channels, again obtaining a plurality of third feature images. Context features with different ranges are then extracted from at least two of these third feature images, for example by depthwise-separable dilated convolutions with kernels corresponding to different dilation rates, obtaining a plurality of second feature images. The plurality of second feature images are superimposed and up-sampled to obtain the target image, or the plurality of second feature images are superimposed with at least one third feature image that did not undergo context feature extraction and then up-sampled to obtain the target image.
The semantic image is generated by applying a bilinear interpolation algorithm to the target image output by sub-network 2.
In this embodiment, context features in multiple different ranges can be synchronously extracted from the first feature image multiple times and fused, fully fusing context information at different scales and improving semantic segmentation precision. Because depthwise-separable dilated convolution is used, the amount of computation in the segmentation process is reduced.
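For orientation, the sketch below strings the earlier pieces into a Fig. 8A-style back-end with the target number of times equal to 2, reusing the ContextBranches and Fuse sketches above; every channel size, rate, and the class count remains an assumption.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class BackEnd(nn.Module):
    """Sketch of a Fig. 8A-style back-end: two rounds of reduction, dilated
    depthwise-separable context extraction, fusion, and up-sampling."""
    def __init__(self, in_ch=512, mid_ch=256, num_classes=4,
                 rates=(6, 12, 18)):
        super().__init__()
        self.sub1 = ContextBranches(in_ch, mid_ch, rates)
        self.fuse1 = Fuse(mid_ch * (len(rates) + 1), mid_ch)
        self.sub2 = ContextBranches(mid_ch, mid_ch // 2, rates)
        self.fuse2 = Fuse((mid_ch // 2) * (len(rates) + 1), num_classes)

    def forward(self, first):
        thirds, seconds = self.sub1(first)          # sub-network 1
        fourth = self.fuse1(seconds + [thirds[0]])  # superimpose + fuse
        target = F.interpolate(fourth, scale_factor=2, mode="bilinear",
                               align_corners=False)
        thirds, seconds = self.sub2(target)         # target as new first image
        fourth = self.fuse2(seconds + [thirds[0]])
        # The semantic image is produced afterwards by bilinear interpolation
        # to the input size plus a per-pixel argmax (step 104).
        return F.interpolate(fourth, scale_factor=2, mode="bilinear",
                             align_corners=False)
```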
In another possible implementation, the network architecture of the back-end network may be as shown in fig. 8B.
Through sub-network 1, dimension reduction is performed on the first feature image synchronously over a plurality of channels, obtaining a plurality of third feature images. Context features with different ranges are then extracted from at least two of the third feature images, for example by depthwise-separable dilated convolution operations with mutually different dilation rates, obtaining a plurality of second feature images.
To improve the segmentation effect, the plurality of second feature images can be superimposed and then undergo sub-pixel convolution and up-sampling to obtain the target image. Alternatively, the plurality of second feature images may be superimposed with at least one third feature image that did not undergo context feature extraction, followed by sub-pixel convolution and up-sampling (the up-sampling is not shown in fig. 8B), to obtain the target image.
The target image is directly taken as a new first feature image; through sub-network 2, the new first feature image is reduced synchronously over a plurality of channels, again obtaining a plurality of third feature images. Context features with different ranges are then extracted from at least two of these third feature images, for example by depthwise-separable dilated convolution operations with mutually different dilation rates, obtaining a plurality of second feature images. The second feature images are again superimposed and undergo sub-pixel convolution and up-sampling to obtain the target image, or the second feature images are superimposed with at least one third feature image that did not undergo context feature extraction, followed by sub-pixel convolution and up-sampling, to obtain the target image.
The semantic image is generated by applying a bilinear interpolation algorithm to the target image output by sub-network 2.
In this embodiment, context features in multiple different ranges can be synchronously extracted from the first feature image multiple times and fused, fully fusing context information at different scales and improving semantic segmentation precision. Because depthwise-separable dilated convolution is used, the amount of computation in the segmentation process is reduced. In addition, sub-pixel convolution improves the segmentation effect.
In another possible implementation, the network architecture of the back-end network may be as shown in fig. 8C.
Through the sub-network 1, the first characteristic image is divided into a plurality of channels to be synchronously subjected to dimensionality reduction, and a plurality of third characteristic images are obtained. Then, at least two context features with different ranges in the plurality of third feature images are extracted, for example, a depth separable hole convolution operation may be performed, and hole coefficients are different from each other, so as to obtain a plurality of second feature images.
Further, the plurality of second feature images and at least one third feature image that is not subjected to the context feature extraction may be superimposed, and then superimposed with the fifth feature image, and the superimposed image is up-sampled (the up-sampling process is not shown in fig. 8C), so as to obtain the target image. The number of layers of feature extraction corresponding to the fifth feature image is smaller than that of the feature extraction corresponding to the first feature image.
Directly taking the target image as a new first characteristic image, synchronously performing dimensionality reduction on the new first characteristic image by a plurality of channels through a sub-network 2, and obtaining a plurality of third characteristic images again. Then, at least two context features with different ranges in the plurality of third feature images are extracted, for example, a depth separable hole convolution operation may be performed, and hole coefficients are different from each other, so as to obtain a plurality of second feature images. And superposing the plurality of second characteristic images and at least one third characteristic image which is not subjected to the context characteristic extraction, and then performing up-sampling on the superposed images to obtain the target image.
The semantic image is generated by applying a bilinear interpolation algorithm to the target image output by sub-network 2.
In another possible implementation, the network architecture of the back-end network may be as shown in fig. 8D.
Through sub-network 1, the first feature image is split into a plurality of channels for synchronous dimension reduction, yielding a plurality of third feature images. Context features with at least two different ranges are then extracted from the plurality of third feature images, for example by performing depthwise separable dilated convolution operations with mutually different dilation rates, to obtain a plurality of second feature images.
Further, after the plurality of second feature images and at least one third feature image that has not undergone context feature extraction are superimposed, sub-pixel convolution and upsampling are performed (the upsampling process is not shown in fig. 8D), and the result is then superimposed with the fifth feature image to obtain the target image. The number of feature-extraction layers corresponding to the fifth feature image is smaller than that corresponding to the first feature image.
The target image is directly taken as a new first feature image, which is split by sub-network 2 into a plurality of channels for synchronous dimension reduction, yielding a plurality of third feature images again. Context features with at least two different ranges are then extracted from the plurality of third feature images, for example by performing depthwise separable dilated convolution operations with mutually different dilation rates, to obtain a plurality of second feature images. The plurality of second feature images are superimposed with at least one third feature image that has not undergone context feature extraction, and sub-pixel convolution and upsampling are performed on the superimposed image to obtain the target image.
In addition, in order to ensure that the dimension of the target image equals the target dimension, after the plurality of second feature images and at least one third feature image that has not undergone context feature extraction are superimposed, dimension reduction and then dimension raising are performed on the superimposed image, followed by sub-pixel convolution and upsampling, to obtain the target image.
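A minimal sketch of this dimension adjustment follows, assuming 1x1 convolutions for the reduction and raising steps and a target dimension of 19 classes; both choices are assumptions for illustration.

```python
import torch
import torch.nn as nn

# Squeeze the superimposed features with a 1x1 convolution, then raise them to
# target_dim * r^2 channels so that sub-pixel convolution outputs exactly
# target_dim channels (one channel per object class).
target_dim, r = 19, 2
superimposed = torch.randn(1, 184, 60, 80)              # assumed fused feature map

reduce = nn.Conv2d(184, 64, kernel_size=1)              # dimension reduction
raise_dim = nn.Conv2d(64, target_dim * r ** 2, kernel_size=1)  # dimension raising
shuffle = nn.PixelShuffle(r)                            # sub-pixel convolution

target = shuffle(raise_dim(reduce(superimposed)))       # (1, 19, 120, 160)
```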
The semantic image is generated by applying a bilinear interpolation algorithm to the target image output by sub-network 2.
In this embodiment, context features in a plurality of different ranges can be synchronously extracted from the first feature image multiple times and fused, so that context information at different scales is fully integrated and the semantic segmentation precision is improved. Because depthwise separable dilated convolution is adopted, the amount of computation in the semantic segmentation process is reduced. In addition, because the fifth feature image is used in determining the target image, important information in the image to be processed is not lost, which improves the semantic segmentation effect.
In some alternative embodiments, for example as shown in fig. 9, after step 104 is completed, the method may further comprise:
in step 106, machine device navigation is performed according to the semantic image.
In the embodiments of the present disclosure, the machine device can be navigated according to the generated semantic image. For example, if the semantic image includes an obstacle, the device can navigate so as to avoid the obstacle; if the semantic image includes a fork in the road, the device can decide whether to go straight or turn according to a specified route.
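Purely as a hypothetical sketch (the class ids OBSTACLE and FORK, the route flag and the returned action strings are all assumptions, not part of this disclosure), such a decision step might look like:

```python
import numpy as np

OBSTACLE, FORK = 3, 7   # hypothetical class ids in the semantic image

def plan_turn(route_says_turn: bool) -> str:
    # Placeholder for route following: the specified route decides whether
    # to go straight or turn at the fork.
    return "turn" if route_says_turn else "go_straight"

def navigate(semantic_image: np.ndarray, route_says_turn: bool = False) -> str:
    if (semantic_image == OBSTACLE).any():
        return "avoid_obstacle"
    if (semantic_image == FORK).any():
        return plan_turn(route_says_turn)
    return "go_straight"

# Example: a 4x4 semantic map containing a fork region.
demo = np.zeros((4, 4), dtype=np.int64)
demo[2:, 2:] = FORK
print(navigate(demo, route_says_turn=True))   # -> "turn"
```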
In this embodiment, machine device navigation can be performed according to the semantic image generated for the image to be processed, which makes the method highly practical.
Corresponding to the foregoing method embodiments, the present disclosure also provides embodiments of an apparatus.
As shown in fig. 10, fig. 10 is a block diagram of an image semantic segmentation apparatus according to an exemplary embodiment. The apparatus includes: a feature extraction module 210 configured to perform feature extraction on the acquired image to be processed to obtain a first feature image; a context feature extraction module 220 configured to synchronously extract a plurality of context features with different ranges from the first feature image to obtain a plurality of second feature images; a determining module 230 configured to determine a target image according to at least the plurality of second feature images, and to take the target image as a new first feature image from which a plurality of context features with different ranges are again synchronously extracted; and a semantic image generation module 240 configured to generate, in response to the number of times of synchronously extracting the plurality of context features with different ranges from the first feature image reaching a target number of times, a semantic image corresponding to the image to be processed based on the target image obtained the last time.
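A hedged skeleton showing how these four modules could be composed is given below; `front_end`, `context_branch` and `head` stand in for concrete networks and are assumptions for illustration, not the layers of this disclosure.

```python
import torch.nn as nn

class SemanticSegmenter(nn.Module):
    """Skeleton mirroring the four modules of the apparatus."""
    def __init__(self, front_end, context_branch, head, target_times=2):
        super().__init__()
        self.front_end = front_end            # feature extraction module 210
        self.context_branch = context_branch  # context extraction 220 + determining 230
        self.head = head                      # semantic image generation module 240
        self.target_times = target_times

    def forward(self, image):
        feat = self.front_end(image)              # first feature image
        for _ in range(self.target_times):        # repeat until the target number of times
            feat = self.context_branch(feat)      # target image -> new first feature image
        return self.head(feat)                    # generate the semantic image
```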
In some optional embodiments, the context feature extraction module comprises: a first processing submodule configured to split the first feature image into a plurality of channels for synchronous dimension reduction to obtain a plurality of third feature images; and a second processing submodule configured to extract context features with different ranges from at least two of the plurality of third feature images to obtain the plurality of second feature images.
In some optional embodiments, the second processing submodule is configured to: extract context features with different ranges from at least two of the plurality of third feature images by using depthwise separable dilated convolution whose convolution kernels correspond to mutually different dilation rates, to obtain the plurality of second feature images.
In some optional embodiments, the determining module comprises: a first determining submodule configured to at least fuse the plurality of second feature images to obtain a fourth feature image; and a second determining submodule configured to determine the target image at least according to the fourth feature image.
In some optional embodiments, the first determining submodule is configured to: superimpose the plurality of second feature images to obtain the fourth feature image; or superimpose the plurality of second feature images with at least one of the plurality of third feature images to obtain the fourth feature image.
In some optional embodiments, the second determining submodule is configured to: upsample the fourth feature image to obtain the target image; or perform sub-pixel convolution on the fourth feature image to obtain the target image.
In some optional embodiments, the apparatus further comprises: a processing module configured to perform feature extraction and dimension reduction on the image to be processed to obtain a fifth feature image, where the number of feature-extraction layers corresponding to the fifth feature image is smaller than that corresponding to the first feature image. The second determining submodule is configured to: in the case that the number of times is smaller than the target number of times, superimpose the fourth feature image and the fifth feature image and then upsample the superimposed image to obtain the target image; or, in the case that the number of times is smaller than the target number of times, superimpose an image obtained by performing sub-pixel convolution on the fourth feature image with the fifth feature image to obtain the target image.
In some optional embodiments, the dimension corresponding to the target image obtained the last time is a target dimension, where the target dimension is determined according to the total number of object classes included in the preset semantic image.
In some optional embodiments, the apparatus further comprises: a navigation module configured to navigate the machine device according to the semantic image.
For the device embodiments, since they substantially correspond to the method embodiments, reference may be made to the partial description of the method embodiments for relevant points. The above-described embodiments of the apparatus are merely illustrative, and the units described as separate parts may or may not be physically separate, and the parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the modules can be selected according to actual needs to achieve the purpose of the disclosed solution. One of ordinary skill in the art can understand and implement it without inventive effort.
The embodiments of the present disclosure also provide a computer-readable storage medium storing a computer program for executing any one of the above image semantic segmentation methods.
In some optional embodiments, the embodiments of the present disclosure provide a computer program product comprising computer-readable code which, when run on a device, causes a processor in the device to execute instructions for implementing the image semantic segmentation method provided in any one of the above embodiments.
In some optional embodiments, the present disclosure further provides another computer program product for storing computer-readable instructions which, when executed, cause a computer to perform the operations of the image semantic segmentation method provided in any one of the above embodiments.
The computer program product may be implemented in hardware, software, or a combination thereof. In one optional embodiment, the computer program product is embodied as a computer storage medium; in another optional embodiment, it is embodied as a software product, such as a software development kit (SDK).
The embodiment of the present disclosure further provides an image semantic segmentation apparatus, including:
a processor;
a memory for storing processor-executable instructions;
wherein the processor is configured to call the executable instructions stored in the memory to implement any one of the image semantic segmentation methods described above.
Fig. 11 is a schematic hardware structure diagram of an image semantic segmentation apparatus according to an embodiment of the present application. The image semantic segmentation apparatus 310 includes a processor 311, and may further include an input device 312, an output device 313, and a memory 314. The input device 312, the output device 313, the memory 314, and the processor 311 are connected to each other via a bus.
The memory includes, but is not limited to, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM), or a portable read-only memory (CD-ROM), which is used for storing instructions and data.
The input means are for inputting data and/or signals and the output means are for outputting data and/or signals. The output means and the input means may be separate devices or may be an integral device.
The processor may include one or more processors, for example, one or more Central Processing Units (CPUs), and in the case of one CPU, the CPU may be a single-core CPU or a multi-core CPU.
The memory is used to store program codes and data of the network device.
The processor is used for calling the program codes and data in the memory and executing the steps in the method embodiment. Specifically, reference may be made to the description of the method embodiment, which is not repeated herein.
It will be appreciated that fig. 11 only shows a simplified design of an image semantic segmentation means. In practical applications, the image semantic segmentation apparatus may further include other necessary components, including but not limited to any number of input/output devices, processors, controllers, memories, etc., and all the image semantic segmentation apparatuses that may implement the embodiments of the present application are within the scope of the present application.
In some embodiments, functions of or modules included in the apparatus provided in the embodiments of the present disclosure may be used to execute the method described in the above method embodiments, and specific implementation thereof may refer to the description of the above method embodiments, and for brevity, will not be described again here.
Other embodiments of the disclosure will be apparent to those skilled in the art from consideration of the specification and practice of the disclosure disclosed herein. This disclosure is intended to cover any variations, uses, or adaptations of the disclosure following, in general, the principles of the disclosure and including such departures from the present disclosure as come within known or customary practice within the art to which the disclosure pertains. It is intended that the specification and examples be considered as exemplary only, with a true scope and spirit of the disclosure being indicated by the following claims.
The above description is only exemplary of the present disclosure and should not be taken as limiting the disclosure, as any modification, equivalent replacement, or improvement made within the spirit and principle of the present disclosure should be included in the scope of the present disclosure.

Claims (12)

1. An image semantic segmentation method, comprising:
performing feature extraction on the acquired image to be processed to obtain a first feature image;
synchronously extracting a plurality of context features with different ranges from the first feature image to obtain a plurality of second feature images;
determining a target image according to at least the plurality of second feature images, and taking the target image as a new first feature image from which a plurality of context features with different ranges are again synchronously extracted;
and generating a semantic image corresponding to the image to be processed based on the target image obtained at the last time in response to the fact that the number of times of synchronously extracting the context features with different ranges from the first feature image reaches a target number of times.
2. The method according to claim 1, wherein the synchronously extracting a plurality of context features with different ranges from the first feature image to obtain a plurality of second feature images comprises:
splitting the first feature image into a plurality of channels for synchronous dimension reduction to obtain a plurality of third feature images;
and extracting context features with different ranges from at least two of the third feature images to obtain a plurality of second feature images.
3. The method according to claim 2, wherein the extracting context features with different ranges from at least two of the plurality of third feature images to obtain a plurality of second feature images comprises:
and extracting context features with different ranges from at least two of the plurality of third feature images by using depthwise separable dilated convolution whose convolution kernels correspond to mutually different dilation rates, to obtain the plurality of second feature images.
4. The method according to any one of claims 1-3, wherein determining the target image based on at least the plurality of second feature images comprises:
at least fusing the plurality of second feature images to obtain a fourth feature image;
and determining the target image at least according to the fourth feature image.
5. The method according to claim 4, wherein said at least fusing the plurality of second feature images to obtain a fourth feature image comprises:
superimposing the plurality of second feature images to obtain the fourth feature image; or
superimposing the plurality of second feature images with at least one of the plurality of third feature images to obtain the fourth feature image.
6. The method according to claim 4 or 5, wherein determining the target image based on at least the fourth feature image comprises:
upsampling the fourth feature image to obtain the target image; or,
performing sub-pixel convolution on the fourth feature image to obtain the target image.
7. The method according to claim 4 or 5, characterized in that the method further comprises:
performing feature extraction and dimension reduction on the image to be processed to obtain a fifth feature image, wherein the number of feature-extraction layers corresponding to the fifth feature image is smaller than that corresponding to the first feature image;
determining the target image at least according to the fourth feature image, including:
in the case that the number of times is smaller than the target number of times, superimposing the fourth feature image and the fifth feature image and then upsampling the superimposed image to obtain the target image;
or,
in the case that the number of times is smaller than the target number of times, superimposing an image obtained by performing sub-pixel convolution on the fourth feature image with the fifth feature image to obtain the target image.
8. The method according to any one of claims 1-7, wherein the dimension corresponding to the target image obtained last time is a target dimension; wherein the target dimension is determined according to a total number of object classes included in the preset semantic image.
9. The method according to any one of claims 1-8, wherein after generating the semantic image corresponding to the image to be processed, the method further comprises:
and navigating the machine equipment according to the semantic image.
10. An apparatus for semantic segmentation of an image, the apparatus comprising:
the characteristic extraction module is used for extracting the characteristics of the acquired image to be processed to obtain a first characteristic image;
the context feature extraction module is used for synchronously extracting a plurality of context features with different ranges from the first feature image to obtain a plurality of second feature images;
the determining module is used for determining a target image at least according to the plurality of second characteristic images, and synchronously extracting a plurality of context characteristics with different ranges again by taking the target image as a new first characteristic image;
and the semantic image generating module is used for responding to the condition that the number of times of synchronously extracting the context features with different ranges from the first feature image reaches a target number of times, and generating a semantic image corresponding to the image to be processed based on the target image obtained at the last time.
11. A computer-readable storage medium, characterized in that the storage medium stores a computer program for executing the image semantic segmentation method according to any one of the claims 1 to 9.
12. An image semantic segmentation apparatus, comprising:
a processor;
a memory for storing the processor-executable instructions;
wherein the processor is configured to invoke executable instructions stored in the memory to implement the image semantic segmentation method of any one of claims 1-9.
CN201911397645.2A 2019-12-30 2019-12-30 Image semantic segmentation method and device and storage medium Pending CN111179283A (en)

Priority Applications (5)

Application Number Priority Date Filing Date Title
CN201911397645.2A CN111179283A (en) 2019-12-30 2019-12-30 Image semantic segmentation method and device and storage medium
JP2021521414A JP2022518647A (en) 2019-12-30 2020-04-09 Image semantic segmentation methods and devices, storage media
KR1020217011555A KR20210088546A (en) 2019-12-30 2020-04-09 Image semantic segmentation method and apparatus, storage medium
PCT/CN2020/084050 WO2021134970A1 (en) 2019-12-30 2020-04-09 Image semantic segmentation method and device and storage medium
TW109114127A TWI728791B (en) 2019-12-30 2020-04-28 Image semantic segmentation method, device and storage medium thereof

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201911397645.2A CN111179283A (en) 2019-12-30 2019-12-30 Image semantic segmentation method and device and storage medium

Publications (1)

Publication Number Publication Date
CN111179283A true CN111179283A (en) 2020-05-19

Family

ID=70650537

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911397645.2A Pending CN111179283A (en) 2019-12-30 2019-12-30 Image semantic segmentation method and device and storage medium

Country Status (5)

Country Link
JP (1) JP2022518647A (en)
KR (1) KR20210088546A (en)
CN (1) CN111179283A (en)
TW (1) TWI728791B (en)
WO (1) WO2021134970A1 (en)

Family Cites Families (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103207879B (en) * 2012-01-17 2016-03-30 阿里巴巴集团控股有限公司 The generation method and apparatus of image index
CN107784654B (en) * 2016-08-26 2020-09-25 杭州海康威视数字技术股份有限公司 Image segmentation method and device and full convolution network system
CN106447676B (en) * 2016-10-12 2019-01-22 浙江工业大学 A kind of image partition method based on fast density clustering algorithm
CN106886801B (en) * 2017-04-14 2021-12-17 北京图森智途科技有限公司 Image semantic segmentation method and device
CN108961253A (en) * 2018-06-19 2018-12-07 深动科技(北京)有限公司 A kind of image partition method and device
AU2018101336A4 (en) * 2018-09-12 2018-10-11 Hu, Yuan Miss Building extraction application based on machine learning in Urban-Suburban-Integration Area
CN109325534B (en) * 2018-09-22 2020-03-17 天津大学 Semantic segmentation method based on bidirectional multi-scale pyramid
CN109559315B (en) * 2018-09-28 2023-06-02 天津大学 Water surface segmentation method based on multipath deep neural network
CN110263732B (en) * 2019-06-24 2022-01-21 京东方科技集团股份有限公司 Multi-scale target detection method and device

Patent Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20180260956A1 (en) * 2017-03-10 2018-09-13 TuSimple System and method for semantic segmentation using hybrid dilated convolution (hdc)
CN110232394A (en) * 2018-03-06 2019-09-13 华南理工大学 A kind of multi-scale image semantic segmentation method
CN109034210A (en) * 2018-07-04 2018-12-18 国家新闻出版广电总局广播科学研究院 Object detection method based on super Fusion Features Yu multi-Scale Pyramid network
CN109344883A (en) * 2018-09-13 2019-02-15 西京学院 Fruit tree diseases and pests recognition methods under a kind of complex background based on empty convolution
US10282864B1 (en) * 2018-09-17 2019-05-07 StradVision, Inc. Method and device for encoding image and testing method and testing device using the same
US10402977B1 (en) * 2019-01-25 2019-09-03 StradVision, Inc. Learning method and learning device for improving segmentation performance in road obstacle detection required to satisfy level 4 and level 5 of autonomous vehicles using laplacian pyramid network and testing method and testing device using the same
CN110378484A (en) * 2019-04-28 2019-10-25 清华大学 A kind of empty spatial convolution pyramid pond context learning method based on attention mechanism
CN110163801A (en) * 2019-05-17 2019-08-23 深圳先进技术研究院 A kind of Image Super-resolution and color method, system and electronic equipment

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
LEI FAN ET AL.: "Semantic Segmentation With Global Encoding and Dilated Decoder in Street Scenes", IEEE Access *
SUN Jun et al.: "Recognition of crop seedlings and weeds by a convolutional neural network combining dilated convolution and global pooling", Transactions of the Chinese Society of Agricultural Engineering *

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112258431A (en) * 2020-09-27 2021-01-22 成都东方天呈智能科技有限公司 Image classification model based on mixed depth separable expansion convolution and classification method thereof
CN112258431B (en) * 2020-09-27 2021-07-20 成都东方天呈智能科技有限公司 Image classification model based on mixed depth separable expansion convolution and classification method thereof

Also Published As

Publication number Publication date
TW202125408A (en) 2021-07-01
WO2021134970A1 (en) 2021-07-08
TWI728791B (en) 2021-05-21
KR20210088546A (en) 2021-07-14
JP2022518647A (en) 2022-03-16

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
REG Reference to a national code: ref country code: HK; ref legal event code: DE; ref document number: 40020090
RJ01 Rejection of invention patent application after publication (application publication date: 20200519)