TW202125408A - Image semantic segmentation method, device and storage medium thereof

Image semantic segmentation method, device and storage medium thereof

Info

Publication number
TW202125408A
Authority
TW
Taiwan
Prior art keywords
image
feature
characteristic
images
target
Prior art date
Application number
TW109114127A
Other languages
Chinese (zh)
Other versions
TWI728791B (en)
Inventor
張展鵬
成慧
張凱鵬
Original Assignee
中國商深圳市商湯科技有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 中國商深圳市商湯科技有限公司
Application granted
Publication of TWI728791B
Publication of TW202125408A

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06T - IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 7/00 - Image analysis
    • G06T 7/10 - Segmentation; Edge detection
    • G06T 7/11 - Region-based segmentation
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00 - Arrangements for image or video recognition or understanding
    • G06V 10/40 - Extraction of image or video features
    • G06T 2207/00 - Indexing scheme for image analysis or image enhancement
    • G06T 2207/10 - Image acquisition modality
    • G06T 2207/10004 - Still image; Photographic image
    • G06T 2207/10024 - Color image
    • G06T 2207/20 - Special algorithmic details
    • G06T 2207/20081 - Training; Learning
    • G06T 2207/20084 - Artificial neural networks [ANN]
    • G06T 2207/20112 - Image segmentation details

Abstract

The embodiments of the present disclosure disclose an image semantic segmentation method, device and storage medium. The method includes: performing feature extraction on an acquired image to be processed to obtain a first feature image; synchronously extracting multiple context features of different ranges from the first feature image to obtain multiple second feature images; determining a target image at least according to the multiple second feature images, and synchronously extracting context features of different ranges again with the target image as the new first feature image; and, in response to the number of times multiple context features of different ranges have been synchronously extracted from the first feature image reaching a target number of times, generating a semantic image corresponding to the image to be processed based on the target image obtained last.

Description

Image semantic segmentation method, device and storage medium

The present disclosure relates to the field of deep learning, and in particular to an image semantic segmentation method and device, and a storage medium.

For movable machine equipment, semantic segmentation can be performed on images captured by an on-board camera to obtain a semantic understanding of the scene, thereby enabling functions such as obstacle avoidance and navigation.

At present, on the one hand, the computing resources of movable machine equipment are often limited for reasons of cost and mobility. On the other hand, such equipment needs to interact with the real environment in real time. How to perform real-time semantic segmentation under limited computing resources is therefore a challenging technical problem.

Embodiments of the present disclosure provide an image semantic segmentation method and device, and a storage medium.

According to a first aspect of the embodiments of the present disclosure, an image semantic segmentation method is provided. The method includes: performing feature extraction on an acquired image to be processed to obtain a first feature image; synchronously extracting multiple context features of different ranges from the first feature image to obtain multiple second feature images; determining a target image at least according to the multiple second feature images, and synchronously extracting multiple context features of different ranges again with the target image as the new first feature image; and, in response to the number of times multiple context features of different ranges have been synchronously extracted from the first feature image reaching a target number of times, generating a semantic image corresponding to the image to be processed based on the target image obtained last.

In some optional embodiments, synchronously extracting multiple context features of different ranges from the first feature image to obtain multiple second feature images includes: synchronously performing dimensionality reduction on the first feature image over multiple channels to obtain multiple third feature images; and extracting context features of different ranges from at least two of the multiple third feature images to obtain multiple second feature images.

In some optional embodiments, extracting context features of different ranges from at least two of the multiple third feature images to obtain multiple second feature images includes: using depthwise separable convolution whose kernels are dilated (atrous) convolutions with different dilation coefficients, extracting context features of different ranges from at least two of the multiple third feature images to obtain the multiple second feature images.

In some optional embodiments, determining the target image at least according to the multiple second feature images includes: fusing at least the multiple second feature images to obtain a fourth feature image; and determining the target image at least according to the fourth feature image.

In some optional embodiments, fusing at least the multiple second feature images to obtain the fourth feature image includes: superimposing the multiple second feature images to obtain the fourth feature image; or superimposing the multiple second feature images and at least one of the multiple third feature images to obtain the fourth feature image.

In some optional embodiments, determining the target image at least according to the fourth feature image includes: up-sampling the fourth feature image to obtain the target image; or performing sub-pixel convolution on the fourth feature image to obtain the target image.

In some optional embodiments, the method further includes: performing feature extraction and dimensionality reduction on the image to be processed to obtain a fifth feature image, where the number of feature-extraction layers corresponding to the fifth feature image is smaller than the number of feature-extraction layers corresponding to the first feature image. Determining the target image at least according to the fourth feature image then includes: when the number of times is smaller than the target number of times, superimposing the fourth feature image and the fifth feature image and then up-sampling the result to obtain the target image; or, when the number of times is smaller than the target number of times, superimposing the image obtained by performing sub-pixel convolution on the fourth feature image with the fifth feature image to obtain the target image.

In some optional embodiments, the dimension corresponding to the target image obtained last is a target dimension, where the target dimension is determined according to a preset total number of object categories included in the semantic image.

In some optional embodiments, after the semantic image corresponding to the image to be processed is generated, the method further includes: navigating machine equipment according to the semantic image.

According to a second aspect of the embodiments of the present disclosure, an image semantic segmentation device is provided. The device includes: a feature extraction module configured to perform feature extraction on an acquired image to be processed to obtain a first feature image; a context feature extraction module configured to synchronously extract multiple context features of different ranges from the first feature image to obtain multiple second feature images; a determining module configured to determine a target image at least according to the multiple second feature images and to synchronously extract multiple context features of different ranges again with the target image as the new first feature image; and a semantic image generation module configured to, in response to the number of times multiple context features of different ranges have been synchronously extracted from the first feature image reaching a target number of times, generate a semantic image corresponding to the image to be processed based on the target image obtained last.

In some optional embodiments, the context feature extraction module includes: a first processing sub-module configured to synchronously perform dimensionality reduction on the first feature image over multiple channels to obtain multiple third feature images; and a second processing sub-module configured to extract context features of different ranges from at least two of the multiple third feature images to obtain multiple second feature images.

In some optional embodiments, the second processing sub-module is configured to use depthwise separable convolution whose kernels are dilated convolutions with different dilation coefficients to extract context features of different ranges from at least two of the multiple third feature images, obtaining the multiple second feature images.

In some optional embodiments, the determining module includes: a first determining sub-module configured to fuse at least the multiple second feature images to obtain a fourth feature image; and a second determining sub-module configured to determine the target image at least according to the fourth feature image.

In some optional embodiments, the first determining sub-module is configured to superimpose the multiple second feature images to obtain the fourth feature image, or to superimpose the multiple second feature images and at least one of the multiple third feature images to obtain the fourth feature image.

In some optional embodiments, the second determining sub-module is configured to up-sample the fourth feature image to obtain the target image, or to perform sub-pixel convolution on the fourth feature image to obtain the target image.

In some optional embodiments, the device further includes: a processing module configured to perform feature extraction and dimensionality reduction on the image to be processed to obtain a fifth feature image, where the number of feature-extraction layers corresponding to the fifth feature image is smaller than the number of feature-extraction layers corresponding to the first feature image. The second determining sub-module is configured to, when the number of times is smaller than the target number of times, superimpose the fourth feature image and the fifth feature image and then up-sample the result to obtain the target image; or, when the number of times is smaller than the target number of times, superimpose the image obtained by performing sub-pixel convolution on the fourth feature image with the fifth feature image to obtain the target image.

In some optional embodiments, the dimension corresponding to the target image obtained last is a target dimension, where the target dimension is determined according to a preset total number of object categories included in the semantic image.

In some optional embodiments, the device further includes a navigation module configured to navigate machine equipment according to the semantic image.

According to a third aspect of the embodiments of the present disclosure, a computer-readable storage medium is provided. The storage medium stores a computer program, and the computer program is used to execute the image semantic segmentation method of any one of the above first aspect.

According to a fourth aspect of the embodiments of the present disclosure, an image semantic segmentation device is provided, including: a processor; and a memory for storing instructions executable by the processor; where the processor is configured to call the executable instructions stored in the memory to implement the image semantic segmentation method of any one of the first aspect.

According to a fifth aspect of the embodiments of the present disclosure, a computer program is provided, the computer program causing a computer to execute the image semantic segmentation method of any one of the first aspect of the embodiments of the present disclosure.

The technical solutions provided by the embodiments of the present disclosure may include the following beneficial effects:

In the embodiments of the present disclosure, feature extraction may be performed on an acquired image to be processed to obtain a first feature image, and multiple context features of different ranges may then be synchronously extracted from the first feature image to obtain multiple second feature images. A target image is determined at least according to the multiple second feature images, and multiple context features of different ranges are synchronously extracted again with the target image as the new first feature image. When the number of times multiple context features of different ranges have been synchronously extracted from the first feature image reaches the target number of times, a semantic image corresponding to the image to be processed can be generated by semantic segmentation based on the target image obtained last. By synchronously extracting multiple context features of different ranges from the feature images corresponding to the image to be processed several times, the embodiments of the present disclosure fully fuse context information of different scales and improve the accuracy of semantic segmentation.

In the embodiments of the present disclosure, dimensionality reduction may first be performed on the first feature image over multiple channels to obtain multiple third feature images, and context features of different ranges may then be extracted from at least two of the multiple third feature images to obtain the corresponding multiple second feature images. This achieves the purpose of synchronously extracting multiple context features of different ranges from the first feature image, helps to improve the accuracy of semantic segmentation, and reduces the amount of computation in the semantic segmentation process.

In the embodiments of the present disclosure, depthwise separable convolution with kernels corresponding to dilated convolutions of different dilation coefficients may be used to extract context features of different ranges from at least two of the multiple third feature images. This achieves the purpose of synchronously extracting multiple context features of different ranges from the first feature image and helps to improve the accuracy of semantic segmentation.

In the embodiments of the present disclosure, the multiple second feature images may be directly superimposed to obtain the fourth feature image, or the multiple second feature images and at least one of the multiple third feature images may be superimposed to obtain the fourth feature image. This offers high usability, fuses information at more scales, and improves the accuracy of semantic segmentation.

In the embodiments of the present disclosure, in order to maintain the dimension of the target image, the fourth feature image may be up-sampled to obtain the target image. Alternatively, sub-pixel convolution may be performed on the fourth feature image to improve the effect of semantic segmentation and make the segmentation result more accurate.

In the embodiments of the present disclosure, a fifth feature image may be acquired before the target image is determined. The fifth feature image is obtained by extracting low-dimensional image features from the image to be processed, and the number of feature-extraction layers corresponding to the fifth feature image is smaller than that corresponding to the first feature image. The fourth feature image and the fifth feature image are superimposed and then up-sampled to obtain the target image; when the number of times multiple context features of different ranges have been synchronously extracted from the first feature image reaches the target number of times, only the fourth feature image may be up-sampled to obtain the target image. This reduces the possibility of losing some important features of the image to be processed after dimensionality reduction and improves the accuracy of semantic segmentation.

In the embodiments of the present disclosure, the dimension of the target image obtained last is the target dimension, where the target dimension is determined according to the preset total number of object categories included in the semantic image. This ensures that the dimension of the final semantic image is consistent with that of the image to be processed.

In the embodiments of the present disclosure, machine equipment navigation can be performed according to the semantic image corresponding to the generated image to be processed, which offers high usability.

It should be understood that the above general description and the following detailed description are merely exemplary and explanatory, and do not limit the embodiments of the present disclosure.
[Brief Description of the Drawings]

The accompanying drawings here are incorporated into and constitute a part of this specification; they illustrate embodiments consistent with the present disclosure and, together with the specification, serve to explain the principles of the present disclosure.
FIG. 1A is a color image shown according to an exemplary embodiment of the present disclosure;
FIG. 1B is a semantic image shown according to an exemplary embodiment of the present disclosure;
FIG. 2 is a flowchart of an image semantic segmentation method shown according to an exemplary embodiment of the present disclosure;
FIG. 3 is a flowchart of another image semantic segmentation method shown according to an exemplary embodiment of the present disclosure;
FIG. 4 is a schematic diagram of a scene of extracting context features of different ranges shown according to an exemplary embodiment of the present disclosure;
FIG. 5 is a flowchart of another image semantic segmentation method shown according to an exemplary embodiment of the present disclosure;
FIG. 6 is a flowchart of another image semantic segmentation method shown according to an exemplary embodiment of the present disclosure;
FIG. 7 is a schematic diagram of a neural network architecture for obtaining a semantic image shown according to an exemplary embodiment of the present disclosure;
FIG. 8A is a schematic architecture diagram of a back-end sub-network shown according to an exemplary embodiment of the present disclosure;
FIG. 8B is a schematic architecture diagram of another back-end sub-network shown according to an exemplary embodiment of the present disclosure;
FIG. 8C is a schematic architecture diagram of another back-end sub-network shown according to an exemplary embodiment of the present disclosure;
FIG. 8D is a schematic architecture diagram of another back-end sub-network shown according to an exemplary embodiment of the present disclosure;
FIG. 9 is a flowchart of yet another image semantic segmentation method shown according to an exemplary embodiment of the present disclosure;
FIG. 10 is a block diagram of an image semantic segmentation device shown according to an exemplary embodiment of the present disclosure;
FIG. 11 is a schematic structural diagram of a device for image semantic segmentation shown according to an exemplary embodiment of the present disclosure.

Detailed Description

Exemplary embodiments will be described in detail here, examples of which are shown in the accompanying drawings. When the following description refers to the accompanying drawings, unless otherwise indicated, the same numbers in different drawings denote the same or similar elements. The implementations described in the following exemplary embodiments do not represent all implementations consistent with the present disclosure. On the contrary, they are merely examples of devices and methods consistent with some aspects of the present disclosure as detailed in the appended claims.

The terms used in the present disclosure are only for the purpose of describing specific embodiments and are not intended to limit the present disclosure. The singular forms "a", "said" and "the" used in the present disclosure and the appended claims are also intended to include the plural forms, unless the context clearly indicates otherwise. It should also be understood that the term "and/or" used herein refers to and includes any or all possible combinations of one or more of the associated listed items.

It should be understood that although the terms first, second, third, etc. may be used in the present disclosure to describe various information, the information should not be limited to these terms. These terms are only used to distinguish information of the same type from each other. For example, without departing from the scope of the present disclosure, first information may also be referred to as second information, and similarly, second information may also be referred to as first information. Depending on the context, the word "if" as used here may be interpreted as "when", "while", or "in response to determining".

The embodiments of the present disclosure provide an image semantic segmentation method, which can be used for machine equipment, for example movable machine equipment such as robots, unmanned vehicles and drones. Alternatively, the method provided by the embodiments of the present disclosure may be implemented by a processor running computer-executable code.

Image semantic segmentation refers to estimating, for each pixel in an input red-green-blue (RGB) image, the object category to which it belongs. The object categories may include, but are not limited to, various objects such as grass, people, cars, buildings and sky. A semantic map labeled with the object categories, with the same size and dimension as the RGB image, is obtained. For example, FIG. 1A is an RGB image and FIG. 1B is the corresponding semantic image.

In the embodiments of the present disclosure, feature extraction is performed on the image to be processed acquired by the machine equipment to obtain a first feature image; multiple context features of different ranges are then synchronously extracted from the first feature image several times to obtain multiple second feature images, so that a target image is determined at least according to the multiple second feature images; finally, a semantic image can be generated based on the target image obtained last. Through multiple rounds of context feature extraction and fusion, the embodiments of the present disclosure can fully fuse context information of different scales and improve the accuracy of semantic segmentation. According to the semantic image corresponding to the image to be processed, the machine equipment can avoid obstacles in front of it and reasonably plan its travel route, which offers high usability.

The above are only exemplary application scenarios of the present disclosure; other scenarios in which the image semantic segmentation method of the present disclosure can be used also fall within the protection scope of the present disclosure.

As shown in FIG. 2, an image semantic segmentation method according to an exemplary embodiment includes the following steps:

In step 101, feature extraction is performed on the acquired image to be processed to obtain a first feature image.

In the embodiments of the present disclosure, the image to be processed may be a real-time image, which can be captured by a camera preset on the machine equipment; the captured image may include various objects located in front of the moving route of the machine equipment. The image to be processed may also be an image already captured by the machine equipment (for example, an image stored in the machine equipment), or an image that needs semantic segmentation and is sent to the machine equipment by other equipment.

The original image information included in the image to be processed is converted into a set of features with obvious physical or statistical significance, so that the first feature image can be obtained; alternatively, high-dimensional image features may be extracted from the image to be processed through a convolutional network, such as a Residual Network (ResNet) or a Visual Geometry Group (VGG) network, to obtain the first feature image.
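
Purely as an illustrative sketch and not as the patented implementation, the first feature image could be produced by a truncated ResNet backbone; the example below assumes PyTorch and torchvision are available, and the input size and channel counts are arbitrary.

```python
# Hedged sketch: a truncated ResNet-18 (torchvision) used as the feature
# extractor that turns the image to be processed into the first feature image.
import torch
import torch.nn as nn
import torchvision.models as models

resnet = models.resnet18()                                # randomly initialized backbone
backbone = nn.Sequential(*list(resnet.children())[:-2])   # drop avgpool and fc head

image = torch.randn(1, 3, 224, 224)      # stand-in for the acquired RGB image
first_feature = backbone(image)          # high-dimensional features, here (1, 512, 7, 7)
```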

In some embodiments, when performing feature extraction on the image to be processed, features such as Haar-like features (Haar), Local Binary Patterns (LBP) and Histograms of Oriented Gradients (HOG) may be extracted from the image to be processed. Haar features describe the light-dark variation of pixel values in a local area of the image, LBP describes the texture information of the image in a local area, and HOG describes the shape edge gradient information of the image in a local area. Alternatively, in other embodiments, when performing feature extraction on the image to be processed, high-dimensional visual features of the image to be processed may be extracted.

In step 102, multiple context features of different ranges are synchronously extracted from the first feature image to obtain multiple second feature images.

In the embodiments of the present disclosure, context feature extraction gathers statistics on the distribution of the other pixels in the neighborhood of each pixel in the first feature image.

Context feature extraction of different ranges refers to context feature extraction performed at different pixel intervals. For example, when performing context feature extraction on the first feature image, context features may be synchronously extracted at multiple pixel intervals (for example, intervals of 3, 7 and 12 pixels) over the pixels included in the first feature image, obtaining multiple second feature images respectively.

In step 103, a target image is determined at least according to the multiple second feature images, and multiple context features of different ranges are synchronously extracted again with the target image as the new first feature image.

In the embodiments of the present disclosure, the target image is the image obtained each time at least according to the multiple second feature images. After the target image is determined, it can be used as the new first feature image, and step 102 is executed again.

In step 104, in response to the number of times multiple context features of different ranges have been synchronously extracted from the first feature image reaching the target number of times, a semantic image corresponding to the image to be processed is generated based on the target image obtained last.

In the embodiments of the present disclosure, the target number of times may be a positive integer greater than or equal to 2.

In the above embodiments, feature extraction may be performed on the acquired image to be processed to obtain a first feature image, and multiple context features of different ranges may then be synchronously extracted from the first feature image to obtain multiple second feature images. A target image is determined at least according to the multiple second feature images, and multiple context features of different ranges are synchronously extracted again with the target image as the new first feature image. When the number of times multiple context features of different ranges have been synchronously extracted from the first feature image reaches the target number of times, a semantic image corresponding to the image to be processed can be generated by semantic segmentation based on the target image obtained last. By synchronously extracting multiple context features of different ranges from the feature images corresponding to the image to be processed several times, the embodiments of the present disclosure fully fuse context information of different scales and improve the accuracy of semantic segmentation.

In some optional embodiments, for step 101, a feature extraction network may be used: the captured image to be processed is input into the feature extraction network, and the feature extraction network outputs the first feature image. The feature extraction network may be a neural network capable of feature extraction, such as ResNet or VGG.

In some optional embodiments, as shown in FIG. 3, step 102 may include:

In step 102-1, dimensionality reduction is synchronously performed on the first feature image over multiple channels to obtain multiple third feature images.

In the embodiments of the present disclosure, dimensionality reduction of the first feature image is performed for better subsequent context feature extraction and helps to reduce the amount of computation in subsequent processing. By performing dimensionality reduction on the first feature image over multiple channels, context features of different ranges can subsequently be extracted from the reduced images corresponding to the multiple channels, which helps to improve the accuracy of semantic segmentation and reduces the amount of computation in the semantic segmentation process.

In the embodiments of the present disclosure, dimensionality reduction to the same dimension may be synchronously performed on the first feature image over multiple channels. For example, as shown in FIG. 4, the dimension of the multiple third feature images obtained after multi-channel dimensionality reduction using convolutional layers with 1×1 kernels may be 1×1×256.

In step 102-2, context features of different ranges are extracted from at least two of the multiple third feature images to obtain multiple second feature images.

In the embodiments of the present disclosure, depthwise separable convolution with kernels corresponding to dilated (atrous) convolutions of different dilation coefficients may be used to extract context features of different ranges from at least two of the multiple third feature images, obtaining the multiple second feature images. This achieves the purpose of synchronously extracting multiple context features of different ranges from the first feature image and helps to improve the accuracy of semantic segmentation. The dilated convolution may use a 3×3 kernel, or a 5×5 or 7×7 kernel; the kernel size of the dilated convolution is not limited in the embodiments of the present disclosure. The dilation coefficient r of the dilated convolution may be set to different values according to the semantic segmentation scene, for example r may be set to 6, 12, 18 or 32; according to the value of r, context features are extracted at different pixel intervals.
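
For illustration only, a minimal PyTorch sketch of one such branch, i.e. a depthwise separable convolution whose depthwise kernel is dilated, is given below; the class name, the 256-channel width and the default 3×3 kernel are assumptions, not the patented implementation.

```python
# Sketch of a depthwise separable dilated (atrous) convolution branch.
import torch
import torch.nn as nn

class SeparableAtrousConv(nn.Module):
    def __init__(self, channels: int, dilation: int, kernel_size: int = 3):
        super().__init__()
        padding = dilation * (kernel_size - 1) // 2      # keep the spatial size unchanged
        # depthwise part: one dilated filter per channel (groups=channels)
        self.depthwise = nn.Conv2d(channels, channels, kernel_size,
                                   padding=padding, dilation=dilation,
                                   groups=channels, bias=False)
        # pointwise part: 1x1 convolution that mixes the channels
        self.pointwise = nn.Conv2d(channels, channels, kernel_size=1, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.pointwise(self.depthwise(x))

# e.g. a branch with dilation coefficient r = 12 on a 256-channel third feature image
branch = SeparableAtrousConv(channels=256, dilation=12)
second_feature = branch(torch.randn(1, 256, 32, 32))     # same spatial size: (1, 256, 32, 32)
```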

For example, as shown in FIG. 4, after dimensionality reduction of the first feature image over 4 channels, 4 third feature images are obtained, denoted as third feature image 1 to third feature image 4. No context feature extraction may be performed on third feature image 1; the dilation coefficients r corresponding to third feature images 2, 3 and 4 are 6, 12 and 18 respectively, that is, context features are extracted from third feature images 2, 3 and 4 at intervals of 6, 12 and 18 pixels respectively, obtaining three second feature images.
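
A hedged sketch of this four-branch example follows, reusing the SeparableAtrousConv sketch above; the 512-channel input width and the module name are assumptions made only for illustration.

```python
# Sketch of the four-branch arrangement: per-branch 1x1 dimensionality reduction
# into the "third feature images", then dilated depthwise separable branches with
# r = 6, 12, 18; branch 1 skips context extraction.
import torch
import torch.nn as nn

class MultiRangeContext(nn.Module):
    def __init__(self, in_channels: int = 512, reduced: int = 256,
                 dilations=(6, 12, 18)):
        super().__init__()
        # one 1x1 reduction per branch -> the "third feature images"
        self.reduce = nn.ModuleList(nn.Conv2d(in_channels, reduced, 1, bias=False)
                                    for _ in range(len(dilations) + 1))
        # dilated depthwise separable branches -> the "second feature images"
        self.context = nn.ModuleList(SeparableAtrousConv(reduced, d) for d in dilations)

    def forward(self, x: torch.Tensor):
        third = [reduce(x) for reduce in self.reduce]
        second = [conv(t) for conv, t in zip(self.context, third[1:])]
        return third[0], second        # untouched branch 1 and the three context branches

third_1, seconds = MultiRangeContext()(torch.randn(1, 512, 32, 32))
```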

In the above embodiments, dimensionality reduction may first be performed on the first feature image over multiple channels to obtain multiple third feature images, and context features of different ranges may then be extracted from at least two of the multiple third feature images to obtain the corresponding second feature images. This achieves the purpose of synchronously extracting multiple context features of different ranges from the first feature image, helps to improve the accuracy of semantic segmentation, and reduces the amount of computation in the semantic segmentation process.

In some optional embodiments, as shown in FIG. 5, the process of determining the target image at least according to the multiple second feature images in step 103 may include:

In step 103-1, at least the multiple second feature images are fused to obtain a fourth feature image.

In the embodiments of the present disclosure, at least the multiple second feature images obtained in the above steps may be superimposed to obtain the fourth feature image.

For example, the multiple second feature images are stacked together, and the fusion of multi-scale context features is then realized by a convolution operation to obtain the fourth feature image. The multiple second feature images may also be concatenated to obtain the fourth feature image.
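
A hedged sketch of this stacking-and-convolution fusion is given below; the output channel count and the module name are assumptions for illustration only.

```python
# Sketch: concatenate the second feature images (and optionally a third feature
# image) along the channel axis, then fuse them with a 1x1 convolution.
import torch
import torch.nn as nn

class FuseContext(nn.Module):
    def __init__(self, in_channels: int, out_channels: int = 256):
        super().__init__()
        self.mix = nn.Conv2d(in_channels, out_channels, kernel_size=1)

    def forward(self, feature_maps):
        stacked = torch.cat(feature_maps, dim=1)   # stack along the channel dimension
        return self.mix(stacked)                   # fourth feature image

# e.g. four 256-channel maps -> 1024 stacked channels -> a 256-channel fourth feature image
fuse = FuseContext(in_channels=4 * 256)
fourth_feature = fuse([torch.randn(1, 256, 32, 32) for _ in range(4)])
```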

In step 103-2, the target image is determined at least according to the fourth feature image.

In one possible implementation, the fourth feature image may be directly used as the target image. In another possible implementation, processing that improves the semantic segmentation effect may be applied to the fourth feature image to obtain the target image. In yet another possible implementation, the target image may also be determined according to the fourth feature image and other feature images associated with the image to be processed.

In the above embodiments, the target image can be determined at least according to the multiple second feature images, which offers high usability.

In some optional embodiments, for step 103-1, in one possible implementation, the multiple second feature images may be superimposed to obtain the fourth feature image. In order to better preserve the feature information corresponding to the image to be processed and improve the accuracy of semantic segmentation, in another possible implementation, the multiple second feature images and at least one of the multiple third feature images may be superimposed, and the superimposed image is used as the fourth feature image.

The multiple third feature images are the images obtained by synchronously performing dimensionality reduction on the first feature image over multiple channels. In the embodiments of the present disclosure, the multiple second feature images and at least one of the multiple third feature images may be superimposed, and the fusion of multi-scale context features is then realized by a convolution operation to obtain the fourth feature image. The multiple second feature images and at least one of the multiple third feature images may also be concatenated to obtain the fourth feature image.

In the above embodiments, the multiple second feature images may be directly superimposed to obtain the fourth feature image, or the multiple second feature images and at least one of the multiple third feature images on which no context feature extraction was performed may be superimposed to obtain the fourth feature image, which offers high usability, fuses information at more scales, and improves the accuracy of semantic segmentation.

In some optional embodiments, for step 103-2, the target image may be determined in any one of the following ways.

In one possible implementation, determining the target image at least according to the fourth feature image includes: up-sampling the fourth feature image to obtain the target image.

In the embodiments of the present disclosure, since the target image subsequently needs further dimensionality reduction or semantic image generation, the fourth feature image needs to be up-sampled in order to maintain the dimension of the target image. After the fourth feature image is determined, up-sampling (for example, linear interpolation) is performed directly on it to obtain the target image. The target image is then used as the new first feature image, and step 102 is executed again.

When up-sampling the fourth feature image, the corresponding up-sampling factor t may be 2, 4, 8, etc.; each time the fourth feature image is up-sampled, the same or different up-sampling factors may be used. The up-sampling factor determines the number of new pixels inserted between existing pixels using a suitable interpolation algorithm when the original image is enlarged; for example, when the up-sampling factor t is 2, 2 new pixels may be inserted between two adjacent pixels using a linear interpolation algorithm.
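
For illustration, up-sampling by a factor t could be done with bilinear interpolation as sketched below; the use of PyTorch and the bilinear mode are assumptions, and any interpolation scheme could be substituted.

```python
# Sketch: enlarge the fourth feature image by an up-sampling factor t.
import torch
import torch.nn.functional as F

def upsample(x: torch.Tensor, t: int = 2) -> torch.Tensor:
    # inserts interpolated pixels between existing ones, enlarging H and W by t
    return F.interpolate(x, scale_factor=t, mode="bilinear", align_corners=False)

target = upsample(torch.randn(1, 256, 32, 32), t=2)   # (1, 256, 64, 64)
```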

In another possible implementation, determining the target image at least according to the fourth feature image includes: performing sub-pixel convolution on the fourth feature image to obtain the target image.

Sub-pixel convolution tiles the pixels along the depth direction of the output feature map, so that the depth of the feature map becomes smaller while the spatial scale of the two-dimensional plane becomes larger, thereby improving the spatial resolution of the feature map.

Performing sub-pixel convolution on the fourth feature image can improve the effect of semantic segmentation and make the segmentation result more accurate. After the sub-pixel convolution, up-sampling may also be performed; the target image is obtained through the up-sampling, the target image is then used as the new first feature image, and step 102 is executed again.
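
A minimal sketch of the sub-pixel convolution alternative is given below, assuming PyTorch's PixelShuffle; the 3×3 expansion convolution and the module name are assumptions for illustration.

```python
# Sketch: sub-pixel convolution trades channel depth for spatial resolution by
# expanding the channels t*t times and tiling them into a t-times larger plane.
import torch
import torch.nn as nn

class SubPixelUp(nn.Module):
    def __init__(self, channels: int, t: int = 2):
        super().__init__()
        self.expand = nn.Conv2d(channels, channels * t * t, kernel_size=3, padding=1)
        self.shuffle = nn.PixelShuffle(t)   # (N, C*t*t, H, W) -> (N, C, t*H, t*W)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.shuffle(self.expand(x))

target = SubPixelUp(channels=256, t=2)(torch.randn(1, 256, 32, 32))   # (1, 256, 64, 64)
```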

In another possible implementation, considering that dimensionality reduction was previously applied to the first feature image and the subsequent images are all obtained from the multiple dimensionality-reduced third feature images, while the semantic image finally generated is a high-dimensional image like the image to be processed, a fifth feature image may be acquired before the target image is determined, in order to reduce the possibility of losing some important features of the image to be processed after dimensionality reduction and to improve the accuracy of semantic segmentation.

The fifth feature image is obtained by extracting low-dimensional image features from the image to be processed. The number of feature-extraction layers corresponding to the fifth feature image is smaller than the number of feature-extraction layers corresponding to the first feature image. For example, if 10 layers of feature extraction are performed on the image to be processed to obtain the first feature image, the image obtained after feature extraction by the first 4 layers may be used as the fifth feature image.

Correspondingly, as shown in FIG. 6, the above method may further include:

In step 105, feature extraction and dimensionality reduction are performed on the image to be processed to obtain a fifth feature image.

In the embodiments of the present disclosure, the fourth feature image and the fifth feature image may be superimposed and then up-sampled to obtain the target image.

When the number of times multiple context features of different ranges have been synchronously extracted from the first feature image is smaller than the target number of times, step 103-2 may superimpose the fourth feature image and the fifth feature image and then up-sample the result to obtain the target image; when the number of times reaches the target number of times, only the fourth feature image may be up-sampled to obtain the target image. The target image is used as the new first feature image, and step 102 is executed again. The corresponding up-sampling factor t may be the same or different each time up-sampling is performed.
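
The iterative refinement described here could be sketched as the loop below; all function arguments are placeholders, and the assumption that the fifth feature image already matches the spatial size of the fourth feature image is made only for illustration.

```python
# Sketch of the refinement loop: in every round but the last, the fused fourth
# feature image is superimposed with the low-level fifth feature image before
# up-sampling; in the last round only the fourth feature image is up-sampled.
def refine(first_feature, fifth_feature, target_times, extract_context, fuse, upsample):
    x = first_feature
    for i in range(target_times):
        second_features = extract_context(x)       # multi-range context extraction
        fourth_feature = fuse(second_features)     # fusion into the fourth feature image
        if i < target_times - 1:                   # not yet the target number of times
            fourth_feature = fourth_feature + fifth_feature
        x = upsample(fourth_feature)               # target image -> new first feature image
    return x                                       # basis of the final semantic image
```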

In another possible implementation, the target image may likewise be determined according to the fourth feature image and the fifth feature image.

When the number of times multiple context features of different ranges have been synchronously extracted from the first feature image is smaller than the target number of times, the image obtained by performing sub-pixel convolution on the fourth feature image is superimposed with the fifth feature image to obtain the target image. When the number of times reaches the target number of times, sub-pixel convolution is performed directly on the fourth feature image to obtain the target image.

In the embodiments of the present disclosure, in order to ensure the effect of semantic segmentation, when the number of times multiple context features of different ranges have been synchronously extracted from the first feature image is smaller than the target number of times, sub-pixel convolution may likewise first be performed on the fourth feature image, and the resulting image is superimposed with the fifth feature image to obtain the target image. If the number of times reaches the target number of times, sub-pixel convolution may be performed directly on the fourth feature image to obtain the target image. After the sub-pixel convolution of the fourth feature image, up-sampling may also be performed. The target image is then used as the new first feature image, and step 102 is executed again.

It should be noted that each time after the target image is determined and multiple context features of different ranges are synchronously extracted again with the target image as the new first feature image, the dimension of the new multiple third feature images obtained when dimensionality reduction is synchronously performed on the new first feature image over multiple channels may be the same as or different from the dimension of the multiple third feature images obtained after the previous dimensionality reduction. For example, if multiple third feature images of dimension 1×1×256 were obtained after the previous multi-channel dimensionality reduction of the first feature image, multiple new third feature images of dimension 1×1×128 may be obtained after multi-channel dimensionality reduction of the new first feature image.

In addition, the dilation coefficients used each time dilated convolution is performed on the multiple third feature images may also be the same or different. For example, when dilated convolution was previously performed on at least two of the multiple third feature images, the corresponding dilation coefficients may have been 6, 12 and 18, while when dilated convolution is performed on at least two of the new multiple third feature images, the corresponding dilation coefficients may be 6 and 12.

In the above embodiment, a target image can be determined at least from the fourth feature image, which ensures the precision and accuracy of the semantic segmentation and provides high usability.

In some optional embodiments, in order to ensure that the dimension of the finally obtained semantic image is consistent with that of the image to be processed, dimensionality reduction and/or dimensionality increase may be performed before the target image is output, so that the dimension of the target image equals the target dimension. The target dimension is determined according to the preset total number of object categories included in the semantic image.

For example, the target dimension may be 1×1×16N, where N is the preset total number of object categories included in the semantic image. If four object categories need to be distinguished in the semantic image, the target dimension may be 1×1×64.
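As a small illustration, the target dimension can be reached with a 1×1 convolution whose output channel count is 16·N. The input channel count of 128 below is a guess, not something the text specifies.

```python
import torch.nn as nn

N = 4                                   # assumed number of object classes from the example
target_channels = 16 * N                # "target dimension" 1x1x16N -> 64 channels when N = 4
to_target_dim = nn.Conv2d(128, target_channels, kernel_size=1)   # 128 input channels is a guess
```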

In the above embodiment, the dimension of the last obtained target image can be brought to the target dimension by performing dimensionality reduction and/or dimensionality increase before the target image is output (for example, by a convolution layer with a preset number of channels), which improves the accuracy and precision of the semantic segmentation.

In some optional embodiments, for step 104, after the target image is obtained for the last time, an interpolation algorithm may be used to generate the semantic image; the interpolation algorithm may include, but is not limited to, bilinear interpolation.
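A minimal sketch of this final step, assuming the last target image is a tensor of per-class responses; the argmax at the end (to obtain a class-index map) is an assumption about how the semantic image is read out, not something the text states.

```python
import torch.nn.functional as F

def to_semantic_image(target_image, out_hw):
    """Upsample the last target image to the input resolution with bilinear
    interpolation and take the per-pixel argmax over class channels."""
    logits = F.interpolate(target_image, size=out_hw,
                           mode='bilinear', align_corners=False)
    return logits.argmax(dim=1)   # (B, H, W) class-index map
```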

The above embodiments are further illustrated as follows. As shown in Figure 7, the captured image to be processed (for example, the real-time image shown in the figure) may be input into a fully convolutional neural network, and that network outputs the corresponding semantic image.

The fully convolutional neural network may include a front-end sub-network and a back-end sub-network.

The front-end sub-network may be a feature extraction network, and neural networks such as ResNet or VGG may be used.

When training the front-end sub-network, a manually annotated image classification data set such as ImageNet may be used. The ImageNet set includes images and the corresponding image labels; the network parameters of the front-end sub-network are adjusted so that its output matches the labels in the ImageNet set or falls within a tolerance range.

The first feature image corresponding to the image to be processed can be obtained through the front-end sub-network; the first feature image is then input into the back-end sub-network to obtain the semantic image output by the back-end sub-network.

When training the back-end sub-network, a manually annotated image semantic segmentation data set such as CityScapes may be used, and the network parameters of the entire neural network, including those of the front-end and back-end sub-networks, are trained by backpropagation so that the output of the back-end sub-network matches the labels in the CityScapes set or falls within a tolerance range.
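A hedged sketch of such end-to-end training. The placeholder `model`, the toy `loader`, the 19-class setup and the `ignore_index=255` void label are all assumptions made for illustration; they stand in for the real front-end plus back-end network and a CityScapes-style data set.

```python
import torch
import torch.nn as nn

model = nn.Conv2d(3, 19, kernel_size=1)   # placeholder for the real segmentation network
loader = [(torch.randn(2, 3, 64, 64), torch.randint(0, 19, (2, 64, 64)))]

criterion = nn.CrossEntropyLoss(ignore_index=255)   # 255 = assumed "void" label
optimizer = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9)

for image, label in loader:
    logits = model(image)                 # (B, num_classes, H, W)
    loss = criterion(logits, label)       # label: (B, H, W) class indices
    optimizer.zero_grad()
    loss.backward()                       # backpropagation through both sub-networks
    optimizer.step()
```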

To simplify the description of the network architecture used by the back-end sub-network, the embodiments of this disclosure take a target number of times equal to 2 as an example; it should be noted that cases where the target number of times is another positive integer greater than 2 also fall within the protection scope of this disclosure.

In one possible implementation, the network architecture of the back-end sub-network may be as shown in Figure 8A.

Through sub-network 1, the first feature image is first split into multiple channels on which dimensionality reduction is performed synchronously, yielding multiple third feature images. Context features of different ranges are then extracted from at least two of the third feature images, for example by depthwise separable convolutions whose kernels are dilated with different dilation coefficients, yielding multiple second feature images.

Further, the multiple second feature images may be superimposed and then upsampled (the upsampling step is not shown in Figure 8A) to obtain the target image; alternatively, the multiple second feature images and at least one third feature image on which no context feature extraction was performed may be superimposed and then upsampled to obtain the target image.

The target image is directly taken as the new first feature image. Through sub-network 2, the new first feature image is again split into multiple channels for synchronous dimensionality reduction, yielding multiple third feature images. Context features of different ranges are then extracted from at least two of these third feature images, for example by depthwise separable convolutions with kernels dilated by different dilation coefficients, yielding multiple second feature images. The multiple second feature images are again superimposed and upsampled to obtain the target image; alternatively, the multiple second feature images and at least one third feature image on which no context feature extraction was performed may be superimposed and then upsampled to obtain the target image (a sketch of one such round is given after this paragraph).
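A minimal PyTorch sketch of one Figure-8A-style round. The channel sizes (512 in, 256 mid), the modelling of the channel-wise dimensionality reduction as parallel 1×1 convolutions, and the ×2 upsampling factor are all assumptions; only the overall structure (reduce, parallel dilated depthwise-separable branches, overlay, upsample) follows the description.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ContextSubnet(nn.Module):
    """One 'sub-network' round: channel-wise reduction, parallel dilated
    depthwise-separable convolutions, fusion by addition, then upsampling."""

    def __init__(self, in_ch=512, mid_ch=256, rates=(6, 12, 18)):
        super().__init__()
        # "Split into multiple channels and reduce dimension synchronously":
        # modelled as parallel 1x1 convolutions producing the third feature images.
        self.reduce = nn.ModuleList(
            [nn.Conv2d(in_ch, mid_ch, 1, bias=False) for _ in range(len(rates) + 1)])
        # Context branches: dilated depthwise convolution + pointwise convolution.
        self.context = nn.ModuleList([
            nn.Sequential(
                nn.Conv2d(mid_ch, mid_ch, 3, padding=r, dilation=r,
                          groups=mid_ch, bias=False),
                nn.Conv2d(mid_ch, mid_ch, 1, bias=False))
            for r in rates])

    def forward(self, x):
        thirds = [m(x) for m in self.reduce]                     # third feature images
        seconds = [c(t) for c, t in zip(self.context, thirds)]   # second feature images
        fused = torch.stack(seconds + [thirds[-1]], 0).sum(0)    # overlay incl. one plain branch
        return F.interpolate(fused, scale_factor=2,
                             mode='bilinear', align_corners=False)  # target image
```

Two rounds might then be chained as `subnet1 = ContextSubnet(512, 256)` followed by `subnet2 = ContextSubnet(256, 128, rates=(6, 12))`, consistent with the example dimensions and dilation rates given earlier; this pairing is an inference, not stated in the text.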

A bilinear interpolation algorithm is applied to the target image output by sub-network 2 to generate the semantic image.

In the above embodiment, multiple context features of different ranges can be synchronously extracted from the first feature image and fused over several rounds, fully integrating context information at different scales and improving the precision of the semantic segmentation. Because depthwise separable dilated convolution is used, the amount of computation in the semantic segmentation process is reduced.

In another possible implementation, the network architecture of the back-end sub-network may be as shown in Figure 8B.

Through sub-network 1, the first feature image is first split into multiple channels on which dimensionality reduction is performed synchronously, yielding multiple third feature images. Context features of different ranges are then extracted from at least two of the third feature images, for example by depthwise separable dilated convolutions whose dilation coefficients differ from one another, yielding multiple second feature images.

To improve the effect of the semantic segmentation, the multiple second feature images may be superimposed and then subjected to sub-pixel convolution and upsampling to obtain the target image. Alternatively, the multiple second feature images and at least one third feature image on which no context feature extraction was performed may be superimposed and then subjected to sub-pixel convolution and upsampling (the upsampling step is not shown in Figure 8B) to obtain the target image.
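A sketch of such a head, with the sub-pixel convolution realised as a 1×1 channel expansion followed by a pixel shuffle; the channel counts and the ×2 factors are assumptions.

```python
import torch.nn as nn

def make_8b_head(channels, scale=2):
    """Figure-8B-style head sketch: expand channels by scale^2, pixel-shuffle
    (sub-pixel convolution), then a further bilinear upsampling."""
    return nn.Sequential(
        nn.Conv2d(channels, channels * scale * scale, kernel_size=1),
        nn.PixelShuffle(scale),                           # sub-pixel convolution
        nn.Upsample(scale_factor=2, mode='bilinear', align_corners=False),
    )
```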

The target image is directly taken as the new first feature image. Through sub-network 2, the new first feature image is again split into multiple channels for synchronous dimensionality reduction, yielding multiple third feature images. Context features of different ranges are then extracted from at least two of these third feature images, for example by depthwise separable dilated convolutions with mutually different dilation coefficients, yielding multiple second feature images. The multiple second feature images are again superimposed and subjected to sub-pixel convolution and upsampling to obtain the target image; alternatively, the multiple second feature images and at least one third feature image on which no context feature extraction was performed may be superimposed and then subjected to sub-pixel convolution and upsampling to obtain the target image.

A bilinear interpolation algorithm is applied to the target image output by sub-network 2 to generate the semantic image.

In the above embodiment, multiple context features of different ranges can be synchronously extracted from the first feature image and fused over several rounds, fully integrating context information at different scales and improving the precision of the semantic segmentation. Because depthwise separable dilated convolution is used, the amount of computation in the semantic segmentation process is reduced. In addition, the effect of the semantic segmentation can be further improved through sub-pixel convolution.

In another possible implementation, the network architecture of the back-end sub-network may be as shown in Figure 8C.

Through sub-network 1, the first feature image is first split into multiple channels on which dimensionality reduction is performed synchronously, yielding multiple third feature images. Context features of different ranges are then extracted from at least two of the third feature images, for example by depthwise separable dilated convolutions with mutually different dilation coefficients, yielding multiple second feature images.

Further, the multiple second feature images and at least one third feature image on which no context feature extraction was performed may be superimposed, then superimposed with the fifth feature image, and the superimposed image is upsampled (the upsampling step is not shown in Figure 8C) to obtain the target image. The number of feature-extraction layers corresponding to the fifth feature image is smaller than the number of feature-extraction layers corresponding to the first feature image.
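A sketch of the Figure-8C-style fusion with the shallow ("fifth") feature image. It is assumed that the shallow map has already been projected to the same channel count; the spatial alignment step and the ×2 upsampling are also assumptions.

```python
import torch.nn.functional as F

def fuse_with_shallow(fused, fifth_feat):
    """Add a shallow ('fifth') feature image taken from an earlier layer of the
    front-end network, then upsample the result."""
    if fused.shape[-2:] != fifth_feat.shape[-2:]:
        fused = F.interpolate(fused, size=fifth_feat.shape[-2:],
                              mode='bilinear', align_corners=False)
    return F.interpolate(fused + fifth_feat, scale_factor=2,
                         mode='bilinear', align_corners=False)
```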

The target image is directly taken as the new first feature image. Through sub-network 2, the new first feature image is again split into multiple channels for synchronous dimensionality reduction, yielding multiple third feature images. Context features of different ranges are then extracted from at least two of these third feature images, for example by depthwise separable dilated convolutions with mutually different dilation coefficients, yielding multiple second feature images. The multiple second feature images and at least one third feature image on which no context feature extraction was performed are again superimposed, and the superimposed image is upsampled to obtain the target image.

A bilinear interpolation algorithm is applied to the target image output by sub-network 2 to generate the semantic image.

In another possible implementation, the network architecture of the back-end sub-network may be as shown in Figure 8D.

Through sub-network 1, the first feature image is first split into multiple channels on which dimensionality reduction is performed synchronously, yielding multiple third feature images. Context features of different ranges are then extracted from at least two of the third feature images, for example by depthwise separable dilated convolutions with mutually different dilation coefficients, yielding multiple second feature images.

Further, the multiple second feature images and at least one third feature image on which no context feature extraction was performed may be superimposed, subjected to sub-pixel convolution and upsampling (the upsampling step is not shown in Figure 8D), and then superimposed with the fifth feature image to obtain the target image. The number of feature-extraction layers corresponding to the fifth feature image is smaller than the number of feature-extraction layers corresponding to the first feature image.

The target image is directly taken as the new first feature image. Through sub-network 2, the new first feature image is again split into multiple channels for synchronous dimensionality reduction, yielding multiple third feature images. Context features of different ranges are then extracted from at least two of them, for example by depthwise separable dilated convolutions with mutually different dilation coefficients, yielding multiple second feature images. The multiple second feature images and at least one third feature image on which no context feature extraction was performed are again superimposed, and the superimposed image is subjected to sub-pixel convolution and upsampling to obtain the target image.

In addition, to ensure that the dimension of the target image equals the target dimension, after the multiple second feature images and at least one third feature image on which no context feature extraction was performed are superimposed, dimensionality reduction and dimensionality increase may be applied to the superimposed image before the sub-pixel convolution and upsampling, so as to obtain the target image.
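A sketch of such a reduce-then-increase output head. The link between the factor 16 in the 1×1×16N target dimension and a ×4 pixel shuffle (4² = 16, leaving one channel per class) is an inference, and all channel sizes are guesses.

```python
import torch.nn as nn

def make_8d_head(fused_ch, num_classes, upscale=4):
    """Figure-8D-style output head sketch: 1x1 reduction, 1x1 expansion to the
    assumed 16*N target dimension, then sub-pixel convolution."""
    return nn.Sequential(
        nn.Conv2d(fused_ch, 128, kernel_size=1),                        # dimensionality reduction
        nn.Conv2d(128, num_classes * upscale * upscale, kernel_size=1), # dimensionality increase
        nn.PixelShuffle(upscale),                                       # sub-pixel convolution
    )
```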

A bilinear interpolation algorithm is applied to the target image output by sub-network 2 to generate the semantic image.

In the above embodiment, multiple context features of different ranges can be synchronously extracted from the first feature image and fused over several rounds, fully integrating context information at different scales and improving the precision of the semantic segmentation. Because depthwise separable dilated convolution is used, the amount of computation in the semantic segmentation process is reduced. In addition, using the fifth feature image to determine the target image ensures that important information in the image to be processed is not lost, which likewise improves the effect of the semantic segmentation.

In some optional embodiments, as shown in Figure 9, after step 104 is completed the method may further include: in step 106, performing navigation of a machine or device according to the semantic image.

In this embodiment of the present disclosure, the machine or device can be navigated according to the generated semantic image. For example, if the semantic image contains an obstacle, navigation to avoid the obstacle can be performed; if the semantic image contains a road fork, whether to go straight or turn can be determined according to the designated route.
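A toy decision rule along these lines, given a semantic map of per-pixel class indices. The obstacle class indices and the region of interest are assumptions made for illustration and are not part of the disclosure.

```python
import torch

OBSTACLE_CLASSES = {2, 5}          # hypothetical class indices for obstacles

def should_avoid(semantic_map, roi=slice(None)):
    """Return True if any obstacle class appears in the region of interest of
    the semantic map (a (H, W) tensor of class indices)."""
    ahead = semantic_map[..., roi]
    return any(int(c) in OBSTACLE_CLASSES for c in torch.unique(ahead))
```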

In the above embodiment, navigation of a machine or device can be performed according to the semantic image generated for the image to be processed, which provides high usability.

Corresponding to the foregoing method embodiments, the present disclosure also provides device embodiments.

As shown in Figure 10, which is a block diagram of an image semantic segmentation device according to an exemplary embodiment of the present disclosure, the device includes: a feature extraction module 210 configured to perform feature extraction on the acquired image to be processed to obtain a first feature image; a context feature extraction module 220 configured to synchronously extract multiple context features of different ranges from the first feature image to obtain multiple second feature images; a determination module 230 configured to determine a target image at least according to the multiple second feature images and to take the target image as the new first feature image for another round of synchronous extraction of multiple context features of different ranges; and a semantic image generation module 240 configured, in response to the number of times that multiple context features of different ranges have been synchronously extracted from the first feature image reaching the target number of times, to generate the semantic image corresponding to the image to be processed based on the target image obtained the last time.

In some optional embodiments, the context feature extraction module 220 includes: a first processing sub-module configured to split the first feature image into multiple channels and perform dimensionality reduction on them synchronously to obtain multiple third feature images; and a second processing sub-module configured to extract context features of different ranges from at least two of the multiple third feature images to obtain multiple second feature images.

In some optional embodiments, the second processing sub-module is configured to use depthwise separable convolution with kernels dilated by different dilation coefficients to extract context features of different ranges from at least two of the multiple third feature images, obtaining the multiple second feature images.

In some optional embodiments, the determination module includes: a first determination sub-module configured to fuse at least the multiple second feature images to obtain a fourth feature image; and a second determination sub-module configured to determine the target image at least according to the fourth feature image.

In some optional embodiments, the first determination sub-module is configured to superimpose the multiple second feature images to obtain the fourth feature image; or to superimpose the multiple second feature images with at least one of the multiple third feature images to obtain the fourth feature image.

In some optional embodiments, the second determination sub-module is configured to upsample the fourth feature image to obtain the target image, or to perform sub-pixel convolution on the fourth feature image to obtain the target image.

In some optional embodiments, the device further includes: a processing module configured to perform feature extraction and dimensionality reduction on the image to be processed to obtain a fifth feature image, where the number of feature-extraction layers corresponding to the fifth feature image is smaller than the number of feature-extraction layers corresponding to the first feature image. The second determination sub-module is configured, when the number of extraction rounds is less than the target number of times, to superimpose the fourth feature image and the fifth feature image and then upsample them to obtain the target image; or, when the number of extraction rounds is less than the target number of times, to superimpose the image obtained by sub-pixel convolution of the fourth feature image with the fifth feature image to obtain the target image.

In some optional embodiments, the dimension corresponding to the target image obtained the last time is the target dimension, where the target dimension is determined according to the preset total number of object categories included in the semantic image.

In some optional embodiments, the device further includes a navigation module configured to perform navigation of a machine or device according to the semantic image.

As for the device embodiments, since they essentially correspond to the method embodiments, reference may be made to the description of the method embodiments for the relevant details. The device embodiments described above are merely illustrative; the units described as separate components may or may not be physically separate, and the components shown as units may or may not be physical units, that is, they may be located in one place or distributed over multiple network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the disclosed solution. A person of ordinary skill in the art can understand and implement this without inventive effort.

An embodiment of the present disclosure also provides a computer-readable storage medium storing a computer program, where the computer program is used to execute any of the image semantic segmentation methods described above.

An embodiment of the present disclosure also provides a computer program that causes a computer to execute any of the image semantic segmentation methods described above.

In some optional embodiments, an embodiment of the present disclosure provides a computer program product including computer-readable code; when the computer-readable code runs on a device, a processor in the device executes instructions for implementing the image semantic segmentation method provided by any of the above embodiments.

In some optional embodiments, an embodiment of the present disclosure also provides another computer program product for storing computer-readable instructions which, when executed, cause a computer to perform the operations of the image semantic segmentation method provided by any of the above embodiments.

The computer program product can be implemented in hardware, software, or a combination thereof. In one optional embodiment, the computer program product is embodied as a computer storage medium; in another optional embodiment, it is embodied as a software product, such as a software development kit (SDK).

An embodiment of the present disclosure also provides an image semantic segmentation device, including: a processor; and a memory for storing instructions executable by the processor; where the processor is configured to invoke the executable instructions stored in the memory to implement any of the image semantic segmentation methods described above.

Figure 11 is a schematic diagram of the hardware structure of an image semantic segmentation device provided by an embodiment of the present application. The image semantic segmentation device 310 includes a processor 311 and may also include an input device 312, an output device 313, and a memory 314. The input device 312, the output device 313, the memory 314, and the processor 311 are connected to one another through a bus.

The memory includes, but is not limited to, random access memory (RAM), read-only memory (ROM), erasable programmable read-only memory (EPROM), or compact disc read-only memory (CD-ROM), and is used for the related instructions and data.

The input device is used to input data and/or signals, and the output device is used to output data and/or signals. The output device and the input device may be independent devices or a single integrated device.

The processor may include one or more processors, for example one or more central processing units (CPUs); when the processor is a CPU, the CPU may be a single-core CPU or a multi-core CPU.

The memory is used to store the program code and data of the network device.

The processor is used to invoke the program code and data in the memory to execute the steps in the above method embodiments. For details, refer to the description in the method embodiments, which is not repeated here.

It can be understood that Figure 11 only shows a simplified design of an image semantic segmentation device. In practical applications, the image semantic segmentation device may also contain other necessary elements, including but not limited to any number of input/output devices, processors, controllers, memories, and so on, and all image semantic segmentation devices that can implement the embodiments of the present application fall within the protection scope of the embodiments of this disclosure.

In some embodiments, the functions or modules of the device provided in the embodiments of the present disclosure can be used to execute the methods described in the above method embodiments; for the specific implementation, refer to the description of the above method embodiments, which is not repeated here for brevity.

After considering the specification and practicing the invention disclosed herein, those skilled in the art will readily conceive of other implementations of the embodiments of the present disclosure. The embodiments of the present disclosure are intended to cover any variations, uses, or adaptations that follow the general principles of these embodiments and include common knowledge or customary technical means in the technical field not disclosed herein. The specification and the embodiments are to be regarded as exemplary only, and the true scope and spirit of the embodiments of the present disclosure are indicated by the following claims.

The above descriptions are only optional embodiments of the present disclosure and are not intended to limit the embodiments of the present disclosure. Any modification, equivalent replacement, improvement, and the like made within the spirit and principles of the embodiments of the present disclosure shall be included in the protection scope of the embodiments of the present disclosure.

101, 102, 102-1, 102-2, 103, 103-1, 103-2, 104, 105, 106: steps
210: feature extraction module
220: context feature extraction module
230: determination module
240: semantic image generation module
310: image semantic segmentation device
311: processor
312: input device
313: output device
314: memory

Claims (11)

1. An image semantic segmentation method, comprising: performing feature extraction on an acquired image to be processed to obtain a first feature image; synchronously extracting multiple context features of different ranges from the first feature image to obtain multiple second feature images; determining a target image at least according to the multiple second feature images, and taking the target image as the new first feature image for another round of synchronous extraction of multiple context features of different ranges; and, in response to the number of times that multiple context features of different ranges have been synchronously extracted from the first feature image reaching a target number of times, generating a semantic image corresponding to the image to be processed based on the target image obtained the last time.

2. The method according to claim 1, wherein synchronously extracting multiple context features of different ranges from the first feature image to obtain multiple second feature images comprises: splitting the first feature image into multiple channels and performing dimensionality reduction on them synchronously to obtain multiple third feature images; and extracting context features of different ranges from at least two of the multiple third feature images to obtain the multiple second feature images.

3. The method according to claim 2, wherein extracting context features of different ranges from at least two of the multiple third feature images to obtain the multiple second feature images comprises: using depthwise separable convolution with kernels dilated by different dilation coefficients to extract context features of different ranges from at least two of the multiple third feature images, obtaining the multiple second feature images.

4. The method according to any one of claims 1 to 3, wherein determining a target image at least according to the multiple second feature images comprises: fusing at least the multiple second feature images to obtain a fourth feature image; and determining the target image at least according to the fourth feature image.

5. The method according to claim 4, wherein fusing at least the multiple second feature images to obtain a fourth feature image comprises: superimposing the multiple second feature images to obtain the fourth feature image; or superimposing the multiple second feature images with at least one of the multiple third feature images to obtain the fourth feature image.
6. The method according to claim 4, wherein determining the target image at least according to the fourth feature image comprises: upsampling the fourth feature image to obtain the target image; or performing sub-pixel convolution on the fourth feature image to obtain the target image.

7. The method according to claim 4, further comprising: performing feature extraction and dimensionality reduction on the image to be processed to obtain a fifth feature image, wherein the number of feature-extraction layers corresponding to the fifth feature image is smaller than the number of feature-extraction layers corresponding to the first feature image; wherein determining the target image at least according to the fourth feature image comprises: when the number of extraction rounds is less than the target number of times, superimposing the fourth feature image and the fifth feature image and then upsampling them to obtain the target image; or, when the number of extraction rounds is less than the target number of times, superimposing the image obtained by sub-pixel convolution of the fourth feature image with the fifth feature image to obtain the target image.

8. The method according to any one of claims 1 to 3, wherein the dimension corresponding to the target image obtained the last time is a target dimension, and the target dimension is determined according to a preset total number of object categories included in the semantic image.

9. The method according to any one of claims 1 to 3, wherein after generating the semantic image corresponding to the image to be processed, the method further comprises: performing navigation of a machine or device according to the semantic image.

10. A computer-readable storage medium storing a computer program, wherein the computer program is used to execute the image semantic segmentation method according to any one of claims 1 to 9.

11. An image semantic segmentation device, comprising: a processor; and a memory for storing instructions executable by the processor; wherein the processor is configured to invoke the executable instructions stored in the memory to implement the image semantic segmentation method according to any one of claims 1 to 9.
TW109114127A 2019-12-30 2020-04-28 Image semantic segmentation method, device and storage medium thereof TWI728791B (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201911397645.2A CN111179283A (en) 2019-12-30 2019-12-30 Image semantic segmentation method and device and storage medium
CN201911397645.2 2019-12-30

Publications (2)

Publication Number Publication Date
TWI728791B TWI728791B (en) 2021-05-21
TW202125408A true TW202125408A (en) 2021-07-01

Family

ID=70650537

Family Applications (1)

Application Number Title Priority Date Filing Date
TW109114127A TWI728791B (en) 2019-12-30 2020-04-28 Image semantic segmentation method, device and storage medium thereof

Country Status (5)

Country Link
JP (1) JP2022518647A (en)
KR (1) KR20210088546A (en)
CN (1) CN111179283A (en)
TW (1) TWI728791B (en)
WO (1) WO2021134970A1 (en)


Also Published As

Publication number Publication date
CN111179283A (en) 2020-05-19
JP2022518647A (en) 2022-03-16
KR20210088546A (en) 2021-07-14
WO2021134970A1 (en) 2021-07-08
TWI728791B (en) 2021-05-21

Similar Documents

Publication Publication Date Title
Shivakumar et al. Dfusenet: Deep fusion of rgb and sparse depth information for image guided dense depth completion
CN110674829B (en) Three-dimensional target detection method based on graph convolution attention network
CN111028330B (en) Three-dimensional expression base generation method, device, equipment and storage medium
Liao et al. DR-GAN: Automatic radial distortion rectification using conditional GAN in real-time
CN111160214B (en) 3D target detection method based on data fusion
WO2021057027A1 (en) Human body detection method and apparatus, computer device, and storage medium
JP2022515895A (en) Object recognition method and equipment
CN108961327A (en) A kind of monocular depth estimation method and its device, equipment and storage medium
WO2019169884A1 (en) Image saliency detection method and device based on depth information
JP2023545199A (en) Model training method, human body posture detection method, apparatus, device and storage medium
WO2023138062A1 (en) Image processing method and apparatus
CN112991537B (en) City scene reconstruction method and device, computer equipment and storage medium
Shi et al. An improved lightweight deep neural network with knowledge distillation for local feature extraction and visual localization using images and LiDAR point clouds
CN113111751A (en) Three-dimensional target detection method for self-adaptively fusing visible light and point cloud data
JP2023131117A (en) Joint perception model training, joint perception method, device, and medium
CN104796624B (en) A kind of light field editor transmission method
CN112907573A (en) Depth completion method based on 3D convolution
TWI728791B (en) Image semantic segmentation method, device and storage medium thereof
CN114494395A (en) Depth map generation method, device and equipment based on plane prior and storage medium
CN112767478B (en) Appearance guidance-based six-degree-of-freedom pose estimation method
WO2021098554A1 (en) Feature extraction method and apparatus, device, and storage medium
CN111914809A (en) Target object positioning method, image processing method, device and computer equipment
US20220198707A1 (en) Method and apparatus with object pose estimation
CN114638866A (en) Point cloud registration method and system based on local feature learning
CN114693951A (en) RGB-D significance target detection method based on global context information exploration