WO2019136623A1 - Apparatus and method for semantic segmentation with convolutional neural network


Info

Publication number
WO2019136623A1
Authority
WO
WIPO (PCT)
Prior art keywords
encoder
decoder
feature maps
raw data
data material
Application number
PCT/CN2018/072050
Other languages
French (fr)
Inventor
Zhijie Zhang
Original Assignee
Nokia Technologies Oy
Nokia Technologies (Beijing) Co., Ltd.
Application filed by Nokia Technologies Oy and Nokia Technologies (Beijing) Co., Ltd.
Priority to PCT/CN2018/072050
Publication of WO2019136623A1

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 - Arrangements for image or video recognition or understanding
    • G06V10/70 - Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82 - Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 - Pattern recognition
    • G06F18/20 - Analysing
    • G06F18/24 - Classification techniques
    • G06F18/241 - Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2413 - Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on distances to training or reference patterns
    • G06F18/24133 - Distances to prototypes
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 - Arrangements for image or video recognition or understanding
    • G06V10/40 - Extraction of image or video features
    • G06V10/44 - Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components
    • G06V10/443 - Local feature extraction by matching or filtering
    • G06V10/449 - Biologically inspired filters, e.g. difference of Gaussians [DoG] or Gabor filters
    • G06V10/451 - Biologically inspired filters with interaction between the filter responses, e.g. cortical complex cells
    • G06V10/454 - Integrating the filters into a hierarchical structure, e.g. convolutional neural networks [CNN]
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 - Scenes; Scene-specific elements
    • G06V20/50 - Context or environment of the image
    • G06V20/56 - Context or environment of the image exterior to a vehicle by using sensors mounted on the vehicle

Definitions

  • Figure 3 illustrates an exemplary architecture of a single encoder-decoder network for semantic segmentation.
  • a convolutional neural network 300 is configured to provide semantic segmentation for images based on the architecture of a single encoder-decoder network.
  • the convolutional neural network 300 receives an image input 310 and outputs a corresponding image output 320 with pixel-wise semantic segmentation. It may be implemented in the semantic segmentation circuitry 210.
  • the convolutional neural network 300 may be divided into two parts: an encoder network 340 and a decoder network 360.
  • the image input 310 is fed into the encoder network 340.
  • the encoder network corresponds to a feature extractor that transforms the input image into a multidimensional feature representation
  • the decoder network is a shape generator that produces the object segmentation from the features extracted by the encoder network.
  • Each layer in the encoder network may have a corresponding layer in the decoder network.
  • the final output of the network is a probability map of the same size as the input image, indicating the probability that each pixel belongs to one of the predefined classes.
  • the encoder network 340 includes numerous convolutional layers (341, 342, ..., 351) , which perform multiple series of convolution with a set of filter banks to yield feature maps.
  • the convolution of convolutional layers 341, 342, 344, 345, 347, 348, and 351 may be performed with a stride of one, while the convolution of convolutional layers 343, 346, and 349 may be performed with a stride of two.
  • the resultant image is down-sampled.
  • the down-sampling may be offered with a pooling layer.
  • the decoder network 360 performs multiple series of convolutions, up-sampling or un-pooling on the input feature map (s) produced from the encoder network. This procedure produces sparse feature map (s) .
  • the decoder network 360 includes numerous de-convolution layers and convolutional layers. For example, the de-convolution of de-convolutional layers 361, 363, and 365 may be performed with a stride of two for up-sampling, while the convolution of convolutional layers 362, 364, and 366 may be performed with a stride of one.
  • the high dimensional feature map (s) representation may be fed to a soft-max classifier 367 that classifies every pixel independently.
  • the classifier 367 may be a 1x1 convolution layer. It outputs a number of channels of pixel-wise class probabilities; the number of channels may be equal to the number of classes of objects in the image. A minimal sketch of this single encoder-decoder pattern is given below.
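  • To make this architecture concrete, the following is a minimal PyTorch-style sketch of the single encoder-decoder pattern of Figure 3. The class name SingleEncoderDecoder, the layer counts, and the channel widths are illustrative assumptions, not the exact configuration described in the patent.

```python
# Hedged sketch of a single encoder-decoder segmentation network in the
# style of Figure 3. Channel widths and layer counts are illustrative.
import torch
import torch.nn as nn

class SingleEncoderDecoder(nn.Module):
    def __init__(self, in_channels=3, num_classes=12):
        super().__init__()
        # Encoding path: stride-1 convolutions interleaved with stride-2
        # convolutions that down-sample the feature maps (cf. layers 341-351).
        self.encoder = nn.Sequential(
            nn.Conv2d(in_channels, 64, 3, stride=1, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(64, 64, 3, stride=2, padding=1), nn.ReLU(inplace=True),    # 1/2 resolution
            nn.Conv2d(64, 128, 3, stride=1, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(128, 128, 3, stride=2, padding=1), nn.ReLU(inplace=True),  # 1/4 resolution
            nn.Conv2d(128, 256, 3, stride=1, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(256, 256, 3, stride=2, padding=1), nn.ReLU(inplace=True),  # 1/8 resolution
        )
        # Decoding path: stride-2 transposed convolutions (deconvolutions)
        # up-sample back to the input resolution (cf. layers 361-366).
        self.decoder = nn.Sequential(
            nn.ConvTranspose2d(256, 128, 4, stride=2, padding=1), nn.ReLU(inplace=True),  # 1/4
            nn.Conv2d(128, 128, 3, stride=1, padding=1), nn.ReLU(inplace=True),
            nn.ConvTranspose2d(128, 64, 4, stride=2, padding=1), nn.ReLU(inplace=True),   # 1/2
            nn.Conv2d(64, 64, 3, stride=1, padding=1), nn.ReLU(inplace=True),
            nn.ConvTranspose2d(64, 64, 4, stride=2, padding=1), nn.ReLU(inplace=True),    # 1/1
        )
        # 1x1 convolution classifier (cf. layer 367): one output channel per class.
        self.classifier = nn.Conv2d(64, num_classes, kernel_size=1)

    def forward(self, x):
        # Input spatial size is assumed divisible by 8 in this sketch.
        features = self.decoder(self.encoder(x))
        logits = self.classifier(features)
        # Per-pixel class probabilities at the input resolution.
        return torch.softmax(logits, dim=1)
```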
  • Although single encoder-decoder network solutions have achieved great success, they suffer from several problems that limit their semantic segmentation performance.
  • the predicted segmentation result is produced directly by a single encoding-decoding procedure with supervised learning.
  • real-world scenes always involve many objects and various categories, which makes it difficult for machines to directly fit the non-linear relationship and understand the complex scenes.
  • because the encoding-decoding procedure is weak at learning representative features from coarse to fine, it limits the segmentation performance of a single encoder-decoder network.
  • the problems of traditional single encoder-decoder networks lie in two aspects. Firstly, the features learned by a single encoder-decoder network are weak and not robust: with only one encoding-decoding procedure, learning discriminative features for different objects is difficult, so the semantic information cannot be extracted adequately, causing inaccurate segmentation of objects. Secondly, although a single encoder-decoder network contains multi-layer convolution and multi-layer deconvolution procedures, it is difficult to learn discriminative features and perform complex segmentation prediction simultaneously within one encoder-decoder architecture. This limits the performance of segmentation.
  • This disclosure proposes a solution to these problems of traditional convolutional neural networks for semantic segmentation.
  • a novel architecture of convolutional neural network is configured to extract discriminative features and perform segmentation by conducting several encoding-decoding procedures in succession.
  • the proposed solution may be called Cascaded Encoder-Decoder Networks (CEDNet).
  • different encoder-decoder networks are cascaded for joint semantic segmentation.
  • the input images may be mapped into discriminative features while redundant information is removed. More representative features with more semantic information ease the complex segmentation: it is much easier to fit relationships between representative features than to fit relationships between raw images and segmentation results.
  • the complex scene understanding procedure can be decomposed into several different hierarchical approximations by different encoder-decoder networks.
  • the proposed CEDNet can densely cascade several encoder-decoder networks jointly for semantic segmentation.
  • the architecture of the proposed CEDNet includes N encoder-decoder networks, where N can in principle be any natural number larger than 1.
  • the convolutional neural network 400 includes three encoder-decoder networks, denoted as 430, 440, 450, respectively. Each encoder-decoder network may contain one entire encoding-decoding procedure.
  • each encoder-decoder network can be divided into an encoder network which performs an encoding procedure and thus constructs an encoding path, and a decoder network which performs a decoding procedure and thus constructs a decoding path.
  • each encoder-decoder network may consist of one encoding path and one decoding path.
  • each encoder-decoder network may include an entire encoder-decoder architecture.
  • existing single encoder-decoder networks, such as FCN, SegNet, and other existing encoder-decoder networks designed for semantic segmentation, may be viewed as entire encoder-decoder architectures and embedded into the proposed CEDNet.
  • each encoder-decoder network can take a similar architecture as illustrated in Figure 3.
  • Each convolution block comprises at least one convolution layer that performs convolution with a set of filter banks to produce feature maps.
  • the first convolution block 431 may receive an input image 410, and convolve it with three convolution layers.
  • the produced feature maps are fed into the second block 432 for further convolution.
  • the second convolution block 432 may receive the feature maps from the block 431 and convolve them with three convolution layers. Then, the feature maps produced from the second block 432 are fed into the third block 433 for further convolution.
  • the feature maps of the same block may share a same resolution.
  • the feature maps processed in the convolution layers of block 431 can share a resolution identical with that of the input image 410.
  • the convolution layers of block 431 may perform the convolution with a stride of one, except for the final layer of the block 431, where the convolution can be performed with a stride of two. In this way, the resultant feature maps are down-sampled. Alternatively, the down-sampling may be offered by a pooling layer.
  • each layer in the decoding path can have a corresponding layer in the encoding path.
  • the feature maps processed in blocks in the encoding path can share a same resolution with the feature maps processed in corresponding blocks in the decoding path.
  • the convolution block 431 may correspond to the convolution block 437
  • block 432 may correspond to block 436.
  • the number of feature maps with the same resolution may always be the same.
  • the final convolution layers of some blocks (e.g. blocks 431, 432, 433) may perform convolution with a stride of two for downsampling.
  • this downsampling may be implemented by pooling layers.
  • the first convolution layers of some blocks (e.g. blocks 435, 436, 437) may perform a deconvolution with a stride of two for upsampling, as the stride sketch below illustrates.
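  • As a concrete check of the stride bookkeeping described above, the following hedged snippet shows that a stride-2 convolution halves the spatial resolution, a stride-2 transposed convolution (deconvolution) doubles it, and a pooling layer achieves the same down-sampling. The kernel sizes and channel counts are illustrative assumptions.

```python
import torch
import torch.nn as nn

x = torch.randn(1, 64, 128, 128)  # N x C x H x W feature maps

down = nn.Conv2d(64, 64, kernel_size=3, stride=2, padding=1)
up = nn.ConvTranspose2d(64, 64, kernel_size=4, stride=2, padding=1)

y = down(x)
print(y.shape)        # torch.Size([1, 64, 64, 64])   -> down-sampled by 2
print(up(y).shape)    # torch.Size([1, 64, 128, 128]) -> resolution restored

# A pooling layer is an alternative way to obtain the same down-sampling.
pool = nn.MaxPool2d(kernel_size=2, stride=2)
print(pool(x).shape)  # torch.Size([1, 64, 64, 64])
```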
  • the encoder-decoder networks of the proposed CEDNet can adopt architectures of existing encoder-decoder networks, such as FCN, SegNet, and the like. Notably, there are some modified tricks whose effectiveness has been demonstrated by FCN, SegNet and other encoder-decoder based solutions. These tricks may also be integrated in these encoder-decoder networks of the proposed CEDNet. It may also be appreciated that configurations of different encoder-decoder networks of the proposed solution may be different to meet practical demands.
  • the first encoder-decoder network 430 is configured to extract features. As illustrated in Figure 4, the raw image 410 is input to the first encoder-decoder network 430 for feature extraction. Thus, feature maps 411 are output as the result of the network 430.
  • feature maps 411 are not generated for classifying objects of the raw image.
  • the feature maps 411 may be generated without dimensionality reduction.
  • the high dimensional feature maps 411 need not be fed to a classifier that classifies every pixel, as in a traditional convolutional encoder-decoder network.
  • the output 411 consists of a large number of feature maps, rather than the segmented results produced by traditional convolutional encoder-decoder networks.
  • the dimension of the feature maps 411 is much higher than that of the segmented results (which may be equal to the number of classes of objects in the raw image 410). As such, more representative features with more semantic information are extracted and retained in the feature maps 411. Meanwhile, through the network 430, redundant information is discarded, so the extracted features are more discriminative.
  • the feature maps 411 can be input to a subsequent encoder-decoder network 440 to get the output feature maps 412.
  • the network 440 may be configured to enhance the semantic level of the feature maps. As mentioned above, it is much easier to fit relationships between representative features than between raw images and segmentation results. Thereby, the feature maps 412 have a higher semantic level than the feature maps 411 after the multi-stage encoding-decoding of the network 440. More subsequent encoder-decoder networks may be configured to extract more semantic information, which benefits the segmentation.
  • the feature maps 412 are input to the final encoder-decoder network 450 followed by a 1x1 convolution block 460 for predicting the final segmentation results.
  • the encoder-decoder network 450 is configured to further enhance the semantic level of the feature maps while reducing their dimension to fit semantic segmentation. For example, if there are 20 classes of objects to be segmented in the raw image 410, 21 score maps (20 for the object classes and one for the background of the raw image) would be predicted by the 1x1 convolution block 460 using the output from the encoder-decoder network 450. The predicted segmentation results are output finally.
  • the output may be a processed image in which the objects of the raw image are segmented into different categories, visualized by different colors (e.g. as shown in 420).
  • the output image may share a same resolution with the input raw image.
  • the final layer 460 may be another kind of object classifier, configured to classify objects by using feature maps.
  • the 1x1 convolution block 460 may be omitted.
  • a final layer of the third encoder-decoder network 450 may be designed to act as an object classifier, classifying objects by using the feature maps output from the preceding layer.
  • By cascading several convolutional encoder-decoder networks, the proposed CEDNet is able to remove redundant information and extract more discriminative features. More high-level semantic information can be extracted by deeper encoder-decoder networks, so better segmentation results can be obtained. Besides, the proposed CEDNet allots the feature extraction and segmentation prediction goals to different encoder-decoder networks, which eases the semantic segmentation task and helps to break the performance limits of the traditional technique. A sketch of such a cascade is given below.
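  • The following is a hedged sketch of the cascaded arrangement of Figure 4, assuming three stages. The helper name encoder_decoder, the class name CEDNet, and all channel widths are illustrative assumptions; the first stage outputs high-dimensional feature maps rather than class scores, and only the final stage is followed by a 1x1 convolution classifier like block 460.

```python
# Hedged sketch of cascading N = 3 encoder-decoder stages (Figure 4).
import torch
import torch.nn as nn

def encoder_decoder(in_ch, out_ch):
    """One entire encoding-decoding procedure; the output resolution
    matches the input (down-sample by 4 overall, then up-sample by 4)."""
    return nn.Sequential(
        nn.Conv2d(in_ch, 64, 3, stride=2, padding=1), nn.ReLU(inplace=True),        # encode, 1/2
        nn.Conv2d(64, 128, 3, stride=2, padding=1), nn.ReLU(inplace=True),          # encode, 1/4
        nn.ConvTranspose2d(128, 64, 4, stride=2, padding=1), nn.ReLU(inplace=True), # decode, 1/2
        nn.ConvTranspose2d(64, out_ch, 4, stride=2, padding=1), nn.ReLU(inplace=True),  # decode, 1/1
    )

class CEDNet(nn.Module):
    def __init__(self, in_channels=3, feat_channels=64, num_classes=20):
        super().__init__()
        self.stage1 = encoder_decoder(in_channels, feat_channels)    # cf. network 430
        self.stage2 = encoder_decoder(feat_channels, feat_channels)  # cf. network 440
        self.stage3 = encoder_decoder(feat_channels, feat_channels)  # cf. network 450
        # cf. block 460: 1x1 convolution producing num_classes + 1 score maps
        # (one per object class plus one for the background).
        self.classifier = nn.Conv2d(feat_channels, num_classes + 1, kernel_size=1)

    def forward(self, x):
        f1 = self.stage1(x)   # feature maps 411: no classification here
        f2 = self.stage2(f1)  # feature maps 412: higher semantic level
        f3 = self.stage3(f2)
        return self.classifier(f3)  # per-pixel class scores, input resolution

# Example: logits = CEDNet()(torch.randn(1, 3, 224, 224))  # -> [1, 21, 224, 224]
```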
  • a CEDNet may contain two encoder-decoder networks.
  • the second encoder-decoder network may be configured to enhance the semantic level of the feature maps while reducing their dimension to fit semantic segmentation.
  • the CEDNet may contain more than three encoder-decoder networks. These encoder-decoder networks are cascaded one by one to enhance the semantic level of the feature maps. From a practical perspective, however, a CEDNet needs to be efficient in terms of both memory and computation time, which argues for limiting the number of cascaded encoder-decoder networks.
  • a convolutional neural network includes a plurality of cascaded encoder-decoder networks.
  • Each encoder-decoder network includes a plurality of convolutional layers.
  • Each convolutional layer includes a convolution filter or a deconvolution filter. All these filters can be learned to obtain optimal semantic segmentation performance. For example, once the cascaded architecture of the convolutional neural network is established, parameters of the convolution filters and deconvolution filters can be configured and optimized through a training procedure.
  • the cascaded encoder-decoder network can share similar training procedures as traditional convolutional neural networks for segmentation.
  • a convolutional neural network according to the disclosed technique can be trained through a training procedure 500.
  • architecture of a cascaded encoder-decoder network is designed at block 510.
  • the designing involves configurations of each encoder-decoder network and the loss function of the cascaded encoder-decoder network.
  • VGG-Net, ResNet, DenseNet, or other architectures of encoder-decoder networks can be applied.
  • the decoding path may have an architecture symmetrical with that of the encoding path.
  • the parameters of convolutional filters and deconvolution filters in each encoder-decoder network are initialized, as shown at block 520.
  • the parameters can be initialized by the corresponding pre-trained parameters.
  • the parameters can be initialized randomly.
  • the parameters in the decoding path can be initialized by a bilinear method, as sketched below.
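  • The bilinear initialization mentioned above can be implemented by filling each transposed-convolution kernel with bilinear up-sampling weights, as in the well-known FCN recipe. The function name bilinear_kernel is hypothetical, and this is a sketch rather than the patent's prescribed procedure.

```python
import torch
import torch.nn as nn

def bilinear_kernel(channels: int, kernel_size: int) -> torch.Tensor:
    """Build weights that make a stride-2 transposed convolution perform
    bilinear up-sampling (one kernel per channel, no cross-channel mixing)."""
    factor = (kernel_size + 1) // 2
    center = factor - 1 if kernel_size % 2 == 1 else factor - 0.5
    coords = torch.arange(kernel_size, dtype=torch.float32)
    filt_1d = 1 - torch.abs(coords - center) / factor
    filt_2d = filt_1d[:, None] * filt_1d[None, :]
    weight = torch.zeros(channels, channels, kernel_size, kernel_size)
    for c in range(channels):
        weight[c, c] = filt_2d  # identity mapping across channels
    return weight

# Initialize a decoding-path deconvolution layer with bilinear weights.
deconv = nn.ConvTranspose2d(64, 64, kernel_size=4, stride=2, padding=1, bias=False)
with torch.no_grad():
    deconv.weight.copy_(bilinear_kernel(64, 4))
```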
  • the procedure proceeds to prepare a set of training images and their ground truth, which may be manually segmented results.
  • the cascaded encoder-decoder network may be trained by a forward propagation and a backward propagation at block 540.
  • the parameters may be optimized by a Stochastic Gradient Descent (SGD) algorithm iteratively until convergence, as in the training-step sketch below.
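  • A hedged sketch of one forward-backward training step at block 540, assuming the CEDNet sketch above, per-pixel cross-entropy against manually segmented ground truth, and the SGD optimizer; the learning-rate and momentum values and the helper name train_step are illustrative assumptions.

```python
import torch
import torch.nn as nn

model = CEDNet()  # the sketch defined earlier; any cascaded network works
criterion = nn.CrossEntropyLoss()  # per-pixel classification loss
optimizer = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9)

def train_step(images, labels):
    """images: [N, 3, H, W]; labels: [N, H, W] integer class ids (ground truth)."""
    optimizer.zero_grad()
    logits = model(images)            # forward propagation
    loss = criterion(logits, labels)  # compare against manual ground truth
    loss.backward()                   # backward propagation
    optimizer.step()                  # SGD parameter update
    return loss.item()

# Iterate train_step over the training set until the loss converges.
```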
  • the cascaded encoder-decoder network could be applied for semantic segmentation.
  • scene semantic segmentation may be implemented with the trained cascaded encoder-decoder network.
  • the task of scene semantic segmentation is to segment the objects in the scene images into different categories which are visualized by different colors.
  • forward computations would be conducted from the first encoder-decoder network to the final encoder-decoder network.
  • the predicted segmentation results are output finally.
  • the output images may generally share the same resolution as the input images.
  • the scene semantic segmentation with a cascaded encoder-decoder network may be widely used in practice, such as Advanced Driver Assistance Systems (ADAS) , autonomous vehicles, and the like.
  • FIG. 6 shows results of the present invention as compared with an existing leading technique named SegNet.
  • the CamVid road scene dataset (G. Brostow, J. Fauqueur, and R. Cipolla, “Semantic object classes in video: A high-definition ground truth database,” PRL, vol. 30(2), pp. 88–97, 2009) is used to evaluate these techniques.
  • the challenge is to segment the road scenes containing 11 classes, such as road, building, cars, pedestrians, signs, poles, side-walk, etc.
  • Although the cascaded encoder-decoder network is used here for scene semantic segmentation, it has wide applications in practice; in theory, it can be used for segmenting objects in any kind of raw data material into different categories.
  • the raw data material is not limited to digital images.
  • the raw data material may be textual data, and the cascaded encoder-decoder network may be used to learn representative features from the raw textual data, so as to facilitate a textual translation from one language to another language.
  • the raw data material may be audio data, and the cascaded encoder-decoder network may be used to learn representative features from the raw audio data, so as to facilitate speech recognition.
  • Figure 7 is a schematic illustration of a method for semantic segmentation with a convolutional neural network according to at least some embodiments of the present invention.
  • In procedure 710, a plurality of layers of convolution are performed in series on a raw data material via a first encoder-decoder network, to generate a first set of feature maps.
  • the raw data material may be a digital image, digital video, a piece of text in one language, audio data, or the like.
  • the first set of feature maps may be generated without dimensionality reduction.
  • up-sampling may be performed so that a resolution of the first set of feature maps is the same as a resolution of the raw data material.
  • a plurality of layers of convolution are performed in series on the first set of feature maps via a group of middle encoder-decoder networks cascaded with the first encoder-decoder network, to generate a second set of feature maps for classifying objects of the raw data material.
  • the group of middle encoder-decoder networks may comprise one encoder-decoder network or at least two encoder-decoder networks cascaded one-by-one.
  • objects of the raw data material may be classified by using the second set of feature maps, via a final encoder-decoder network of the group of middle encoder-decoder networks.
  • objects of the raw data material may be classified by using the second set of feature maps, via an object classifier coupled with a final encoder-decoder network of the group of middle encoder-decoder networks.
  • the object classifier may be a one-by-one convolution classifier.
  • the object classifier reduces a dimension of the second set of feature maps to a dimension of the objects of the raw data material. In any case, the dimension of the feature maps may be reduced to approach the dimension of the objects of the raw data material, via the final encoder-decoder network, as in the sketch below.
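  • A minimal sketch of this dimension reduction, assuming illustrative shapes: the one-by-one convolution reduces the channel dimension of the second set of feature maps to the number of object classes, and a per-pixel argmax then yields the segmentation labels.

```python
import torch
import torch.nn as nn

second_feature_maps = torch.randn(1, 64, 224, 224)  # output of the final stage

# One-by-one convolution classifier: reduces the channel dimension from the
# feature-map dimension (64 here) to the number of object classes (e.g. 12).
classifier = nn.Conv2d(64, 12, kernel_size=1)
scores = classifier(second_feature_maps)  # [1, 12, 224, 224]

# Per-pixel label map: the class with the highest score at each pixel.
labels = scores.argmax(dim=1)             # [1, 224, 224]
```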
  • a feature map may be down-sampled through an encoding path, and a feature map may be up-sampled through a decoding path.
  • the architecture of the decoding path can be symmetrical with that of the encoding path.
  • Figure 7 illustrates a flowchart of an apparatus, method, and computer program product according to example embodiments of the invention for semantic segmentation with a convolutional neural network.
  • each block of the flowchart, and combinations of blocks in the flowchart may be implemented by various means, such as hardware, firmware, processor, circuitry, and/or other devices associated with execution of software including one or more computer program instructions.
  • one or more of the procedures described above may be embodied by semantic segmentation circuitry 210 of the apparatus 200, in conjunction with the processor 202 and the memory 204.
  • one or more of the procedures described above may be embodied by computer program instructions.
  • the computer program instructions which embody the procedures described above may be stored by a memory device of an apparatus employing an embodiment of the present invention and executed by a processor of the apparatus.
  • any such computer program instructions may be loaded onto a computer or other programmable apparatus (e.g., hardware) to produce a machine, such that the resulting computer or other programmable apparatus implements the functions specified in the flowchart blocks.
  • These computer program instructions may also be stored in a computer-readable memory that may direct a computer or other programmable apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture, the execution of which implements the function specified in the flowchart blocks.
  • the computer program instructions may also be loaded onto a computer or other programmable apparatus to cause a series of operations to be performed on the computer or other programmable apparatus to produce a computer-implemented process such that the instructions which execute on the computer or other programmable apparatus provide operations for implementing the functions specified in the flowchart blocks.
  • blocks of the flowchart support combinations of means for performing the specified functions and combinations of operations for performing the specified functions. It will also be understood that one or more blocks of the flowchart, and combinations of blocks in the flowchart, can be implemented by special purpose hardware-based computer systems which perform the specified functions, or combinations of special purpose hardware and computer instructions.
  • certain ones of the operations above may be modified or further amplified. Furthermore, in some embodiments, additional optional operations may be included. Modifications, additions, or amplifications to the operations above may be performed in any order and in any combination.

Abstract

Systems, methods and apparatus are provided for semantic segmentation with convolutional neural networks. A system comprises a first encoder-decoder network, which is configured to perform a plurality of layers of convolution on a raw data material to generate a first set of feature maps. The system further comprises a group of middle encoder-decoder networks cascaded with the first encoder-decoder network. The group of middle encoder-decoder networks is configured to perform a plurality of layers of convolution on the first set of feature maps to generate a second set of feature maps for classifying objects of the raw data material.

Description

APPARATUS AND METHOD FOR SEMANTIC SEGMENTATION WITH CONVOLUTIONAL NEURAL NETWORK

FIELD OF THE INVENTION
The present invention generally relates to information process technologies. More specifically, the invention relates to systems, apparatus, and methods for semantic segmentation with convolutional neural network.
BACKGROUND
As a key task of Artificial Intelligence (AI), semantic segmentation is widely explored nowadays. Because it provides a deep understanding of visual scenes, semantic segmentation has a wide array of applications comprising, for example, object detection, semantic parsing, scene understanding, human-machine interaction (HMI), visual surveillance, Advanced Driver Assistance Systems (ADAS), and so on. In particular, it plays a central role in the popular field of autonomous driving for semantic analysis and decision-making.
To segment all the object regions in one image into different categories, a prediction at every pixel is needed. As an instance of deep learning, convolutional neural networks have achieved great success in semantic segmentation. Some instances of convolutional neural networks, such as FCN (Fully Convolutional Networks) of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR) 2015, SegNet of IEEE Transactions on Pattern Analysis and Machine Intelligence (PAMI) 2017, and their variations, improve the performance of segmentation significantly compared with traditional techniques.
Fig. 1 shows an exemplary application scenario of semantic segmentation with convolutional neural networks. The application is autonomous driving. A car is running on a road, as shown in an image 110. A camera 112 installed on the car can capture photos and/or videos of the road scene in real time while the car is running. For example, an image 120 is one photo captured by the camera 112 at an instant. Through semantic segmentation of the image 120, objects in the image 120 can be classified into 12 classes of interest (see legends in Fig. 1), as shown in image 130. Then, road scene understanding applications may have the ability to model appearances (e.g. road, building) and shapes (e.g. cars, pedestrians) and to understand the spatial relationship (or context) between different classes, such as road and side-walk, so as to support autonomous driving.
However, real-world scenes are always complex and involve many objects and various categories, which makes it difficult for machines to directly fit the non-linear relationship and understand the complex scenes. For traditional convolutional neural networks, the semantic information cannot be extracted adequately, which causes inaccurate segmentation of objects. Therefore, it is important to improve the accuracy of semantic segmentation.
SOME EXAMPLE EMBODIMENTS
To overcome the problem described above, and to overcome the limitations that will be apparent upon reading and understanding the prior art, the disclosure provides an approach for effective semantic segmentation.
According to a first aspect of the present invention, a system is provided for semantic segmentation with convolutional neural networks. The system comprises a first encoder-decoder network, which is configured to perform a plurality of layers of convolution on a raw data material to generate a first set of feature maps. The system further comprises a group of middle encoder-decoder networks cascaded with the first encoder-decoder network. The group of middle encoder-decoder networks is configured to perform a plurality of layers of convolution on the first set of feature maps to generate a second set of feature maps for classifying objects of the raw data material.
In an exemplary embodiment, the group of middle encoder-decoder networks may comprise one encoder-decoder network or at least two encoder-decoder networks cascaded one-by-one.
In an exemplary embodiment, a final encoder-decoder network of the group of middle encoder-decoder networks may be configured to classify objects of the raw data material using the second set of feature maps.
In an exemplary embodiment, the system may further comprise an object classifier coupled with a final encoder-decoder network of the group of middle encoder-decoder networks. The object classifier is configured to classify objects of the raw data material using the second set of feature maps. The object classifier may be a one-by-one convolution classifier, which is configured to reduce a dimension of the second set of feature maps to a dimension of the objects of the raw data material.
In an exemplary embodiment, the final encoder-decoder network may be configured to reduce a dimension of feature maps to approach a dimension of the objects of the raw data material.
In an exemplary embodiment, the first encoder-decoder network may be further configured to generate the first set of feature maps without dimensionality reduction.
In an exemplary embodiment, the first encoder-decoder network is configured to perform up-sampling so that a resolution of the first set of feature maps is the same as a resolution of the raw data material.
In an exemplary embodiment, the raw data material is raw data of an image.
In an exemplary embodiment, at least one of the first encoder-decoder network and the middle encoder-decoder networks comprises an encoding path and a decoding path, wherein an architecture of the decoding path is symmetrical with that of the encoding path.
According to a second aspect of the present invention, a method is provided for semantic segmentation with a convolutional neural network. The method comprises performing a plurality of layers of convolution in series on a raw data material via a first encoder-decoder network, to generate a first set of feature maps. The method further comprises performing a plurality of layers of convolution in series on the first set of feature maps via a group of middle encoder-decoder networks cascaded with the first encoder-decoder network, to generate a second set of feature maps for classifying objects of the raw data material. The first set of feature maps may be generated without dimensionality reduction.
In an exemplary embodiment, the group of middle encoder-decoder networks comprises one encoder-decoder network or at least two encoder-decoder networks cascaded one-by-one.
In an exemplary embodiment, the method further comprises classifying objects of the raw data material using the second set of feature maps, via a final encoder-decoder network of the group of middle encoder-decoder networks.
In another exemplary embodiment, the method further comprises classifying objects of the raw data material using the second set of feature maps via an object classifier coupled with a final encoder-decoder network of the group of middle encoder-decoder networks.
In an exemplary embodiment, the object classifier is a one-by-one convolution classifier. The objects are classified by reducing a dimension of the second set of feature maps to a dimension of the objects of the raw data material.
In an exemplary embodiment, the method further comprises reducing a dimension of feature maps to approach a dimension of the objects of the raw data material, via the final encoder-decoder network.
In an exemplary embodiment, a plurality of layers of convolution in at least one of the first encoder-decoder network and the middle encoder-decoder networks are performed by performing down-sampling in an encoding path, and performing up-sampling in a decoding path. The architecture of the decoding path may be symmetrical with that of the encoding path.
In an exemplary embodiment, the up-sampling is performed so that a resolution of the feature maps is the same as a resolution of the raw data material.
According to a third aspect of the present invention, an apparatus comprises at least one processor, and at least one memory including computer program code, the at least one memory and the computer program code configured to, with the at least one processor, cause, at least in part, the apparatus to perform one of the methods discussed above.
According to a fourth aspect of the present invention, a computer-readable storage medium carrying one or more sequences of one or more instructions which, when executed by one or more processors, cause, at least in part, an apparatus to perform one of the methods discussed above.
According to a fifth aspect of the present invention, an apparatus comprises means for performing one of the methods discussed above.
According to a sixth aspect of the present invention, a computer program product including one or more sequences of one or more instructions which, when executed by one or more processors, cause an apparatus to at least perform one of the methods discussed above.
Still other aspects, features, and advantages of the invention are readily apparent from the following detailed description, simply by illustrating a number of particular embodiments and implementations, including the best mode contemplated for carrying out the invention. The invention is also capable of other and different embodiments, and its several details may be modified in various obvious respects, all without departing from the spirit and scope of the invention. Accordingly, the drawings and description are to be regarded as illustrative in nature, and not as restrictive.
BRIEF DESCRIPTION OF THE DRAWINGS
The embodiments of the invention are illustrated by way of example, and not by way of limitation, in the figures of the accompanying drawings:
Figure 1 illustrates an exemplary application scenario of semantic segmentation with convolutional neural networks;
Figure 2 is a block diagram of an apparatus that may be specifically configured in accordance with an example embodiment of the present invention;
Figure 3 illustrates an exemplary architecture of a single encoder-decoder network for semantic segmentation;
Figure 4 illustrates an exemplary convolutional neural network based on cascaded encoder-decoder architecture according to at least some embodiments of the present invention;
Figure 5 is a flow chart of an exemplary training process for a convolutional neural network according to at least some embodiments of the present invention;
Figure 6 is a graph showing results of the present invention as compared with an existing leading technique; and
Figure 7 is a flow chart of an exemplary process for semantic segmentation according to at least some embodiments of the present invention.
DETAILED DESCRIPTION OF THE INVENTION
Some embodiments of the present invention will now be described more fully hereinafter with reference to the accompanying drawings, in which some, but not all, embodiments of the invention are shown. Indeed, various embodiments of the invention may be embodied in many different forms and should not be construed as limited to the embodiments set forth herein; rather, these embodiments are provided so that this disclosure will satisfy applicable legal requirements. It is apparent to one skilled in the art that the embodiments of the invention may be practiced without these specific details or with an equivalent arrangement. In other instances, well-known structures and devices are shown in block diagram form in order to avoid unnecessarily obscuring the embodiments of the invention. Like reference numerals refer to like elements throughout.
As used herein, the term “coupled” along with its derivatives, may be used. “Coupled” is used to indicate that two or more elements co-operate or interact with each other, but they may or may not have intervening physical or electrical components between them.
Additionally, as used herein, the term “circuitry” refers to (a) hardware-only circuit implementations (e.g., implementations in analog circuitry and/or digital circuitry) ; (b)  combinations of circuits and computer program product (s) comprising software and/or firmware instructions stored on one or more computer readable memories that work together to cause an apparatus to perform one or more functions described herein; and (c) circuits, such as, for example, a microprocessor (s) or a portion of a microprocessor (s) , that require software or firmware for operation even if the software or firmware is not physically present. This definition of “circuitry” applies to all uses of this term herein, including in any claims. As a further example, as used herein, the term “circuitry” also includes an implementation comprising one or more processors and/or portion (s) thereof and accompanying software and/or firmware. As another example, the term “circuitry” as used herein also includes, for example, a baseband integrated circuit or applications processor integrated circuit for a mobile phone or a similar integrated circuit in a server, a cellular network device, other network device, and/or other computing device.
As defined herein, a “computer-readable storage medium” , which refers to a non-transitory physical storage medium (e.g., volatile or non-volatile memory device) , can be differentiated from a “computer-readable transmission medium” , which refers to an electromagnetic signal.
Method, apparatuses, and computer program products are provided in accordance with example embodiments of the present invention to provide effective semantic segmentation with convolutional neural network based on encoder-decoder architecture.
Referring to Figure 2, an apparatus 200 for semantic segmentation with convolutional neural network in accordance with an example embodiment may include or otherwise be in communication with one or more of at least one processor 202, at least one memory 204, at least one communication interface 206, at least one input/output interface 208, and a semantic segmentation module 210.
It should be noted that while Figure 2 illustrates one exemplary configuration of an apparatus 200 for semantic segmentation with convolutional neural network, numerous other configurations may also be used to implement other embodiments of the present invention. As such, in some embodiments, although devices or elements are shown as being in communication with each other, hereinafter such devices or elements should be considered  to be capable of being embodied within the same device or element and thus, devices or elements shown in communication should be understood to alternatively be portions of the same device or element.
In some embodiments, the processor 202 (and/or co-processors or any other processing circuitry assisting or otherwise associated with the processor) may be in communication with the memory 204 via a bus for passing information among components of the apparatus. The memory 204 may include, for example, a non-transitory memory, such as one or more volatile and/or non-volatile memories. In other words, for example, the memory 204 may be an electronic storage device (e.g., a computer readable storage medium) comprising gates configured to store data (e.g., bits) that may be retrievable by a machine (e.g., a computing device like the processor) . The memory 204 may be configured to store information, data, content, applications, instructions, or the like for enabling the apparatus to carry out various functions in accordance with an example embodiment of the present invention. For example, the memory 204 could be configured to buffer input data for processing by the processor 202. Additionally or alternatively, the memory 204 could be configured to store instructions for execution by the processor.
In some embodiments, the apparatus 200 may be embodied as a chip or chip set. In other words, the apparatus 200 may comprise one or more physical packages (e.g., chips) including materials, components and/or wires on a structural assembly (e.g., a baseboard) . The structural assembly may provide physical strength, conservation of size, and/or limitation of electrical interaction for component circuitry included thereon. The apparatus 200 may therefore, in some cases, be configured to implement an embodiment of the present invention on a single chip or as a single “system on a chip. ” As such, in some cases, a chip or chipset may constitute means for performing one or more operations for providing the functionalities described herein.
The processor 202 may be embodied in a number of different ways. For example, the processor 202 may be embodied as one or more of various hardware processing means such as a coprocessor, a microprocessor, a controller, a digital signal processor (DSP) , a processing element with or without an accompanying DSP, or various other processing  circuitry including integrated circuits such as, for example, an ASIC (application specific integrated circuit) , an FPGA (field programmable gate array) , a microcontroller unit (MCU) , a hardware accelerator, a special-purpose computer chip, or the like. As such, in some embodiments, the processor 202 may include one or more processing cores configured to perform independently. A multi-core processor may enable multiprocessing within a single physical package. Additionally or alternatively, the processor 202 may include one or more processors configured in tandem via the bus to enable independent execution of instructions, pipelining and/or multithreading.
In an example embodiment, the processor 202 may be configured to execute instructions stored in the memory 204 or otherwise accessible to the processor 202. Alternatively or additionally, the processor 202 may be configured to execute hard coded functionality. As such, whether configured by hardware or software methods, or by a combination thereof, the processor 202 may represent an entity (e.g., physically embodied in circuitry) capable of performing operations according to an embodiment of the present invention while configured accordingly. Thus, for example, when the processor 202 is embodied as an ASIC, FPGA, or the like, the processor may be specifically configured hardware for conducting the operations described herein. Alternatively, as another example, when the processor 202 is embodied as an executor of software instructions, the instructions may specifically configure the processor to perform the algorithms and/or operations described herein when the instructions are executed. However, in some cases, the processor 202 may be a processor of a specific device configured to employ an embodiment of the present invention by further configuration of the processor by instructions for performing the algorithms and/or operations described herein. The processor 202 may include, among other things, a clock, an arithmetic logic unit (ALU) , and logic gates configured to support operation of the processor.
The communication interface 206 may be any means such as a device or circuitry embodied in either hardware or a combination of hardware and software that is configured to receive and/or transmit data from/to a network and/or any other device or module in communication with the apparatus 200. In this regard, the communication interface may include, for example, an antenna (or multiple antennas) and supporting hardware and/or software for enabling communications with a wireless communication network. Additionally or alternatively, the communication interface may include the circuitry for interacting with the antenna(s) to cause transmission of signals via the antenna(s) or to handle receipt of signals received via the antenna(s). In some environments, the communication interface may alternatively or also support wired communication. As such, for example, the communication interface may include a communication modem and/or other hardware/software for supporting communication via cable, digital subscriber line (DSL), universal serial bus (USB), Ethernet, High-Definition Multimedia Interface (HDMI), or other mechanisms. Furthermore, the communication interface 206 may include hardware and/or software for supporting communication mechanisms such as Infrared, UWB, WiFi, and/or the like.
The apparatus 200 may include an input/output interface 208 that may, in turn, be in communication with the processor 202 to receive input from and to provide output to a user. For example, the input/output interface may include a display and, in some embodiments, may also include a keyboard, a mouse, a joystick, a touch screen, touch areas, soft keys, a microphone, a speaker, or other input/output mechanisms. The processor 202 may comprise user interface circuitry configured to control at least some functions of one or more input/output interface elements such as a display and, in some embodiments, a speaker, ringer, microphone, and/or the like. The processor and/or user interface circuitry comprising the processor may be configured to control one or more functions of one or more input/output interface elements through computer program instructions (e.g., software and/or firmware) stored on a memory accessible to the processor (e.g., memory 204, and/or the like).
The apparatus 200 may include semantic segmentation circuitry 210, which may be configured to receive a raw data material, such as digital image data, video data, textual data, audio data and/or the like, to perform a plurality of layers of convolution on the raw data material via more than one cascaded encoder-decoder network, to generate a large number of feature maps from the raw data material, and to classify objects of the raw data material by using the feature maps. The semantic segmentation circuitry 210 may be implemented using hardware components of the apparatus 200 configured by either hardware or software for implementing these features. For example, the semantic segmentation circuitry 210 may utilize processing circuitry, such as the processor 202 and the memory 204, to perform such operations.
Convolutional neural networks for segmentation are generally designed around an encoder-decoder architecture. The encoder performs convolution, while the decoder is responsible for deconvolution and un-pooling/up-sampling to predict pixel-wise class labels. Many convolutional neural networks for segmentation have been explored based on this architecture, such as FCN, U-Net, Deconvolutional Networks, SegNet, as well as their modified variations. All of these existing convolutional neural networks adopt only one encoding-decoding procedure to obtain the final segmentation results. In this disclosure, these techniques are referred to as single encoder-decoder network solutions, to differentiate them from the cascaded encoder-decoder architecture introduced later.
Figure 3 illustrates an exemplary architecture of a single encoder-decoder network for semantic segmentation. As shown in Figure 3, a convolutional neural network 300 is configured to provide semantic segmentation for images based on the architecture of a single encoder-decoder network. The convolutional neural network 300 receives an image input 310 and outputs a corresponding image output 320 with pixel-wise semantic segmentation. It may be implemented in the semantic segmentation circuitry 210. The convolutional neural network 300 may be divided into two parts: an encoder network 340 and a decoder network 360. The image input 310 is fed into the encoder network 340. The encoder network corresponds to a feature extractor that transforms the input image into a multidimensional feature representation, whereas the decoder network is a shape generator that produces the object segmentation from the features extracted by the encoder network. Each layer in the encoder network may have a corresponding layer in the decoder network. The final output of the network is a probability map of the same size as the input image, indicating the probability that each pixel belongs to one of the predefined classes.
The encoder network 340 includes numerous convolutional layers (341, 342, …, 351), which perform multiple series of convolutions with a set of filter banks to yield feature maps. For example, the convolution of convolutional layers 341, 342, 344, 345, 347, 348, and 351 may be performed with a stride of one, while the convolution of convolutional layers 343, 346, and 349 may be performed with a stride of two. In this way, the resultant image is down-sampled. Alternatively, the down-sampling may be provided by a pooling layer.
In contrast to the encoder network 340, the decoder network 360 performs multiple series of convolutions, up-sampling or un-pooling on the input feature map(s) produced by the encoder network. This procedure produces sparse feature map(s). The decoder network 360 includes numerous de-convolutional layers and convolutional layers. For example, the de-convolution of de-convolutional layers 361, 363, and 365 may be performed with a stride of two for up-sampling, while the convolution of convolutional layers 362, 364, and 366 may be performed with a stride of one. Finally, the high-dimensional feature map representation may be fed to a soft-max classifier 367 that classifies every pixel independently. The classifier 367 may be a 1x1 convolution layer. It outputs a certain number of channels of image probabilities, where the number of channels may be equal to the number of classes of objects in the image.
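For illustration only, the following PyTorch-style sketch shows how a single encoder-decoder network of this kind could be assembled. It is a minimal sketch under assumed layer counts, channel widths and class count, and does not reproduce the exact configuration of Figure 3.

    import torch.nn as nn

    class SingleEncoderDecoder(nn.Module):
        # Minimal single encoder-decoder segmentation network (illustrative).
        def __init__(self, in_channels=3, num_classes=12):
            super().__init__()
            # Encoding path: stride-1 convolutions interleaved with
            # stride-2 convolutions that down-sample the feature maps.
            self.encoder = nn.Sequential(
                nn.Conv2d(in_channels, 64, 3, stride=1, padding=1), nn.ReLU(inplace=True),
                nn.Conv2d(64, 64, 3, stride=2, padding=1), nn.ReLU(inplace=True),    # 1/2 resolution
                nn.Conv2d(64, 128, 3, stride=1, padding=1), nn.ReLU(inplace=True),
                nn.Conv2d(128, 128, 3, stride=2, padding=1), nn.ReLU(inplace=True))  # 1/4 resolution
            # Decoding path: stride-2 deconvolutions (transposed convolutions)
            # restore the input resolution, mirroring the encoding path.
            self.decoder = nn.Sequential(
                nn.ConvTranspose2d(128, 128, 4, stride=2, padding=1), nn.ReLU(inplace=True),  # 1/2
                nn.Conv2d(128, 64, 3, stride=1, padding=1), nn.ReLU(inplace=True),
                nn.ConvTranspose2d(64, 64, 4, stride=2, padding=1), nn.ReLU(inplace=True),    # 1/1
                nn.Conv2d(64, 64, 3, stride=1, padding=1), nn.ReLU(inplace=True))
            # 1x1 convolution as the pixel-wise classifier; a soft-max over
            # the channel dimension yields per-pixel class probabilities.
            self.classifier = nn.Conv2d(64, num_classes, kernel_size=1)

        def forward(self, x):
            # Returns one score channel per class; apply softmax(dim=1)
            # to obtain the probability map described above.
            return self.classifier(self.decoder(self.encoder(x)))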
Although single encoder-decoder network solutions have achieved great success, they suffer from several problems which limit their semantic segmentation performance. In this regard, for a traditional single encoder-decoder network, the predicted segmentation result is directly produced by only one encoding-decoding procedure with supervised learning. However, scenes in the real world always involve many objects and various categories, which makes it difficult for machines to directly fit the non-linear relationships and understand complex scenes. If the encoding-decoding procedure is too weak to learn representative features from coarse to fine, the segmentation performance of the single encoder-decoder network is limited.
In summary, the problems of traditional single encoder-decoder networks lie in the following two aspects. Firstly, the features learned by a single encoder-decoder network are weak and not robust. With only one encoding-decoding procedure, learning discriminative features for different objects is difficult. That is, the semantic information cannot be extracted adequately, causing inaccurate segmentation of objects. Secondly, although there is a multi-layer convolution and a multi-layer deconvolution procedure in the single encoder-decoder network, it is difficult to learn discriminative features and perform complex segmentation prediction simultaneously within one encoder-decoder architecture. This limits the segmentation performance.
This disclosure proposes a solution for solving the problems of traditional convolutional neural networks for semantic segmentation. In the proposed solution, a novel architecture of convolutional neural network is configured to extract discriminative features and perform segmentation by conducting several encoding-decoding procedures in succession. In this disclosure, the proposed solution may be called a Cascaded Encoder-Decoder network (CEDNet). In an embodiment, different encoder-decoder networks are cascaded for joint semantic segmentation. Through the former encoder-decoder networks, the input images may be mapped into discriminative features while redundant information is removed. It is apparent that more representative features with more semantic information will ease the complex segmentation: it is much easier to fit relationships from representative features than to fit relationships between raw images and segmentation results. Besides, the complex scene-understanding procedure can be decomposed into several hierarchical approximations by the different encoder-decoder networks. Thus, the problems analyzed above can be solved effectively through the proposed solution.
The proposed CEDNet densely cascades several encoder-decoder networks jointly for semantic segmentation. The architecture of the proposed CEDNet includes N encoder-decoder networks, where N can in principle be any natural number larger than 1. Figure 4 illustrates an exemplary convolutional neural network based on the cascaded encoder-decoder architecture according to at least some embodiments of the present invention, where the number of encoder-decoder networks is N=3. It may be implemented in the semantic segmentation circuitry 210, in conjunction with the processor 202 and the memory 204. As shown in Figure 4, the convolutional neural network 400 includes three encoder-decoder networks, denoted as 430, 440 and 450, respectively. Each encoder-decoder network may contain one entire encoding-decoding procedure, with several convolution layers in the encoding procedure (also called the downsampling procedure) and several convolution layers in the decoding procedure (also called the upsampling procedure). It can be appreciated that the configuration of each encoder-decoder network can be adjusted to meet practical demands. These aspects are described in more detail below.
Design the architecture of each encoder-decoder network
As described above, the encoding-decoding procedure of each encoder-decoder network can be divided into an encoder network, which performs an encoding procedure and thus constructs an encoding path, and a decoder network, which performs a decoding procedure and thus constructs a decoding path. In this way, each encoder-decoder network may consist of one encoding path and one decoding path. Moreover, each encoder-decoder network may include an entire encoder-decoder architecture. In some embodiments, existing single encoder-decoder networks, such as FCN, SegNet and other existing encoder-decoder networks designed for semantic segmentation, may be viewed as entire encoder-decoder architectures and embedded into the proposed CEDNet.
In an embodiment, each encoder-decoder network can take a similar architecture to that illustrated in Figure 3. As illustrated in Figure 4, there are seven convolution blocks (431, 432, …, 437) in each encoder-decoder network. Each convolution block comprises at least one convolution layer for performing convolution with a set of filter banks to produce feature maps. For example, the first convolution block 431 may receive an input image 410 and convolve it with three convolution layers. The produced feature maps are fed into the second block 432 for further convolution. The second convolution block 432 may receive the feature maps from the block 431 and convolve them with three convolution layers. Then, the feature maps produced by the second block 432 are fed into the third block 433 for further convolution.
In each convolution block, the feature maps within the block may share the same resolution. For example, the feature maps processed in the convolution layers of block 431 can share a resolution identical to that of the input image 410. In an example, the convolution layers of block 431 may perform the convolution with a stride of one, except for the final layer of the block 431. At the final layer of block 431, the convolution can be performed with a stride of two. In this way, the resultant feature maps are down-sampled. Alternatively, the down-sampling may be provided by a pooling layer.
The number of layers in each block can be either the same or different. In some embodiments, each layer in the decoding path can have a corresponding layer in the encoding path. In this regard, the feature maps processed in blocks of the encoding path can share the same resolution as the feature maps processed in the corresponding blocks of the decoding path. For example, the convolution block 431 may correspond to the convolution block 437, and block 432 may correspond to block 436. In corresponding blocks, the number of feature maps with a given resolution may always be the same. For example, in the encoding path, the final convolution layer of each block (i.e. blocks 431, 432, 433) may perform a convolution with stride 2 for downsampling the input feature maps. Alternatively, this downsampling may be implemented by pooling layers. In the decoding path, the first convolution layer of some blocks (e.g. 435, 436, 437) may perform a deconvolution with stride 2 for upsampling.
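As a non-authoritative sketch of this block design, the following helpers (in the same PyTorch style as above) build an encoding-path block whose final convolution uses stride 2 for down-sampling, and a decoding-path block whose first layer is a stride-2 deconvolution for up-sampling; the layer counts and channel widths are assumptions made for the example.

    import torch.nn as nn

    def encoder_block(in_ch, out_ch, num_layers=3):
        # Encoding-path block: stride-1 convolutions at one resolution,
        # ending with a stride-2 convolution that down-samples (a pooling
        # layer could be used instead).
        layers, ch = [], in_ch
        for _ in range(num_layers - 1):
            layers += [nn.Conv2d(ch, out_ch, 3, stride=1, padding=1), nn.ReLU(inplace=True)]
            ch = out_ch
        layers += [nn.Conv2d(ch, out_ch, 3, stride=2, padding=1), nn.ReLU(inplace=True)]
        return nn.Sequential(*layers)

    def decoder_block(in_ch, out_ch, num_layers=3):
        # Decoding-path block: a stride-2 deconvolution that up-samples,
        # followed by stride-1 convolutions, mirroring an encoder block.
        layers = [nn.ConvTranspose2d(in_ch, out_ch, 4, stride=2, padding=1), nn.ReLU(inplace=True)]
        for _ in range(num_layers - 1):
            layers += [nn.Conv2d(out_ch, out_ch, 3, stride=1, padding=1), nn.ReLU(inplace=True)]
        return nn.Sequential(*layers)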
As mentioned above, the encoder-decoder networks of the proposed CEDNet can adopt the architectures of existing encoder-decoder networks, such as FCN, SegNet, and the like. Notably, there are various modifications whose effectiveness has been demonstrated by FCN, SegNet and other encoder-decoder based solutions; these may also be integrated into the encoder-decoder networks of the proposed CEDNet. It may also be appreciated that the configurations of the different encoder-decoder networks of the proposed solution may differ to meet practical demands.
Cascade the encoder-decoder networks
After each encoder-decoder network has been designed, all the encoder-decoder networks are cascaded to compose the CEDNet. The first encoder-decoder network 430 is configured to extract features. As illustrated in Figure 4, the raw image 410 is input to the first encoder-decoder network 430 for feature extraction. Thus, feature maps 411 are output as the result of the network 430.
Different from a traditional convolutional encoder-decoder network, the feature maps 411 are not generated for classifying objects of the raw image. Thereby, the feature maps 411 may be generated without dimensionality reduction. For example, the high-dimensional feature map representation 411 need not be fed to a classifier that classifies every pixel, as in a traditional convolutional encoder-decoder network. Hence, the output 411 consists of a large number of feature maps, instead of the segmented results of traditional convolutional encoder-decoder networks. The dimension of the feature maps 411 is much higher than that of the segmented results (which may be equal to the number of classes of objects in the raw image 410). As such, more representative features with more semantic information are extracted and retained in the feature maps 411. Meanwhile, through the network 430, redundant information is discarded, making the extracted features more discriminative.
The feature maps 411 can be input to a subsequent encoder-decoder network 440 to obtain the output feature maps 412. The network 440 may be configured to enhance the semantic level of the feature maps. As mentioned above, it is much easier to fit relationships from representative features than to fit relationships between raw images and segmentation results. Thereby, the feature maps 412 have a higher semantic level than the feature maps 411 after the multi-stage encoding-decoding of the network 440. More subsequent encoder-decoder networks may be configured to extract more semantic information, which benefits the segmentation.
Finally, the feature maps 412 are input to the final encoder-decoder network 450, followed by a 1x1 convolution block 460, for predicting the final segmentation results. The encoder-decoder network 450 is configured to further enhance the semantic level of the feature maps while reducing their dimension to fit the semantic segmentation. For example, if there are 20 classes of objects to be segmented in the raw image 410, 21 score maps (20 for the 20 classes of objects and one for the background of the raw image) would be predicted by the 1x1 convolution block 460 using the output from the encoder-decoder network 450. The predicted segmentation results are then output. The output may be a processed image in which the objects in the raw image are segmented into different categories, visualized by different colors (e.g. as shown in 420). The output image may share the same resolution as the input raw image.
In some embodiments, the final layer 460 may be another kind of object classifier, configured to classify objects by using the feature maps. In other embodiments, the 1x1 convolution block 460 may be omitted. In this regard, a final layer of the third encoder-decoder network 450 may be designed to act as an object classifier, classifying objects by using the feature maps output from the preceding layer.
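The cascade itself can be sketched as follows. This is a minimal illustrative composition, not the exact configuration of Figure 4: the stage depth and channel widths are assumptions, each stage is a feature-to-feature encoder-decoder that keeps the input resolution, and only the final 1x1 convolution reduces the dimension to the number of classes (e.g. 21 for 20 object classes plus background).

    import torch.nn as nn

    class EncoderDecoderStage(nn.Module):
        # One feature-to-feature encoder-decoder stage: it outputs
        # high-dimensional feature maps at the input resolution rather
        # than class scores.
        def __init__(self, in_ch, feat_ch=64):
            super().__init__()
            self.encode = nn.Sequential(
                nn.Conv2d(in_ch, feat_ch, 3, stride=1, padding=1), nn.ReLU(inplace=True),
                nn.Conv2d(feat_ch, feat_ch, 3, stride=2, padding=1), nn.ReLU(inplace=True))
            self.decode = nn.Sequential(
                nn.ConvTranspose2d(feat_ch, feat_ch, 4, stride=2, padding=1), nn.ReLU(inplace=True),
                nn.Conv2d(feat_ch, feat_ch, 3, stride=1, padding=1), nn.ReLU(inplace=True))

        def forward(self, x):
            return self.decode(self.encode(x))

    class CEDNet(nn.Module):
        # Cascade of N encoder-decoder stages followed by a 1x1
        # convolution that predicts per-pixel class scores.
        def __init__(self, in_channels=3, feat_ch=64, num_classes=21, num_stages=3):
            super().__init__()
            stages = [EncoderDecoderStage(in_channels, feat_ch)]
            stages += [EncoderDecoderStage(feat_ch, feat_ch) for _ in range(num_stages - 1)]
            self.stages = nn.Sequential(*stages)
            self.classifier = nn.Conv2d(feat_ch, num_classes, kernel_size=1)

        def forward(self, x):
            return self.classifier(self.stages(x))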
By cascading several convolutional encoder-decoder networks, the proposed CEDNet is able to remove redundant information and extract more discriminative features. More high-level semantic information can be extracted by the deeper encoder-decoder networks. As a result, better segmentation results can be obtained. Besides, the proposed CEDNet allots the feature extraction and segmentation prediction goals to different encoder-decoder networks, which eases the semantic segmentation task and helps to break the performance limits of the traditional technique.
Although there are three encoder-decoder networks in the exemplary CEDNet of Figure 4, in some embodiments a CEDNet may contain two encoder-decoder networks. For example, the second encoder-decoder network may be configured to enhance the semantic level of the feature maps while reducing their dimension to fit the semantic segmentation. In other examples, the CEDNet may contain more than three encoder-decoder networks, cascaded one by one to enhance the semantic level of the feature maps. From a practical perspective, however, the number of cascaded encoder-decoder networks should be limited so that the CEDNet remains efficient in terms of both memory and computation time.
As described above, a convolutional neural network according to the present invention includes a plurality of cascaded encoder-decoder networks. Each encoder-decoder network includes a plurality of convolutional layers, and each convolutional layer includes a convolution filter or a deconvolution filter. All these filters can be learned to obtain optimal performance for semantic segmentation. For example, once the cascaded architecture of the convolutional neural network is established, the parameters of the convolution filters and deconvolution filters can be configured and optimized through a training procedure.
The cascaded encoder-decoder network can share similar training procedures with traditional convolutional neural networks for segmentation. For example, with reference to Figure 5, a convolutional neural network according to the disclosed technique can be trained through a training procedure 500. Firstly, the architecture of a cascaded encoder-decoder network is designed at block 510. The design involves the configuration of each encoder-decoder network and the loss function of the cascaded encoder-decoder network. For example, in the encoding path, VGG-Net, ResNet, DenseNet or other architectures can be applied. The decoding path may have an architecture symmetrical to that of the encoding path.
Then, the parameters of the convolution filters and deconvolution filters in each encoder-decoder network are initialized, as shown at block 520. In some embodiments, if there are known pre-trained models (e.g. on ImageNet) for the architecture adopted for the encoding path, the parameters can be initialized with the corresponding pre-trained parameters. Alternatively, the parameters can be initialized randomly. The parameters in the decoding path can be initialized by a bilinear method.
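One common form of the bilinear method initializes each deconvolution filter with a bilinear up-sampling kernel, as popularized by FCN-style networks. The following sketch assumes square transposed-convolution filters with equal input and output channel counts; it illustrates the general idea rather than a configuration prescribed by this disclosure.

    import torch
    import torch.nn as nn

    def bilinear_kernel(channels, kernel_size):
        # Build a (channels, channels, k, k) weight tensor in which each
        # channel holds a 2-D bilinear interpolation kernel.
        factor = (kernel_size + 1) // 2
        center = factor - 1 if kernel_size % 2 == 1 else factor - 0.5
        og = torch.arange(kernel_size, dtype=torch.float32)
        filt = 1 - torch.abs(og - center) / factor
        kernel2d = filt[:, None] * filt[None, :]
        weight = torch.zeros(channels, channels, kernel_size, kernel_size)
        for c in range(channels):
            weight[c, c] = kernel2d
        return weight

    def init_decoder_bilinear(model):
        # Initialize every square transposed convolution bilinearly.
        for m in model.modules():
            if isinstance(m, nn.ConvTranspose2d) and m.in_channels == m.out_channels:
                with torch.no_grad():
                    m.weight.copy_(bilinear_kernel(m.in_channels, m.kernel_size[0]))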
Next, at block 530, the procedure proceeds to prepare a set of training images and their ground truth, which may be manually segmented results.
With these training images and their ground truth, the cascaded encoder-decoder network may be trained by forward propagation and backward propagation at block 540. For example, the parameters may be optimized by a Stochastic Gradient Descent (SGD) algorithm iteratively until convergence.
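A minimal sketch of this training step, assuming the CEDNet class from the earlier sketch and a hypothetical train_loader that yields images with per-pixel ground-truth labels, might look as follows; the hyper-parameter values are assumptions, not values prescribed by this disclosure.

    import torch
    import torch.nn as nn

    model = CEDNet(num_classes=12)            # CEDNet as sketched earlier
    criterion = nn.CrossEntropyLoss()         # pixel-wise cross-entropy loss
    optimizer = torch.optim.SGD(model.parameters(), lr=0.01,
                                momentum=0.9, weight_decay=5e-4)
    num_epochs = 100                          # assumed; iterate until convergence

    for epoch in range(num_epochs):
        for images, labels in train_loader:   # hypothetical loader: (N,3,H,W), (N,H,W)
            scores = model(images)            # forward propagation -> (N,C,H,W)
            loss = criterion(scores, labels)  # compare with ground truth
            optimizer.zero_grad()
            loss.backward()                   # backward propagation
            optimizer.step()                  # SGD parameter update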
After the training, the cascaded encoder-decoder network can be applied for semantic segmentation. For example, in the application scenario depicted in Figure 1, scene semantic segmentation may be implemented with the trained cascaded encoder-decoder network. The task of scene semantic segmentation is to segment the objects in the scene images into different categories, which are visualized by different colors. When the scene images or the frames of a video are input into the trained cascaded encoder-decoder network, forward computations are conducted from the first encoder-decoder network to the final encoder-decoder network, and the predicted segmentation results are output. The output images generally share the same resolution as the input images. Scene semantic segmentation with a cascaded encoder-decoder network may be widely used in practice, such as in Advanced Driver Assistance Systems (ADAS), autonomous vehicles, and the like. Thus, there are broad application scenarios for the disclosed technique.
Compared with traditional single convolutional encoder-decoder networks, better segmentation performance can be achieved by the cascaded encoder-decoder network according to the disclosed technique. Figure 6 shows results of the present invention as compared with an existing leading technique named SegNet. The CamVid road scene dataset (G. Brostow, J. Fauqueur, and R. Cipolla, “Semantic object classes in video: A high-definition ground truth database,” PRL, vol. 30 (2), pp. 88–97, 2009) is used for evaluating these techniques. The challenge is to segment road scenes containing 11 classes, such as road, building, cars, pedestrians, signs, poles, side-walk, etc. The experiments are conducted based on the SegNet tutorial (http://mi.eng.cam.ac.uk/projects/segnet/tutorial.html). The cascaded encoder-decoder network according to the present invention obtains a 52.31 mIoU (mean Intersection-over-Union) score, which outperforms the 50.2 mIoU of SegNet. This demonstrates the effectiveness of the present invention. From Figure 6, it can be seen that the output image 630 of the proposed CEDNet has a higher boundary delineation accuracy than the output image 620 of SegNet, which further demonstrates the effectiveness of the proposed CEDNet for semantic segmentation.
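For reference, the mIoU metric quoted here is computed by taking, for each class, the intersection over union between the predicted and ground-truth label maps, and averaging over the classes. The following is a minimal sketch assuming integer label arrays of equal shape.

    import numpy as np

    def mean_iou(pred, gt, num_classes):
        # pred and gt are integer label maps of identical shape.
        ious = []
        for c in range(num_classes):
            inter = np.logical_and(pred == c, gt == c).sum()
            union = np.logical_or(pred == c, gt == c).sum()
            if union > 0:              # skip classes absent from both maps
                ious.append(inter / union)
        return float(np.mean(ious))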
Although in the above application scenario the cascaded encoder-decoder network is used for scene semantic segmentation, it has wide applications in practice. In theory, it can be used for segmenting objects in any kind of raw data material into different categories; the raw data material is not limited to digital images. In an example, the raw data material may be textual data, and the cascaded encoder-decoder network may be used to learn representative features from the raw textual data, so as to facilitate a textual translation from one language to another. In another example, the raw data material may be audio data, and the cascaded encoder-decoder network may be used to learn representative features from the raw audio data, so as to facilitate speech recognition.
Reference now is made to Figure 7, which is a schematic illustration of a method for semantic segmentation with a convolutional neural network according to at least some embodiments of the present invention. In procedure 710, a plurality of layers of convolution are performed in series on a raw data material via a first encoder-decoder network, to generate a first set of feature maps. The raw data material may be a digital image, digital video, a piece of text in one language, audio data, or the like. The first set of feature maps may be generated without dimensionality reduction. In the plurality of layers of convolution on the raw data material, up-sampling may be performed so that the resolution of the first set of feature maps is the same as the resolution of the raw data material.
In procedure 720, a plurality of layers of convolution are performed in series on the first set of feature maps via a group of middle encoder-decoder networks cascaded with the first encoder-decoder network, to generate a second set of feature maps for classifying objects of the raw data material. The group of middle encoder-decoder networks may comprise one encoder-decoder network or at least two encoder-decoder networks cascaded one-by-one.
Further, objects of the raw data material may be classified by using the second set of feature maps, via a final encoder-decoder network of the group of middle encoder-decoder networks. Alternatively, objects of the raw data material may be classified by using the second set of feature maps via an object classifier coupled with a final encoder-decoder network of the group of middle encoder-decoder networks. The object classifier may be a one-by-one convolution classifier. The object classifier reduces the dimension of the second set of feature maps to the dimension of the objects of the raw data material. In either case, the dimension of the feature maps may be reduced, via the final encoder-decoder network, to approach the dimension of the objects of the raw data material.
In the plurality of layers of convolution in at least one of the first encoder-decoder network and the middle encoder-decoder networks, a feature map may be down-sampled through an encoding path, and a feature map may be up-sampled through a decoding path. The architecture of the decoding path can be symmetrical to that of the encoding path.
As described above, Figure 7 illustrates a flowchart of an apparatus, method, and computer program product according to example embodiments of the invention for semantic segmentation with a convolutional neural network. It will be understood that each block of the flowchart, and combinations of blocks in the flowchart, may be implemented by various means, such as hardware, firmware, processor, circuitry, and/or other devices associated with execution of software including one or more computer program instructions. For example, one or more of the procedures described above may be embodied by semantic segmentation circuitry 210 of the apparatus 200, in conjunction with the processor 202 and the memory 204.
In other examples, one or more of the procedures described above may be embodied by computer program instructions. In this regard, the computer program instructions which embody the procedures described above may be stored by a memory device of an apparatus employing an embodiment of the present invention and executed by a processor of the apparatus. As will be appreciated, any such computer program instructions may be loaded onto a computer or other programmable apparatus (e.g., hardware) to produce a machine, such that the resulting computer or other programmable apparatus implements the functions specified in the flowchart blocks. These computer program instructions may also be stored in a computer-readable memory that may direct a computer or other programmable apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture, the execution of which implements the function specified in the flowchart blocks. The computer program instructions may also be loaded onto a computer or other programmable apparatus to cause a series of operations to be performed on the computer or other programmable apparatus to produce a computer-implemented process such that the instructions which execute on the computer or other programmable apparatus provide operations for implementing the functions specified in the flowchart blocks.
Accordingly, blocks of the flowchart support combinations of means for performing the specified functions and combinations of operations for performing the specified functions. It will also be understood that one or more blocks of the flowchart, and combinations of blocks in the flowchart, can be implemented by special purpose hardware-based computer systems which perform the specified functions, or combinations of special purpose hardware and computer instructions.
In some embodiments, certain ones of the operations above may be modified or further amplified. Furthermore, in some embodiments, additional optional operations may be included. Modifications, additions, or amplifications to the operations above may be performed in any order and in any combination.
Many modifications and other embodiments of the inventions set forth herein will come to mind to one skilled in the art to which these inventions pertain having the benefit of the teachings presented in the foregoing descriptions and the associated drawings. Therefore, it is to be understood that the inventions are not to be limited to the specific embodiments disclosed and that modifications and other embodiments are intended to be included within the scope of the appended claims.
Moreover, although the foregoing descriptions and the associated drawings describe example embodiments in the context of certain example combinations of elements and/or functions, it should be appreciated that different combinations of elements and/or functions may be provided by alternative embodiments without departing from the scope of the appended claims. In this regard, for example, different combinations of elements and/or functions than those explicitly described above are also contemplated as may be set forth in some of the appended claims. Although specific terms are employed herein, they are used in a generic and descriptive sense only and not for purposes of limitation.

Claims (32)

  1. A system for semantic segmentation with a convolutional neural network, the system comprising:
    a first encoder-decoder network, configured to perform a plurality of layers of convolution on a raw data material to generate a first set of feature maps; and
    a group of middle encoder-decoder networks cascaded with the first encoder-decoder network, and configured to perform a plurality of layers of convolution on the first set of feature maps to generate a second set of feature maps for classifying objects of the raw data material.
  2. The system according to claim 1, wherein the group of middle encoder-decoder networks comprises one encoder-decoder network or at least two encoder-decoder networks cascaded one-by-one.
  3. The system according to claim 1, wherein a final encoder-decoder network of the group of middle encoder-decoder networks is configured to classify objects of the raw data material using the second set of feature maps.
  4. The system according to claim 1, further comprising: an object classifier coupled with a final encoder-decoder network of the group of middle encoder-decoder networks, and configured to classify objects of the raw data material using the second set of feature maps.
  5. The system according to claim 4, wherein the object classifier is a one-by-one convolution classifier, configured to reduce a dimension of the second set of feature maps to a dimension of the objects of the raw data material.
  6. The system according to claim 4, wherein the final encoder-decoder network is further configured to reduce a dimension of feature maps to approach a dimension of the objects of the raw data material.
  7. The system according to claim 1, wherein the first encoder-decoder network is further configured to generate the first set of feature maps without dimension reductionality.
  8. The system according to claim 1, wherein the raw data material is a raw data of an image.
  9. The system according to claim 1, wherein at least one of the first encoder-decoder network and the middle encoder-decoder networks comprises an encoding path for performing down-sampling and a decoding path for performing up-sampling, wherein an architecture of the decoding path is symmetrical with that of the encoding path.
  10. The system according to claim 9, wherein the up-sampling is performed so that a resolution of the feature maps is the same as a resolution of the raw data material.
  11. A method for semantic segmentation with a convolutional neural network, the method comprising:
    performing a plurality of layers of convolution in series on a raw data material via a first encoder-decoder network, to generate a first set of feature maps; and
    performing a plurality of layers of convolution in series on the first set of feature maps via a group of middle encoder-decoder networks cascaded with the first encoder-decoder network, to generate a second set of feature maps for classifying objects of the raw data material.
  12. The method according to claim 11, wherein the group of middle encoder-decoder networks comprises one encoder-decoder network or at least two encoder-decoder networks cascaded one-by-one.
  13. The method according to claim 11, further comprising:
    classifying objects of the raw data material using the second set of feature maps, via a final encoder-decoder network of the group of middle encoder-decoder networks.
  14. The method according to claim 11, further comprising:
    classifying objects of the raw data material using the second set of feature maps via an object classifier coupled with a final encoder-decoder network of the group of middle encoder-decoder networks.
  15. The method according to claim 14, wherein the object classifier is a one-by-one convolution classifier, and classifying the objects comprises:
    reducing a dimension of the second set of feature maps to a dimension of the objects of the raw data material.
  16. The method according to claim 14, further comprising:
    reducing a dimension of feature maps to approach a dimension of the objects of the raw data material, via the final encoder-decoder network.
  17. The method according to claim 11, wherein the first set of feature maps is generated without dimension reductionality.
  18. The method according to claim 11, wherein the raw data material is a raw data of an image.
  19. The method according to claim 11, wherein performing a plurality of layers of convolution in at least one of the first encoder-decoder network and the middle encoder-decoder networks comprises performing down-sampling in an encoding path, and performing up-sampling in a decoding path,
    wherein an architecture of the decoding path is symmetrical with that of the encoding path.
  20. The method according to claim 19, wherein the up-sampling is performed so that a resolution of the feature map is the same as a resolution of the raw data material.
  21. An apparatus for semantic segmentation with a convolutional neural network, the apparatus comprising:
    at least one processor; and
    at least one memory including computer program code,
    the at least one memory and the computer program code configured to, with the at least one processor, cause the apparatus to perform at least the following:
    performing a plurality of layers of convolution in series on a raw data material via a first encoder-decoder network, to generate a first set of feature maps; and
    performing a plurality of layers of convolution in series on the first set of feature maps via a group of middle encoder-decoder networks cascaded with the first encoder-decoder network, to generate a second set of feature maps for classifying objects of the raw data material.
  22. The apparatus according to claim 21, wherein the group of middle encoder-decoder networks comprises one encoder-decoder network or at least two encoder-decoder networks cascaded one-by-one.
  23. The apparatus according to claim 21, wherein the apparatus is further caused to classify objects of the raw data material using the second set of feature maps, via a final encoder-decoder network of the group of middle encoder-decoder networks.
  24. The apparatus according to claim 21, wherein the apparatus is further caused to classify objects of the raw data material using the second set of feature maps via an object classifier coupled with a final encoder-decoder network of the group of middle encoder-decoder networks.
  25. The apparatus according to claim 24, wherein the object classifier is a one-by-one convolution classifier, and the objects are classified by reducing a dimension of the second set of feature maps to a dimension of the objects of the raw data material.
  26. The apparatus according to claim 24, wherein the apparatus is further caused to reduce a dimension of feature maps to approach a dimension of the objects of the raw data material, via the final encoder-decoder network.
  27. The apparatus according to claim 21, wherein the first set of feature maps is generated without dimension reductionality.
  28. The apparatus according to claim 21, wherein the apparatus is caused to perform a plurality of layers of convolution on a raw data material by performing up-sampling so that a resolution of the first set of feature maps is the same as a resolution of the raw data material.
  29. The apparatus according to claim 21, wherein the raw data material is a raw data of an image.
  30. A computer-readable storage medium carrying one or more sequences of one or more instructions which, when executed by one or more processors, cause an apparatus to perform at least a method of any one of claims 11-20.
  31. An apparatus comprising means for performing a method according to any of claims 12-20.
  32. A computer program product including one or more sequences of one or more instructions which, when executed by one or more processors, cause an apparatus to perform the steps of at least a method of any one of claims 11-20.
PCT/CN2018/072050 2018-01-10 2018-01-10 Apparatus and method for semantic segmentation with convolutional neural network WO2019136623A1 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
PCT/CN2018/072050 WO2019136623A1 (en) 2018-01-10 2018-01-10 Apparatus and method for semantic segmentation with convolutional neural network

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/CN2018/072050 WO2019136623A1 (en) 2018-01-10 2018-01-10 Apparatus and method for semantic segmentation with convolutional neural network

Publications (1)

Publication Number Publication Date
WO2019136623A1 (en)

Family

ID=67218845

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2018/072050 WO2019136623A1 (en) 2018-01-10 2018-01-10 Apparatus and method for semantic segmentation with convolutional neural network

Country Status (1)

Country Link
WO (1) WO2019136623A1 (en)


Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107004138A (en) * 2014-12-17 2017-08-01 诺基亚技术有限公司 Utilize the object detection of neutral net
US20170124409A1 (en) * 2015-11-04 2017-05-04 Nec Laboratories America, Inc. Cascaded neural network with scale dependent pooling for object detection
GB2549554A (en) * 2016-04-21 2017-10-25 Ramot At Tel-Aviv Univ Ltd Method and system for detecting an object in an image
CN106447658A (en) * 2016-09-26 2017-02-22 西北工业大学 Significant target detection method based on FCN (fully convolutional network) and CNN (convolutional neural network)

Cited By (37)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20220148333A1 (en) * 2019-04-10 2022-05-12 Eyeware Tech Sa Method and system for estimating eye-related geometric parameters of a user
CN110490083A (en) * 2019-07-23 2019-11-22 苏州国科视清医疗科技有限公司 A kind of pupil accurate detecting method based on fast human-eye semantic segmentation network
CN110472728B (en) * 2019-07-30 2023-05-23 腾讯科技(深圳)有限公司 Target information determining method, target information determining device, medium and electronic equipment
CN110472728A (en) * 2019-07-30 2019-11-19 腾讯科技(深圳)有限公司 Target information determines method, target information determining device, medium and electronic equipment
CN110610509A (en) * 2019-09-18 2019-12-24 上海大学 Optimized matting method and system capable of assigning categories
CN110610509B (en) * 2019-09-18 2023-07-21 上海大学 Optimizing matting method and system capable of specifying category
CN110674742B (en) * 2019-09-24 2023-04-07 电子科技大学 Remote sensing image road extraction method based on DLinkNet
CN110674742A (en) * 2019-09-24 2020-01-10 电子科技大学 Remote sensing image road extraction method based on DLinkNet
US11636626B2 (en) 2019-11-20 2023-04-25 Samsung Electronics Co., Ltd. Apparatus and method of using AI metadata related to image quality
CN114731455A (en) * 2019-11-20 2022-07-08 三星电子株式会社 Apparatus and method for using AI metadata related to image quality
WO2021101243A1 (en) 2019-11-20 2021-05-27 Samsung Electronics Co., Ltd. Apparatus and method for using ai metadata related to image quality
EP4000272A4 (en) * 2019-11-20 2022-11-09 Samsung Electronics Co., Ltd. Apparatus and method for using ai metadata related to image quality
CN113469181B (en) * 2020-03-31 2024-04-05 北京四维图新科技股份有限公司 Image semantic segmentation processing method, device and storage medium
CN113469181A (en) * 2020-03-31 2021-10-01 北京四维图新科技股份有限公司 Image semantic segmentation processing method and device and storage medium
CN111612803B (en) * 2020-04-30 2023-10-17 杭州电子科技大学 Vehicle image semantic segmentation method based on image definition
CN111612803A (en) * 2020-04-30 2020-09-01 杭州电子科技大学 Vehicle image semantic segmentation method based on image definition
CN111666863A (en) * 2020-06-01 2020-09-15 广州市百果园信息技术有限公司 Video processing method, device, equipment and storage medium
CN111666863B (en) * 2020-06-01 2023-04-18 广州市百果园信息技术有限公司 Video processing method, device, equipment and storage medium
CN111798460B (en) * 2020-06-17 2023-08-01 南京信息工程大学 Satellite image segmentation method
CN111798460A (en) * 2020-06-17 2020-10-20 南京信息工程大学 Satellite image segmentation method
CN111754532A (en) * 2020-08-12 2020-10-09 腾讯科技(深圳)有限公司 Image segmentation model searching method and device, computer equipment and storage medium
CN111754532B (en) * 2020-08-12 2023-07-11 腾讯科技(深圳)有限公司 Image segmentation model searching method, device, computer equipment and storage medium
CN112102251B (en) * 2020-08-20 2023-10-31 上海壁仞智能科技有限公司 Method and device for dividing image, electronic equipment and storage medium
CN112102251A (en) * 2020-08-20 2020-12-18 上海壁仞智能科技有限公司 Method and device for segmenting image, electronic equipment and storage medium
CN112233052A (en) * 2020-10-15 2021-01-15 北京四维图新科技股份有限公司 Expansion convolution processing method, image processing device and storage medium
CN112233052B (en) * 2020-10-15 2024-04-30 北京四维图新科技股份有限公司 Expansion convolution processing method, image processing method, apparatus and storage medium
CN113076803B (en) * 2021-03-03 2022-09-30 中山大学 Building vector extraction method and system based on high-resolution remote sensing image
CN113076803A (en) * 2021-03-03 2021-07-06 中山大学 Building vector extraction method and system based on high-resolution remote sensing image
CN113408471B (en) * 2021-07-02 2023-03-28 浙江传媒学院 Non-green-curtain portrait real-time matting algorithm based on multitask deep learning
CN113408471A (en) * 2021-07-02 2021-09-17 浙江传媒学院 Non-green-curtain portrait real-time matting algorithm based on multitask deep learning
CN113628168A (en) * 2021-07-14 2021-11-09 深圳海翼智新科技有限公司 Target detection method and device
CN113840145B (en) * 2021-09-23 2023-06-09 鹏城实验室 Image compression method for joint optimization of human eye viewing and visual analysis
CN113840145A (en) * 2021-09-23 2021-12-24 鹏城实验室 Image compression method for human eye viewing and visual analysis joint optimization
CN114004973B (en) * 2021-12-30 2022-12-27 深圳比特微电子科技有限公司 Decoder for image semantic segmentation and implementation method thereof
CN114004973A (en) * 2021-12-30 2022-02-01 深圳比特微电子科技有限公司 Decoder for image semantic segmentation and implementation method thereof
CN116258947A (en) * 2023-03-07 2023-06-13 浙江研几网络科技股份有限公司 Industrial automatic processing method and system suitable for home customization industry
CN116258947B (en) * 2023-03-07 2023-08-18 浙江研几网络科技股份有限公司 Industrial automatic processing method and system suitable for home customization industry

Similar Documents

Publication Publication Date Title
WO2019136623A1 (en) Apparatus and method for semantic segmentation with convolutional neural network
US20210125338A1 (en) Method and apparatus for computer vision
US11755889B2 (en) Method, system and apparatus for pattern recognition
WO2022105125A1 (en) Image segmentation method and apparatus, computer device, and storage medium
Zhou et al. Multi-scale context for scene labeling via flexible segmentation graph
Isa et al. Optimizing the hyperparameter tuning of YOLOv5 for underwater detection
Galvao et al. Pedestrian and vehicle detection in autonomous vehicle perception systems—A review
Putra et al. A deep neural network model for multi-view human activity recognition
Cho et al. Semantic segmentation with low light images by modified CycleGAN-based image enhancement
US11386287B2 (en) Method and apparatus for computer vision
Pan et al. CGINet: Cross-modality grade interaction network for RGB-T crowd counting
Mehtab et al. Flexible neural network for fast and accurate road scene perception
Dong et al. EGFNet: Edge-aware guidance fusion network for RGB–thermal urban scene parsing
Choo et al. Learning background subtraction by video synthesis and multi-scale recurrent networks
Liang et al. Car detection and classification using cascade model
Van Quyen et al. Feature pyramid network with multi-scale prediction fusion for real-time semantic segmentation
Li et al. Weather-degraded image semantic segmentation with multi-task knowledge distillation
Zhang et al. Interactive spatio-temporal feature learning network for video foreground detection
Liu et al. Vision-based environmental perception for autonomous driving
Chen et al. Real-time lane detection model based on non bottleneck skip residual connections and attention pyramids
Gao et al. Robust lane line segmentation based on group feature enhancement
Zhang et al. Behaviour recognition based on the integration of multigranular motion features in the Internet of Things
Liu et al. Dilated high-resolution network driven RGB-T multi-modal crowd counting
Choi et al. ADFNet: accumulated decoder features for real‐time semantic segmentation
US20240013521A1 (en) Sequence processing for a dataset with frame dropping

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 18899570

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 18899570

Country of ref document: EP

Kind code of ref document: A1