WO2020000431A1 - Method, apparatus and computer readable media for image processing - Google Patents

Method, apparatus and computer readable media for image processing

Info

Publication number
WO2020000431A1
Authority
WO
WIPO (PCT)
Prior art keywords
features
capsule
convolutional
maps
image
Prior art date
Application number
PCT/CN2018/093833
Other languages
French (fr)
Inventor
Yazhao LI
Original Assignee
Nokia Technologies Oy
Nokia Technologies (Beijing) Co., Ltd.
Priority date
Filing date
Publication date
Application filed by Nokia Technologies Oy, Nokia Technologies (Beijing) Co., Ltd. filed Critical Nokia Technologies Oy
Priority to PCT/CN2018/093833 priority Critical patent/WO2020000431A1/en
Publication of WO2020000431A1 publication Critical patent/WO2020000431A1/en

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 Scenes; Scene-specific elements
    • G06V20/50 Context or environment of the image
    • G06V20/56 Context or environment of the image exterior to a vehicle by using sensors mounted on the vehicle
    • G06V20/58 Recognition of moving objects or obstacles, e.g. vehicles or pedestrians; Recognition of traffic objects, e.g. traffic signs, traffic lights or roads
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G06F18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2413 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on distances to training or reference patterns
    • G06F18/24133 Distances to prototypes
    • G06F18/24137 Distances to cluster centroïds
    • G06F18/2414 Smoothing the distance, e.g. radial basis function networks [RBFN]
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/20 Image preprocessing
    • G06V10/25 Determination of region of interest [ROI] or a volume of interest [VOI]
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/40 Extraction of image or video features
    • G06V10/44 Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components
    • G06V10/443 Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components by matching or filtering
    • G06V10/449 Biologically inspired filters, e.g. difference of Gaussians [DoG] or Gabor filters
    • G06V10/451 Biologically inspired filters, e.g. difference of Gaussians [DoG] or Gabor filters with interaction between the filter responses, e.g. cortical complex cells
    • G06V10/454 Integrating the filters into a hierarchical structure, e.g. convolutional neural networks [CNN]
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82 Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks

Definitions

  • circuitry may refer to one or more or all of the following:
  • circuitry also covers an implementation of merely a hardware circuit or processor (or multiple processors) or portion of a hardware circuit or processor and its (or their) accompanying software and/or firmware.
  • circuitry also covers, for example and if applicable to the particular claim element, a baseband integrated circuit or processor integrated circuit for a computing device.
  • a computing device refers to any device that is capable of computing and data processing.
  • a computing device may include, but is not limited to, one or more of a video camera, a still camera, a radar, a LiDAR (Light Detection and Ranging) device, a mobile phone, a cellular phone, a smart phone, voice over IP (VoIP) phones, wireless local loop phones, a tablet, a wearable terminal device, a personal digital assistant (PDA), portable computers, a desktop computer, image capture terminal devices such as digital cameras, gaming terminal devices, a sensor device installed with a camera, a vehicle installed with a camera, a drone installed with a camera, a robot installed with a camera, and the like, or any combination thereof.
  • a computing device or an apparatus may represent a machine or other device that performs monitoring and/or measurements, and transmits the results of such monitoring and/or measurements to another device.
  • the computing device may in this case be a machine-to-machine (M2M) device, which may in a 3GPP context be referred to as a machine-type communication (MTC) device.
  • a computing device or an apparatus may be utilized in, for example, object detection, semantic segmentation, visual surveillance, Autonomous Driving System (ADS) , Advanced Driver Assistant Systems (ADAS) , human-machine interaction (HMI) , and so on.
  • a deep CNN is constructed by stacking several convolution blocks with subsampling layers.
  • Fig. 1 shows an example of a typical architecture of a deep CNN with 3 convolution blocks 101, 103 and 105, 2 subsampling layers 102 and 104 and a Region Proposal Network (RPN) and detection module 106 for object detection.
  • Each convolutional block comprises a plurality of convolutional layers which extract features via a convolutional operation.
  • the subsampling layers 102 and 104 reduce a dimension of the features via subsampling, e.g., pooling or convolutional operation with a large stride.
  • Although the subsampling layer is helpful for invariance of the CNN, problems of a traditional CNN with the subsampling layer (e.g., a pooling layer) have been observed.
  • position information, which is important for position sensitive tasks such as object detection in ADAS and semantic segmentation, is discarded by the subsampling layer, which decreases the performance of the CNN for position sensitive tasks.
  • features extracted by a traditional CNN with a subsampling layer are inadequate for representing the objects.
  • pooling usually selects the maximum value (Max pooling) or average value (Average pooling) as an activation of the features, which also causes information loss.
  • PL-CNN Position Loss CNN
  • features extracted should represent properties of the objects sufficiently. For instance, the features should represent a shape, a position, and a category of an object precisely when using a CNN for object detection.
  • precisely detecting surrounding objects 203-205 and their relative and/or absolute locations by a camera 202 installed in a vehicle 201 may help avoid traffic accidents effectively.
  • locations 213-215 of the objects 203-205 may be detected and fed back to the vehicle 201 to facilitate autonomous driving.
  • a new CNN architecture is proposed, which may be used for position sensitive tasks and other applications, and may be called a Position Sensitive CNN (PS-CNN) .
  • abundant features are extracted by deep convolutional layers without discarding the position information.
  • traditional subsampling layers like pooling layers are removed from the deep CNN, and instead, a capsule based routing method is introduced for better feature selection.
  • position information of the objects may be maintained for improving performance of the deep CNN for position sensitive tasks.
  • the aforementioned problem due to lack of position information may be solved.
  • An example architecture of the proposed PS-CNN 300 is shown schematically in FIG. 3.
  • in FIG. 3, the subsampling layers 102 and 104 of the conventional PL-CNN shown in FIG. 1 are replaced with capsule layers 302 and 304, respectively, to keep position information and to enable better feature extraction.
  • functions and structures of the convolutional blocks 301, 303, and 305 and the RPN and detection module 306 may be the same as those of the convolutional blocks 101, 103 and 105 and the RPN and detection module 106 shown in FIG. 1.
  • FIG. 3 is provided just for illustration rather than limitation. Therefore, the configuration and architecture of the proposed CNN can be adjusted based on demands of different tasks. For instance, more or fewer convolutional blocks and/or capsule layers may be used in other embodiments based on needs.
  • FIG. 4 shows a flow chart of a method 400 of image processing based on a CNN, e.g., the PS-CNN 300 shown in FIG. 3.
  • the method 400 may be implemented by a computing device or an apparatus, for example, the vehicle 201 shown in FIG. 2 or an apparatus installed in the vehicle 201.
  • the method 400 may also be implemented in any other computing device or apparatus.
  • some or all operations of the method 400 may be implemented in a cloud.
  • the method 400 will be described below with reference to a computing device.
  • the computing device extracts a plurality of features of an image (e.g., image 310 in FIG. 3) via a convolutional block of a CNN, e.g., the convolutional block 301 in FIG. 3.
  • the plurality of features represent information of the image, which includes, but is not limited to, position information of an object in the image (e.g., objects 321-323 in image 310 in FIG. 3).
  • a conventional convolution block in a PL-CNN may be used for extracting the features.
  • the convolution block 301 used at block 410 may include a plurality of convolutional layers 311-314, as shown in FIG. 3, for extracting the features.
  • the plurality of features of the image 310 may be extracted via a convolutional operation in each of the convolutional layers 311-314.
  • the convolution block 301 may further include a non-linear activation operation following the convolutional layers.
  • At block 420, the computing device selects one or more features, which are sufficient to maintain the position information, from the plurality of features via a capsule layer of the CNN 300.
  • the capsule layer 302 in FIG. 3 may be utilized to select features.
  • FIG. 5 shows example operations 500 that may be performed at block 420 for feature selection.
  • At block 510, the computing device generates one or more capsule maps based on the plurality of features extracted at block 410.
  • Each of the one or more capsule maps includes a set of capsules containing one or more features of the object.
  • the plurality of features may be represented hereafter as a three-dimensional feature tensor F, where H, W and C indicate the sizes of the three dimensions of F, respectively.
  • the computing device may generate the one or more capsule maps by splitting the features F into a plurality of groups and considering each group of features as a capsule map.
  • the computing device may generate each of the one or more capsule maps by conducting a convolutional operation of the extracted plurality of features and a convolutional kernel, as expressed by Equation (1).
  • in Equation (1), l represents a length of each capsule and l may be larger than 1; each capsule map is a three-dimensional real tensor with sizes of H, W and l for its three dimensions, respectively.
  • a capsule map thus comprises H×W capsules.
  • in Equation (1), W_i represents the ith convolutional kernel for generating the ith capsule map, the operation applied to W_i and F is a convolution, and C_1 represents the total number of capsule maps generated at block 510.
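  • The equation image itself is not reproduced in this text; inferred from the definitions above, a plausible form of Equation (1) is S_i = W_i * F, i.e., the ith convolutional kernel W_i convolved with the feature tensor F. The PyTorch sketch below illustrates that assumed form; the framework, tensor sizes and variable names are illustrative assumptions, not part of the disclosure.

      # Hedged sketch: generating C_1 capsule maps of capsule length l from a
      # feature tensor F, roughly in the spirit of Equation (1).
      import torch
      import torch.nn as nn

      H, W, C = 32, 32, 256          # assumed spatial size and channel count of F
      C1, l = 8, 16                  # assumed number of capsule maps and capsule length

      F = torch.randn(1, C, H, W)    # a batch of one feature tensor from a conv block

      # One convolution per capsule map; each produces l channels, i.e. a capsule
      # of length l at every spatial position, so each map holds H x W capsules.
      kernels = nn.ModuleList(nn.Conv2d(C, l, kernel_size=3, padding=1) for _ in range(C1))
      capsule_maps = [k(F) for k in kernels]      # each map: (1, l, H, W)
      S = torch.stack(capsule_maps, dim=1)        # all maps: (1, C1, l, H, W)
      print(S.shape)                              # torch.Size([1, 8, 16, 32, 32])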
  • the computing device may generate the capsule map by further processing the result of Equation (1) .
  • the computing device may apply a transformation matrix to the result of the convolutional operation, i.e., multiply the result of the convolutional operation in Equation (1) by a transformation matrix, as expressed by Equation (2), and consider the output as the capsule map.
  • in Equation (2), l_1 represents the length of the capsules generated and the multiplication is a matrix multiplication. This operation makes it possible to obtain pose information of objects.
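  • Inferred from the definitions above, a plausible form of Equation (2) is S′_i = W′_i × S_i, i.e., every capsule vector of length l in the ith capsule map is multiplied by the transformation matrix W′_i to give a capsule of length l_1. The sketch below illustrates this assumed per-capsule matrix multiplication; the shapes are illustrative only.

      # Hedged sketch of the assumed pose transformation: each capsule vector of
      # length l is multiplied by a transformation matrix to give length l_1.
      import torch

      H, W, l, l1 = 32, 32, 16, 8
      S_i = torch.randn(1, l, H, W)        # one capsule map: H*W capsules of length l
      W_prime = torch.randn(l1, l)         # assumed transformation matrix for this map

      caps = S_i.permute(0, 2, 3, 1).reshape(-1, l)                # (H*W, l)
      caps_t = caps @ W_prime.T                                    # (H*W, l1)
      S_i_prime = caps_t.reshape(1, H, W, l1).permute(0, 3, 1, 2)  # (1, l1, H, W)
      print(S_i_prime.shape)               # torch.Size([1, 8, 32, 32])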
  • the computing device obtains one or more capsule patches of the one or more capsule maps.
  • the ith capsule patch may be denoted as P_i hereafter.
  • each capsule patch may be obtained by selecting a subset of capsules from a capsule map of the one or more capsule maps.
  • the one or more capsule patches may be obtained based on a sliding window, i.e., by applying a sliding window with a predetermined stride to the one or more capsule maps.
  • the computing device obtains weighted patches based on the one or more capsule patches, for example, using Equation (3), in which each weighted patch is a weighted combination of the capsules in the corresponding capsule patch with voting weights c_k.
  • the weighted patch P′_i only comprises one capsule, i.e., it has a smaller size compared with the patch P_i. In this way, subsampling is realized, and at the same time, position information is kept. Note that the weights may be shared for all of the capsule patches in the same capsule map.
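  • Inferred from the description above, a plausible form of Equation (3) is P′_i = sum_k c_k · P_i,k, where P_i,k is the kth capsule in the capsule patch P_i and c_k is its voting weight, so that each weighted patch collapses to a single capsule. The sketch below illustrates this assumed form with a sliding window over one capsule map; window size, stride and weights are illustrative assumptions.

      # Hedged sketch: cut capsule patches out of a capsule map with a sliding
      # window, then collapse every patch into one capsule as a weighted sum of
      # its capsules (voting weights shared across patches of the same map).
      import torch
      import torch.nn.functional as nnf

      H, W, l = 32, 32, 16
      h, w, stride = 2, 2, 2                  # assumed sliding-window size and stride
      S = torch.randn(1, l, H, W)             # one capsule map

      patches = nnf.unfold(S, kernel_size=(h, w), stride=stride)  # (1, l*h*w, n_patches)
      n_patches = patches.shape[-1]
      patches = patches.view(1, l, h * w, n_patches)              # capsules per patch

      c = torch.softmax(torch.zeros(h * w), dim=0)                # uniform voting weights here
      weighted = (patches * c.view(1, 1, h * w, 1)).sum(dim=2)    # one capsule per patch

      # Re-assemble the single-capsule patches into a smaller, updated capsule map.
      H2, W2 = (H - h) // stride + 1, (W - w) // stride + 1
      S_updated = weighted.view(1, l, H2, W2)
      print(S_updated.shape)                  # torch.Size([1, 16, 16, 16])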
  • the computing device constructs an updated capsule map by combining the weighted patches associated with the capsule map. Since the weighted patch P′_i has a smaller size compared with the patch P_i, the updated capsule map is smaller than the original capsule map.
  • the computing device determines an activation for the updated capsule maps. The activation operation, expressed by Equation (4), determines the length of each capsule in the updated capsule maps.
  • the computing device outputs the activation as the selected features.
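  • Inferred from the description above (and the A-step discussed below), a plausible form of Equation (4) is S_A = ||u||, i.e., the activation of each capsule is the length (L2 norm) of its capsule vector u in the updated capsule map. The sketch below illustrates this assumed form; shapes are illustrative.

      # Hedged sketch: per-capsule activation as the length of the capsule vector.
      import torch

      C1, l, H2, W2 = 8, 16, 16, 16
      S_updated = torch.randn(1, C1, l, H2, W2)   # C_1 updated capsule maps

      S_A = S_updated.norm(dim=2)                 # length of every capsule vector
      print(S_A.shape)                            # torch.Size([1, 8, 16, 16]) -> selected features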
  • FIG. 6 shows an example process of properties extraction and feature selection using the capsule layer (e.g., capsule layer 302 in FIG. 3) .
  • the example process may be implemented at block 420 of FIG. 4 or through blocks 510-560 in FIG. 5.
  • the example process shown in FIG. 6 comprises 4 steps, denoted as G-step, W-step, R-step and A-step, respectively.
  • the G-step is used for generating capsules based on the feature maps F 601 produced by, e.g., the convolution block 301 in FIG. 3.
  • each capsule may contain various properties (e.g., shapes, directions, positions, and so on) of an object (e.g., the objects 321-323 in FIG. 3) through its l values, while a traditional feature map only contains one value at each position. Therefore, both the input feature F and the generated capsule map contain position information of the objects of interest.
  • the W-step in FIG. 6 is for pose transformation, or in other words, extracting pose information of objects.
  • the computing device may perform the pose transformation via Equation (2) above to obtain another output capsule map, shown as 603 in FIG. 6.
  • the R-step in FIG. 6 is for selecting features, e.g., by routing and voting.
  • different capsule patches may be obtained for routing and voting.
  • a principle of ‘routing by agreement’ may be adopted for determining a voting weight.
  • Results of the voting may be determined, for example, using above Equation (3) .
  • Updated capsule maps 604 are constructed based on the capsule patches.
  • A-step in FIG. 6 is for determining the activation of the capsules.
  • a length or module of a capsule vector may be utilized to represent the activation value of the capsule. That is, for each capsule, during the A-step, the computing device may obtain the length or module of the capsule vector using Equation (4) above, and use the activation S_A 605, which includes one or more selected features, as an output of the capsule layer (e.g., capsule layer 302 in FIG. 3).
  • the computing device generates a detection result of the image based on the selected one or more features.
  • the computing device may not generate the detection result directly based on the selected features; instead, it may perform further processing of the selected features to obtain one or more abstract features, and then generate the detection result based on the one or more abstract features.
  • the proposed CNN may comprise a plurality of convolutional blocks and a plurality of capsule layers, and in such a case, at block 430, the computing device may not generate the detection result directly based on the output of the first capsule layer, but may obtain one or more abstract features from the selected features via one or more of a further convolutional block (e.g., convolutional blocks 303, 305 in FIG. 3) and a further capsule layer (e.g., capsule layer 304 in FIG. 3), and generate the detection result of the image based on the one or more abstract features.
  • the computing device may extract deeper hierarchical features by the convolution block 303 in FIG. 3.
  • the convolution block 303 may include several convolutional layers and execute similar convolution operations as that of the convolution block 301.
  • the computing device may perform further feature selection using the capsule layer 304 in FIG. 3, in a similar way as that described with reference to FIGs. 5 and 6.
  • the capsule layer 304 helps capture information on position, shape and/or various other properties of the object. That is, position information can be maintained. This results in better performance than a traditional subsampling method.
  • with the routing and voting mechanism described with reference to Equation (3), features are fused and more representative features can be selected.
  • the computing device may further extract features using the convolutional block 305 in FIG. 3.
  • the convolution block 305 may include a stack of several convolution layers followed by an activation function for deeper hierarchical feature extraction. It is apparent that the features extracted by the convolution block 305 may be more abstract and may contain higher semantic information which is more suitable for the final detection.
  • the computing device may obtain the detection results based on the more abstract features by, for example, an RPN and a detection module.
  • the RPN in the PS-CNN proposed herein may be used for generating object proposals and, together with the detection module, for regressing final bounding boxes.
  • the computing device may output the bounding boxes (e.g., bounding boxes 331-333 in FIG. 3) which contain the objects.
  • the computing device may output a coordinate of an object as the detection result.
  • the detection result may further comprise a corresponding category of the object (e.g., car, animal, or man) , and/or a confidence value for the determined position of the object.
  • Fig. 7 shows an example of the detection result, where bounding boxes of three objects in an image, categories of the objects and confidence of the detection for each object are output.
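  • As a concrete illustration of such an output, a detection result for one image could be represented as a list of records, each holding a bounding box, a category and a confidence value. The field names, the dataclass representation and the sample values below are illustrative assumptions, not defined by the disclosure.

      # Hedged sketch of one possible detection-result record.
      from dataclasses import dataclass
      from typing import List, Tuple

      @dataclass
      class Detection:
          bbox: Tuple[float, float, float, float]  # (x_min, y_min, x_max, y_max) in pixels
          category: str                            # e.g. "car", "animal", "man"
          confidence: float                        # confidence for the determined position

      # Example output for an image with three detected objects (made-up values).
      detections: List[Detection] = [
          Detection((34.0, 50.0, 180.0, 220.0), "car", 0.97),
          Detection((200.0, 60.0, 320.0, 210.0), "car", 0.93),
          Detection((340.0, 80.0, 400.0, 230.0), "man", 0.88),
      ]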
  • For illustration purposes, an example process for object detection with 3 convolutional blocks and 2 capsule layers is schematically shown in FIG. 8.
  • the computing device extracts (810) features via a first convolutional block; then via a first capsule layer, the computing device selects (820) one or more features. Note that during feature selection, position information of objects in the image is kept.
  • the capsule layer may include operations as shown in FIG. 6.
  • the selected features are further processed (830, 840) via a second convolutional block and a second capsule layer, where similar operations as that of the first convolutional block and the first capsule layer may be performed. More abstract hierarchical features are extracted (850) via a third convolutional block.
  • object proposals are generated (860) via RPN, and detection results are obtained (870) via a detection module.
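  • The whole pipeline of FIG. 8 can be pictured with the following PyTorch sketch, which chains convolutional blocks with a simplified capsule-based feature selection and a stub standing in for the RPN and detection module. All layer sizes, the framework and the simplified capsule layer are illustrative assumptions, not the exact design of the disclosure.

      # Hedged end-to-end sketch of the FIG. 8 pipeline.
      import torch
      import torch.nn as nn
      import torch.nn.functional as nnf

      def conv_block(in_ch, out_ch, n_layers=2):
          """A stack of 3x3 convolutions with ReLU, standing in for blocks 301/303/305."""
          layers = []
          for i in range(n_layers):
              layers += [nn.Conv2d(in_ch if i == 0 else out_ch, out_ch, 3, padding=1),
                         nn.ReLU(inplace=True)]
          return nn.Sequential(*layers)

      class CapsuleSelect(nn.Module):
          """Simplified capsule layer: generate capsules, vote over sliding-window
          patches, and output per-capsule lengths (cf. Equations (1)-(4))."""
          def __init__(self, in_ch, n_maps=8, caps_len=16, window=2, stride=2):
              super().__init__()
              self.caps_conv = nn.Conv2d(in_ch, n_maps * caps_len, 3, padding=1)
              self.n_maps, self.caps_len = n_maps, caps_len
              self.window, self.stride = window, stride
              self.votes = nn.Parameter(torch.zeros(window * window))  # shared voting weights

          def forward(self, x):
              b = x.shape[0]
              caps = self.caps_conv(x)                                 # (b, n_maps*len, H, W)
              patches = nnf.unfold(caps, self.window, stride=self.stride)
              n_patches = patches.shape[-1]
              patches = patches.view(b, self.n_maps, self.caps_len,
                                     self.window * self.window, n_patches)
              c = torch.softmax(self.votes, dim=0).view(1, 1, 1, -1, 1)
              fused = (patches * c).sum(dim=3)                         # one capsule per patch
              h_out = (x.shape[2] - self.window) // self.stride + 1
              w_out = (x.shape[3] - self.window) // self.stride + 1
              fused = fused.view(b, self.n_maps, self.caps_len, h_out, w_out)
              return fused.norm(dim=2)                                 # capsule lengths as features

      class PSCNNSketch(nn.Module):
          def __init__(self, n_classes=3):
              super().__init__()
              self.block1, self.caps1 = conv_block(3, 64), CapsuleSelect(64)     # 810, 820
              self.block2, self.caps2 = conv_block(8, 128), CapsuleSelect(128)   # 830, 840
              self.block3 = conv_block(8, 256)                                   # 850
              self.head = nn.Conv2d(256, n_classes + 4, 1)  # stub for RPN + detection (860, 870)

          def forward(self, image):
              x = self.caps1(self.block1(image))
              x = self.caps2(self.block2(x))
              x = self.block3(x)
              return self.head(x)   # per-location class scores and box offsets

      out = PSCNNSketch()(torch.randn(1, 3, 64, 64))
      print(out.shape)              # torch.Size([1, 7, 16, 16])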
  • the architecture of the PS-CNN proposed herein may be adjusted, for example, to include more or fewer convolutional blocks and/or capsule layers.
  • proper parameters for image processing using the proposed CNN may be obtained in advance via computer simulation, or via training.
  • parameters for implementing the method 400 or 500 or the PS-CNN according to an embodiment of the present disclosure may include: the convolutional filters W_i in Equation (1), the transformation matrix W′_i in Equation (2), and the voting weights c_k in Equation (3).
  • the parameters may further include, but are not limited to: H, W, l, C_1, h, w and l_1 in Equations (1)-(4).
  • the parameters may be learned iteratively for getting an optimal solution for object detection.
  • the convolutional filters and transformation matrix may be updated as in the traditional PL-CNN by using a stochastic gradient descent algorithm.
  • the voting weights c k may be updated using a routing mechanism proposed in the paper titled “Dynamic routing between capsules” published in Advances in Neural Information Processing Systems. 2017 at pages 3859-3869, by S. Sabour, N. Frosst, and G. Hinton.
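  • As an illustration of that kind of routing, the sketch below performs a few routing-by-agreement iterations for a single capsule patch, in the spirit of the cited paper: the voting weights are a softmax over logits, and each logit grows with the agreement (dot product) between its capsule and the fused capsule. This is a simplified adaptation (the squash nonlinearity of the original paper is omitted), not the exact update used in the disclosure.

      # Hedged sketch of a simplified routing-by-agreement update for voting weights.
      import torch

      def routing_iterations(patch, n_iters=3):
          """patch: (k, l) tensor holding the k capsules of one capsule patch."""
          k = patch.shape[0]
          b = torch.zeros(k)                           # routing logits
          for _ in range(n_iters):
              c = torch.softmax(b, dim=0)              # voting weights c_k
              fused = (c.unsqueeze(1) * patch).sum(0)  # weighted sum, cf. Equation (3)
              b = b + patch @ fused                    # agreement raises a capsule's logit
          return c, fused

      c, fused = routing_iterations(torch.randn(4, 16))
      print(c)                                         # four voting weights summing to 1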
  • designing the architecture of a PS-CNN may include determining a configuration for each convolution block and capsule layer, as well as for the RPN and the detection module. For instance, one of the Inception block, the DenseNet block, the VGG block, and the ResNet block may be adopted as a basic convolution block (e.g., convolutional blocks 301, 303, and 305) of the PS-CNN.
  • the backbone network may be pre-trained, e.g., on ImageNet for an image classification task. Then the pre-trained parameters are used for initializing the PS-CNN for object detection and/or semantic segmentation.
  • a set of training images and their labels are predetermined and are input to the PS-CNN for training. Then the PS-CNN may be trained by forward propagation (e.g., by performing the method 400) and backward propagation for determining gradients of the parameters. Such a process may be performed iteratively until the results converge, to obtain optimized parameters. Note that embodiments are not limited to any specific way of training.
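  • A minimal training-loop sketch of that procedure is shown below: forward propagation, a loss on the output, backward propagation, and a stochastic gradient descent step, repeated for a number of iterations. The model, loss function and random data are placeholders; the actual detection loss and dataset are not specified here.

      # Hedged sketch of the iterative training procedure (forward, backward, SGD).
      import torch
      import torch.nn as nn

      model = nn.Sequential(nn.Conv2d(3, 8, 3, padding=1), nn.ReLU(),
                            nn.Conv2d(8, 5, 1))            # placeholder for the PS-CNN
      optimizer = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9)
      loss_fn = nn.MSELoss()                               # placeholder detection loss

      for step in range(100):                              # iterate until results converge
          images = torch.randn(4, 3, 64, 64)               # a batch of training images
          targets = torch.randn(4, 5, 64, 64)              # matching placeholder labels
          optimizer.zero_grad()
          loss = loss_fn(model(images), targets)           # forward propagation
          loss.backward()                                  # backward propagation
          optimizer.step()                                 # stochastic gradient descent update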
  • the evaluation results show that the proposed method based on the PS-CNN achieves better detection performance than conventional PL-CNNs (e.g., R-FCN and Faster R-CNN) due to more precise position information and better selected features obtained by the capsule layer(s).
  • a task of object detection is to detect objects in a scene image or video. Forward computation as described with reference to FIGs. 4-8 may be conducted based on a well-trained PS-CNN, with the scene image or a frame of the video as an input. Then, the PS-CNN outputs coordinates of the objects, and optionally categories and confidence. In addition, for visualization, the objects may be precisely surrounded by a rectangular box, which is also called a bounding box herein.
  • the object detection with the proposed PS-CNN may be widely used in practice, e.g., in ADAS, autonomous vehicles, and so on. However, it should be appreciated that object detection is just an instance of position sensitive tasks.
  • the proposed solution is also suitable for other position sensitive tasks or semantic segmentation with adjustment to the final RPN and detection module if necessary. Therefore, embodiments of the present disclosure may be applied to broad application scenarios.
  • an apparatus is also provided which may be implemented in/as a computing device.
  • the computing device may include, but is not limited to, a camera device, a vehicle installed with the camera device, a vehicle with ADAS, a vehicle with autonomous driving system, a drone installed with the camera device, and an industrial robot with the camera device.
  • the apparatus may be used for image processing and comprises: means for extracting a plurality of features of an image via a convolutional block of a CNN, wherein the convolutional block includes a plurality of convolutional layers, and the plurality of features represent position information of an object in the image; means for selecting one or more features from the plurality of features via a capsule layer of the CNN, wherein the one or more features are sufficient to maintain the position information; and means for generating a detection result of the image based on the selected one or more features.
  • the means may comprise at least one processor; and at least one memory including computer program code, the at least one memory and computer program code configured to, with the at least one processor, cause the performance of the apparatus.
  • FIG. 9 illustrates a simplified block diagram of another apparatus 900 that may be embodied in/as a computing device or an apparatus which may include, but is not limited to, a camera device, a vehicle installed with the camera device, a drone installed with the camera device, an industrial robot with the camera device, etc.
  • apparatus 900 comprises a processor 910 which controls operations and functions of apparatus 900.
  • the processor 910 may implement various operations by means of instructions 930 stored in a memory 920 coupled thereto.
  • the memory 920 may be of any type suitable to the local technical environment and may be implemented using any suitable data storage technology, such as semiconductor based memory devices, magnetic memory devices and systems, optical memory devices and systems, fixed memory and removable memory, as non-limiting examples.
  • the memory 920 can be a non-transitory computer readable medium. Though only one memory unit is shown in FIG. 9, a plurality of physically different memory units may exist in apparatus 900.
  • the processor 910 may be of any type suitable to the local technical environment, and may include one or more of general purpose computers, special purpose computers, microprocessors, digital signal processors (DSPs), central processing units (CPUs), field-programmable gate arrays (FPGAs), application specific integrated circuits (ASICs), GPUs (Graphics Processing Units), NPUs (Neural Network Processing Units), AI (Artificial Intelligence) accelerators and processors based on a multicore processor architecture, as non-limiting examples.
  • the apparatus 900 may also comprise a plurality of processors 910 in any combination thereof.
  • the processors 910 may also be coupled with one or more radio transceivers 940 which enable reception and transmission of information over wireless communication means.
  • the radio transceiver(s) 940 may comprise wireless communication means (e.g., wireless networking means, wireless telecommunication means, means for communicating according to Long Term Evolution (LTE), fifth generation (5G) communication, Narrow Band Internet of Things (NB-IoT), Long Range Wide Area Network (LoRaWAN), Dedicated Short-Range Communications (DSRC), and/or Wireless Local Area Network (WLAN) communication standards, etc.).
  • V2V vehicle-to-vehicle
  • V2X vehicle-to-anything
  • P2P peer-to-peer
  • the processor 910 and the memory 920 may operate in cooperation to implement any of the methods 400 or 500 described with reference to FIGs. 3-7. It shall be appreciated that all the features described above with reference to FIGs. 3-8 also apply to apparatus 900, and therefore will not be detailed here.
  • Various embodiments of the present disclosure may be implemented by a computer program or a computer program product executable by one or more of the processors (for example processor 910 in FIG. 9), software, firmware, hardware or a combination thereof.
  • the present disclosure may also provide a carrier containing the computer program as mentioned above (e.g., computer instructions/program 930 in FIG. 9).
  • the carrier includes a computer readable storage medium.
  • the computer readable storage medium may include, for example, an optical compact disk or an electronic memory device like a RAM (random access memory), a ROM (read only memory), Flash memory, magnetic tape, CD-ROM, DVD, Blu-ray disc and the like.
  • FIG. 10 depicts an example of a system 1000 including a machine learning model according to an embodiment of the present disclosure.
  • the system 1000 may be mounted in a vehicle 1090, such as a car or truck, although the system 1000 may be used without the vehicle 1090 as well.
  • the vehicle 1090 may be considered as an example of an apparatus according to an embodiment of the present disclosure, and may be used, for example, in an ADS application illustrated in FIG. 2.
  • the example system 1000 includes one or more sensors 1005 (e.g., a camera) and a CNN 1010 or any other machine learning algorithm or any combination thereof, in accordance with some example embodiments.
  • the CNN 1010 may include one or more convolutional blocks and one or more capsule layers, as shown in FIG. 3.
  • the system 1000 may also include one or more radio frequency transceivers 1015.
  • the radio frequency transceiver 1015 may include wireless communication means (e.g., wireless networking means, wireless telecommunication means, means for communicating according to LTE, 5G, NB-IoT, LoRaWAN, DSRC, and/or WLAN standards, etc.) which allows the system 1000 or the vehicle 1090 to communicate with one or more other devices, apparatuses or vehicles, or any combination thereof, for example in V2V, V2X, P2P, etc. manners, and to send and receive image detection related information.
  • the sensor 1005 may comprise at least one image sensor configured to provide image data, such as image frames, video, pictures, and/or the like.
  • the sensor 1005 may comprise a camera, a Lidar (light detection and ranging) sensor, a millimeter wave radar, an infrared camera, and/or other types of sensors.
  • the system 1000 may include (but is not limited to) a location detection and determination system, such as a Global Navigation Satellite System (GNSS) with its subsystems, for example, the Global Positioning System (GPS), GLONASS, the BeiDou Navigation Satellite System (BDS), the Galileo Navigation Satellite System, etc.
  • the system 1000 may be trained to detect objects, such as people, animals, other vehicles, traffic signs, road hazards, and/or the like according to, for example, method 400.
  • the vehicle 1090 may detect objects 203-205 in FIG. 2 and their relative and/or absolute locations (e.g., longitude, latitude, and altitude/elevation, and/or coordinate) .
  • an output such as a warning sound, haptic feedback, an indication of a recognized object, or another indication may be generated to, for example, warn or notify a driver.
  • the detected objects may signal control circuitry to take additional action in the vehicle (e.g., initiate braking, acceleration/deceleration, steering and/or some other action).
  • the indication may be transmitted to other vehicles, IoT devices or cloud, mobile edge computing (MEC) platform and/or the like via radio transceiver 1015.
  • the CNN 1010 may be implemented in at least one CNN circuitry, in accordance with some example embodiments.
  • the CNN circuitry may represent dedicated CNN circuitry configured with a neighbor-based activation function, g, taking into account neighbors.
  • the dedicated CNN circuitry may provide a deep CNN.
  • the CNN 1010 or the CNN circuitry may be implemented in other ways, such as using at least one memory including program code which, when executed by at least one processor, provides the CNN 1010.
  • the CNN circuitry may implement one or more embodiments for image detection described with reference to FIGs. 3-8.
  • the system 1000 may have a training phase that is executed within the system 1000.
  • the training phase may configure the CNN 1010 to learn to detect and/or classify one or more objects of interest.
  • the CNN circuitry may be trained with images including objects such as people, other vehicles, road hazards, and/or the like.
  • the trained CNN 1010 may detect the object (s) and provide an indication of the detection/classification of the object (s) .
  • the CNN 1010 may learn its configuration (e.g., parameters, weights, and/or the like) .
  • the configured CNN can be used in a test or operational phase to detect and/or classify patches or portions of an unknown, input image and thus determine whether that input image includes an object of interest or just background (i.e., not having an object of interest) .
  • the training phase can be executed out of the system 1000, for example in a cloud system, wherein the system and the cloud are connected over wired and/or wireless network communication means.
  • the training phase can be divided between the system 1000 and the cloud system.
  • an apparatus implementing one or more functions of a corresponding apparatus described with an embodiment comprises not only prior art means, but also means for implementing the one or more functions of the corresponding apparatus, and it may comprise separate means for each separate function, or means that may be configured to perform two or more functions.
  • these techniques may be implemented in hardware (e.g., a circuit or a processor), firmware, software, or combinations thereof.
  • firmware or software implementation may be made through modules (e.g., procedures, functions, and so on) that perform the functions described herein.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Medical Informatics (AREA)
  • Databases & Information Systems (AREA)
  • Software Systems (AREA)
  • Computing Systems (AREA)
  • Biodiversity & Conservation Biology (AREA)
  • Biomedical Technology (AREA)
  • Molecular Biology (AREA)
  • Data Mining & Analysis (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Evolutionary Biology (AREA)
  • General Engineering & Computer Science (AREA)
  • Image Analysis (AREA)

Abstract

Embodiments of the present disclosure relate to methods, apparatuses and computer program products for image processing. A method comprises extracting a plurality of features of an image via a convolutional block of a convolutional neural network (CNN), the convolutional block including a plurality of convolutional layers and the plurality of features including position information of an object in the image; selecting features from the plurality of features via a capsule layer of the CNN, the features selected maintaining the position information; and generating a detection result of the image based on the selected features.

Description

METHOD, APPARATUS AND COMPUTER READABLE MEDIA FOR IMAGE PROCESSING

FIELD
Non-limiting and example embodiments of the present disclosure generally relate to a technical field of signal processing, and specifically to methods, apparatuses and computer program products for image processing based on a Convolutional Neural Network (CNN) .
BACKGROUND
This section introduces aspects that may facilitate better understanding of the disclosure. Accordingly, the statements of this section are to be read in this light and are not to be understood as admissions about what is in the prior art or what is not in the prior art.
Due to its great power for feature representations, a deep CNN has obtained great success in various tasks such as object detection, semantic segmentation, visual surveillance, Advanced Driver Assistant Systems (ADAS) , human-machine interaction (HMI) , and so on.
A traditional CNN has one or more convolutional layers and one or more subsampling layers. The convolutional layer extracts features via a convolutional operation, while the subsampling layer reduces a dimension of the features via subsampling, e.g., pooling or convolutional operation with a large stride. For some applications, the features extracted by the traditional CNN with the subsampling layer (s) are inadequate for representing an object.
SUMMARY
Various embodiments of the present disclosure mainly aim at providing methods, apparatuses and computer storage media for image processing.
In a first aspect of the disclosure, there is provided a method of image processing. The method comprises: extracting a plurality of features of an image via a convolutional block of a convolutional neural network, CNN, the convolutional block including a plurality of convolutional layers and the plurality of features including position information of an object in the image; selecting one or more features from the plurality of features via a capsule layer of the CNN, the one or more features selected being sufficient to maintain the position information; and generating a detection result of the image based on the selected one or more features.
In some embodiments, generating the detection result of the image based on the selected one or more features may comprise: obtaining one or more abstract features from the selected one or more features via one or more further convolutional blocks and/or one or more further capsule layers; and generating a detection result of the image based on the one or more abstract features.
In some embodiments, selecting the one or more features from the plurality of features via the capsule layer of the CNN may comprise: generating one or more capsule maps based on the plurality of features, each of the capsule maps including a set of capsules containing one or more features of the object; obtaining one or more capsule patches of the one or more capsule maps; obtaining weighted patches based on the one or more capsule patches; constructing updated capsule maps by combining the weighted patches; determining an activation for the updated capsule maps; and outputting the activation as the selected features.
In some further embodiments, generating the one or more capsule maps may comprise: generating each of the one or more capsule maps by conducting a convolutional operation of the extracted plurality of features and a convolutional kernel. In some embodiments, generating each of the one or more capsule maps may further comprise: obtaining a capsule map by applying a transformation matrix to a result of the convolutional operation.
In some embodiments, obtaining the one or more capsule patches of the one or more capsule maps may comprise: obtaining the one or more capsule patches by applying a sliding window with a predetermined stride to the one or more capsule maps.
In some embodiments, determining the activation for the updated capsule maps may comprise: determining a length of each capsule included in the capsule maps.
In some embodiments, generating the detection result of the image based on the selected one or more features may comprise: determining a position of the object in the image based on the selected one or more features; and outputting a coordinate or a bounding box for the object as the detection result.
In some embodiments, the detection result may further comprise one or more of: a category of the object, and a confidence value for the determined position.
In some embodiments, the method may further comprise: determining parameters for the one or more convolutional blocks and the one or more capsule layers via training.
In a second aspect of the present disclosure, there is provided an apparatus for image processing. The apparatus comprises at least one processor; and at least one memory including computer program codes; the at least one memory and the computer program codes are configured to, with the at least one processor, cause the apparatus at least to: extract a plurality of features of an image via a convolutional block of a convolutional neural network, CNN, the convolutional block including a plurality of convolutional layers, and the plurality of features including position information of an object in the image; select one or more features from the plurality of features via a capsule layer of the CNN, the one or more features being sufficient to maintain the position information; and generate a detection result of the image based on the selected one or more features.
In a third aspect of the present disclosure, there is provided another apparatus for image processing. The apparatus comprises means for extracting a plurality of features of an image via a convolutional block of a convolutional neural network, CNN, the convolutional block including a plurality of convolutional layers, and the plurality of features including position information of an object in the image; means for selecting one or more features from the plurality of features via a capsule layer of the CNN, the one or more features selected being sufficient to maintain the position information; and means for generating a detection result of the image based on the selected one or more features. In some embodiments, the means comprises at least one processor; and at least one memory including computer program code, the at least one memory and computer program code configured to, with the at least one processor, cause the performance of the apparatus.
In a fourth aspect of the disclosure, there is provided a computer program. The computer program comprises instructions which, when executed by an apparatus, cause the apparatus to carry out the method according to the first aspect of the present disclosure.
In a fifth aspect of the disclosure, there is provided a computer readable medium with a computer program stored thereon which, when executed by an apparatus, causes the apparatus to carry out the method of the first aspect of the present disclosure.
In a sixth aspect of the present disclosure, there is provided a computing device. The computing device comprises the apparatus according to the second or third aspect of the present disclosure.
In a seventh aspect of the disclosure, there is provided a non-transitory computer readable medium comprising program instructions for causing an apparatus to perform at least the following: extracting a plurality of features of an image via a convolutional block of a convolutional neural network, CNN, the convolutional block including a plurality of convolutional layers and the plurality of features including position information of an object in the image; selecting one or more features from the plurality of features via a capsule layer of the CNN, the one or more features selected being sufficient to maintain the position information; and generating a detection result of the image based on the selected one or more features.
In an eighth aspect of the present disclosure, there is provided an apparatus for image processing, comprising: at least one processor; and at least one memory including computer program codes; the at least one memory and the computer program codes are configured to, with the at least one processor, cause the apparatus at least to: extract a plurality of features of an image via one or more convolutional blocks of a convolutional neural network (CNN), each convolutional block including a plurality of convolutional layers, and the plurality of features including position information of an object in the image; select one or more features from the plurality of features via one or more capsule layers of the CNN, the one or more features being sufficient to maintain the position information; and generate a detection result of the image based on the selected one or more features.
BRIEF DESCRIPTION OF THE DRAWINGS
The above and other aspects, features, and benefits of various embodiments of the present disclosure will become more fully apparent from the following detailed description with reference to the accompanying drawings, in which like reference signs are used to designate like or equivalent elements. The drawings are illustrated for facilitating better understanding of the embodiments of the disclosure and are not necessarily drawn to scale, in which:
FIG. 1 illustrates an architecture of a conventional deep CNN with 3 convolutional blocks and 2 subsampling layers for object detection;
FIG. 2 shows an Autonomous Driving System (ADS) application according to an example embodiment of the present disclosure;
FIG. 3 shows an example architecture of a position sensitive CNN according to an example embodiment of the present disclosure;
FIG. 4 shows a flow chart of a method of image processing based on a CNN according to an example embodiment of the present disclosure;
FIG. 5 shows example operations that may be used for feature selection according to an example embodiment of the present disclosure;
FIG. 6 shows an example process of properties extraction and feature selection using a capsule layer according to an example embodiment of the present disclosure;
FIG. 7 shows a result of detection using a method for image processing according to an example embodiment of the present disclosure;
FIG. 8 shows schematically an example process for image processing with 3 convolutional blocks and 2 capsule layers according to an example embodiment of the present disclosure;
FIG. 9 illustrates a simplified block diagram of an apparatus that may be embodied as/in a computing device according to an example embodiment of the present disclosure; and
FIG. 10 shows an example system which may be utilized for image detection according to an example embodiment of the present disclosure.
DETAILED DESCRIPTION
Hereinafter, the principle and spirit of the present disclosure will be described with reference to illustrative embodiments. It should be understood that all these embodiments are given merely for one skilled in the art to better understand and further practice the present disclosure, but not for limiting the scope of the present disclosure. For example, features illustrated or described as part of one embodiment may be used with another embodiment to yield still a further embodiment. In the interest of clarity, not all features of an actual implementation are described in this specification.
References in the specification to “one embodiment, ” “an embodiment, ” “an example embodiment, ” and the like indicate that the embodiment described may include a particular feature, structure, or characteristic, but it is not necessary that every embodiment includes the particular feature, structure, or characteristic. Moreover, such phrases are not necessarily referring to the same embodiment. Further, when a particular feature, structure, or characteristic is described in connection with an embodiment, it is submitted that it is within the knowledge of one skilled in the art to affect such feature, structure, or characteristic in connection with other embodiments whether or not explicitly described.
It shall be understood that although the terms “first” and “second” etc. may be used herein to describe various elements, these elements should not be limited by these terms. These terms are only used to distinguish one element from another. For example, a first element could be termed a second element, and similarly, a second element could be termed a  first element, without departing from the scope of example embodiments. As used herein, the term “and/or” includes any and all combinations of one or more of the listed terms.
The terminology used herein is for the purpose of describing particular embodiments only and is not intended to be limiting of example embodiments. As used herein, the singular forms “a”, “an” and “the” are intended to include the plural forms as well, unless the context clearly indicates otherwise. It will be further understood that the terms “comprises”, “comprising”, “has”, “having”, “includes” and/or “including”, when used herein, specify the presence of stated features, elements, and/or components etc., but do not preclude the presence or addition of one or more other features, elements, components and/or combinations thereof.
As used in this application, the term “circuitry” may refer to one or more or all of the following:
(a) hardware-only circuit implementations (such as implementations in only analog and/or digital circuitry) and
(b) combinations of hardware circuits and software, such as (as applicable) :
(i) a combination of analog and/or digital hardware circuit (s) with software/firmware and
(ii) any portions of hardware processor (s) with software (including digital signal processor (s) ) , software, and memory (ies) that work together to cause an apparatus, such as a mobile phone or server, to perform various functions) and
(c) hardware circuit(s) and/or processor(s), such as a microprocessor(s) or a portion of a microprocessor(s), that requires software (e.g., firmware) for operation, but the software may not be present when it is not needed for operation.
This definition of circuitry applies to all uses of this term in this application, including in any claims. As a further example, as used in this application, the term circuitry also covers an implementation of merely a hardware circuit or processor (or multiple processors) or portion of a hardware circuit or processor and its (or their) accompanying software and/or firmware. The term circuitry also covers, for example and if applicable to the particular claim element, a baseband integrated circuit or processor integrated circuit for a computing device.
As used herein, the term “computing device” or “apparatus” refers to any device that is capable of computing and data processing. By way of example rather than limitation, a computing device may include, but is not limited to, one or more of a video camera, a still camera, a radar, a LiDAR (Light Detection and Ranging) device, a mobile phone, a cellular phone, a smart phone, voice over IP (VoIP) phones, wireless local loop phones, a tablet, a wearable terminal device, a personal digital assistant (PDA), portable computers, desktop computer, image capture terminal devices such as digital cameras, gaming terminal devices, a sensor device installed with a camera, a vehicle installed with a camera, a drone installed with a camera, and a robot installed with a camera, and the like, or any combination thereof.
As yet another example, in an Internet of Things (IOT) scenario, a computing device or an apparatus may represent a machine or other device that performs monitoring and/or measurements, and transmits the results of such monitoring and/or measurements to another device. The computing device may in this case be a machine-to-machine (M2M) device, which may in a 3GPP context be referred to as a machine-type communication (MTC) device.
Many of the computing devices have an image processing capability. In addition, a computing device or an apparatus may be utilized in, for example, object detection, semantic segmentation, visual surveillance, Autonomous Driving System (ADS), Advanced Driver Assistant Systems (ADAS), human-machine interaction (HMI), and so on. In these applications and various other tasks, a Convolutional Neural Network (CNN), especially a deep CNN, has obtained great success due to its great power for feature representations.
Traditionally, a deep CNN is constructed by stacks of several convolution blocks with subsampling layers. As an instance, FIG. 1 shows an example of a typical architecture of a deep CNN with 3 convolution blocks 101, 103 and 105, 2 subsampling layers 102 and 104, and a Region Proposal Network (RPN) and detection module 106 for object detection. Each convolutional block comprises a plurality of convolutional layers which extract features via a convolutional operation. The subsampling layers 102 and 104 reduce a dimension of the features via subsampling, e.g., pooling or a convolutional operation with a large stride.
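For illustration rather than limitation, the following sketch (in Python with PyTorch, a choice this disclosure does not prescribe) mirrors the FIG. 1 layout of stacked convolution blocks separated by max-pooling subsampling layers; the kernel sizes, channel widths and input resolution are illustrative assumptions.

```python
import torch
import torch.nn as nn

def conv_block(in_ch, out_ch, n_layers=2):
    # A stack of 3x3 convolutions with ReLU activations (widths are illustrative).
    layers = []
    for i in range(n_layers):
        layers += [nn.Conv2d(in_ch if i == 0 else out_ch, out_ch, 3, padding=1),
                   nn.ReLU(inplace=True)]
    return nn.Sequential(*layers)

# Blocks 101/103/105 with subsampling layers 102/104 (max pooling) in between.
pl_cnn_backbone = nn.Sequential(
    conv_block(3, 64),            # convolution block 101
    nn.MaxPool2d(2, stride=2),    # subsampling layer 102
    conv_block(64, 128),          # convolution block 103
    nn.MaxPool2d(2, stride=2),    # subsampling layer 104
    conv_block(128, 256),         # convolution block 105; an RPN/detection head would follow
)

x = torch.randn(1, 3, 224, 224)
print(pl_cnn_backbone(x).shape)   # torch.Size([1, 256, 56, 56])
```

Running the sketch shows the spatial resolution shrinking at every pooling layer, which is exactly where the position information discussed below is lost.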
Although a subsampling layer is helpful for the invariance of the CNN, problems of a traditional CNN with the subsampling layer (e.g., a pooling layer) have been observed. First, due to the subsampling operation, position information, which is important for position sensitive tasks such as object detection in ADAS and semantic segmentation, is discarded by the subsampling layer, which decreases the performance of the CNN for position sensitive tasks. Thus, features extracted by a traditional CNN with a subsampling layer are inadequate for representing the objects. Second, as a feature selection method and an example implementation of subsampling, pooling usually selects the maximum value (Max pooling) or average value (Average pooling) as an activation of the features, which also causes information loss. Therefore, selecting representative features by a simple pooling operation is difficult, which makes it difficult to extract abundant features to represent various properties of objects. For example, due to the simple selection method used in the conventional subsampling layer, selected features may not represent regions of an object well. This kind of CNN without consideration of the position information is called a PL-CNN (Position Loss CNN) in the present disclosure.
For position sensitive tasks, it is important to precisely localize objects and their relative and/or absolute locations based, for example, on one or more of longitude, latitude, altitude/elevation, and/or coordinates. Furthermore, a proper manner of feature extraction and selection is crucial to ensure high performance of a deep CNN. Features extracted should represent properties of the objects sufficiently. For instance, the features should represent a shape, a position, and a category of an object precisely when using a CNN for object detection.
For an ADS application schematically shown in FIG. 2, precisely detecting surrounding objects 203-205 and their relative and/or absolute locations by a camera 202 installed in a vehicle 201 may help avoid traffic accidents effectively. For example, locations 213-215 of the objects 203-205 may be detected and fed back to the vehicle 201 to facilitate autonomous driving.
In the present disclosure, a new CNN architecture is proposed, which may be used for position sensitive tasks and other applications, and may be called a Position Sensitive CNN (PS-CNN) . In some embodiments, abundant features are extracted by deep convolutional layers without discarding the position information. For instance, in some embodiments, traditional subsampling layers like pooling layers are removed from the deep CNN, and instead, a capsule based routing method is introduced for better feature selection.
In some embodiments, due to great power of the capsules for capturing various properties of objects, position information of the objects may be maintained for improving performance of the deep CNN for position sensitive tasks. Thus, the aforementioned problem due to lack of position information may be solved.
Example architecture of the proposed PS-CNN 300 is shown schematically in FIG. 3. The subsampling layers 102 and 104 in the conventional PL-CNN shown in FIG. 1 are replaced with capsule layers 302 and 304, respectively, for keeping position information and better feature extraction in FIG. 3. In some embodiments, functions and structures of convolutional blocks 301, 303, and 305 and the RPN and detection module 306 may be the same as those of the convolutional blocks 101, 103 and 105 and the RPN and detection module 106 shown in FIG. 1.
Note that FIG. 3 is provided just for illustration rather than limitation. Therefore, the configuration and architecture of the proposed CNN can be adjusted based on demands of different tasks. For instance, more or fewer convolutional blocks and/or capsule layers may be used in other embodiments based on needs.
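As one non-limiting way to picture the FIG. 3 arrangement, the sketch below reuses the conv_block helper from the previous sketch and interleaves it with a hypothetical CapsuleLayer module standing in for the capsule layers; a possible CapsuleLayer implementation is sketched after the FIG. 6 discussion, and the channel counts and constructor signature used here are assumptions.

```python
import torch.nn as nn

class PSCNNBackbone(nn.Module):
    # Sketch of FIG. 3: capsule layers 302/304 replace the subsampling layers of FIG. 1.
    def __init__(self, capsule_layer_cls):
        super().__init__()
        self.block1 = conv_block(3, 64)                     # convolutional block 301
        self.caps1 = capsule_layer_cls(64, num_caps=64)     # capsule layer 302 (keeps position info)
        self.block2 = conv_block(64, 128)                   # convolutional block 303
        self.caps2 = capsule_layer_cls(128, num_caps=128)   # capsule layer 304
        self.block3 = conv_block(128, 256)                  # convolutional block 305
        # An RPN and detection module (306) would consume the returned features.

    def forward(self, x):
        x = self.caps1(self.block1(x))
        x = self.caps2(self.block2(x))
        return self.block3(x)
```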
FIG. 4 shows a flow chart of a method 400 of image processing based on a CNN, e.g., the PS-CNN 300 shown in FIG. 3. The method 400 may be implemented by a computing device or an apparatus, for example, the vehicle 201 shown in FIG. 2 or an apparatus installed in the vehicle 201. However, it should be appreciated that the method 400 may also be implemented in any computing device or apparatus. In some embodiments, some or all operations of the method 400 may be implemented in a cloud. Just for illustration purpose, and without limitation, the method 400 will be described below with reference to a computing device.
As shown in FIG. 4, at block 410, the computing device extracts a plurality of features of an image (e.g., image 310 in FIG. 3) via a convolutional block of a CNN, e.g., the convolutional block 301 in FIG. 3. The plurality of features represent information of the image, which includes, but is not limited to, position information of an object in the image (e.g., objects 321-323 in image 310 in FIG. 3).
In some embodiments, at block 410, a conventional convolution block in a PL-CNN (e.g., the convolutional block 101 in FIG. 1) may be used for extracting the features. In some embodiments, as in a traditional PL-CNN, the convolution block 301 used at block 410 may include a plurality of convolutional layers 311-314, as shown in FIG. 3, for extracting the features. For example, at block 410, the plurality of features of the image 310 may be extracted via a convolutional operation in each of the convolutional layers 311-314. In some embodiments, the convolution block 301 may further include a non-linear activation operation following the convolutional layers.
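A minimal sketch of the feature extraction at block 410, again assuming the conv_block helper above, simply materialises the feature tensor F whose dimensions H, W and C the equations below refer to; the four layers stand in for convolutional layers 311-314 and all sizes are assumptions.

```python
import torch

block_301 = conv_block(3, 64, n_layers=4)   # convolutional layers 311-314 (widths assumed)

image = torch.randn(1, 3, 224, 224)         # a stand-in for image 310
F_feat = block_301(image)                   # F with C=64 channels and spatial size H x W
print(F_feat.shape)                         # torch.Size([1, 64, 224, 224])
```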
At block 420, the computing device selects one or more features, which are sufficient to maintain the position information, from the plurality of features via a capsule layer of the CNN 300. For example, the capsule layer 302 in FIG. 3 may be utilized to select features.
FIG. 5 shows example operations 500 that may be performed at block 420 for feature selection. In this example, at block 510, the computing device generates one or more capsule maps based on the plurality of features extracted at block 410. Each of the one or more capsule maps includes a set of capsules containing one or more features of the object. To facilitate the following description, the plurality of features may be represented hereafter as F ∈ R^(H×W×C), where H, W and C indicate the sizes of the three dimensions of F, respectively.
In some embodiments, at block 510, the computing device may generate the one or more capsule maps by splitting the features F into a plurality of groups and considering each group of features as a capsule map.
Alternatively, in another embodiment, the computing device may generate each of the one or more capsule maps by conducting a convolutional operation of the extracted plurality of features and a convolutional kernel, as shown in Equation (1) below:

F_i^cap = W_i ⊛ F          (1)

where F_i^cap ∈ R^(H×W×l) denotes the generated i-th capsule map, l represents a length of each capsule and l may be larger than 1, and R^(H×W×l) means a three-dimensional real-valued tensor with sizes of H, W and l for the three dimensions, respectively. In this case, a capsule map F_i^cap comprises H×W capsules. In Equation (1), W_i represents the i-th convolutional kernel for generating the i-th capsule map, ⊛ represents the convolution operation, and C_1 represents the total number of capsule maps generated at block 510, i.e., i = 1, 2, ..., C_1.
Optionally, in some embodiments, at block 510, the computing device may generate the capsule map by further processing the result of Equation (1). For illustration rather than limitation, the computing device may apply a transformation matrix to a result of the convolutional operation, i.e., multiply the result of the convolutional operation in Equation (1) with a transformation matrix W′_i, as shown in Equation (2) below, and consider the output as the capsule map:

F_i^pose = F_i^cap · W′_i          (2)

where F_i^pose ∈ R^(H×W×l_1) represents the output of the pose transformation, l_1 represents the length of each capsule generated, W′_i ∈ R^(l×l_1) represents the transformation matrix for each capsule map, and · represents matrix multiplication. This operation makes it possible to obtain pose information of objects.
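For illustration rather than limitation, Equations (1) and (2) could be realised together as the following module: a convolution produces C_1 capsule maps of capsule length l, and a learnable matrix W′_i per map transforms each capsule to length l_1. The kernel size and the values of C_1, l and l_1 are assumptions, and the module name is hypothetical.

```python
import torch
import torch.nn as nn

class CapsuleGeneration(nn.Module):
    # Sketch of the G-step (Eq. (1)) and W-step (Eq. (2)) of the capsule layer.
    def __init__(self, in_ch, num_caps=32, l=8, l1=16, k=3):
        super().__init__()
        # One kernel W_i per capsule map, realised as a single conv with num_caps*l outputs.
        self.conv = nn.Conv2d(in_ch, num_caps * l, kernel_size=k, padding=k // 2)
        # Transformation matrices W'_i of shape (l, l1), one per capsule map.
        self.W_prime = nn.Parameter(0.01 * torch.randn(num_caps, l, l1))
        self.num_caps, self.l = num_caps, l

    def forward(self, F_feat):
        B, _, H, W = F_feat.shape
        caps = self.conv(F_feat).view(B, self.num_caps, self.l, H, W)   # Eq. (1)
        pose = torch.einsum('bilhw,ilm->bimhw', caps, self.W_prime)     # Eq. (2): per-map matrix multiply
        return pose   # (B, C1, l1, H, W): one length-l1 capsule per spatial position
```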
At block 520, the computing device obtains one or more capsule patches of the one or more capsule maps. The i-th capsule patch may be denoted as P_i hereafter.
In an embodiment, each capsule patch may be obtained by selecting a subset of capsules from a capsule map of the one or more capsule maps (for example, F_i^cap or F_i^pose).
Alternatively, in some embodiments, the one or more capsule patches may be obtained based on a sliding window, i.e., by applying a sliding window with a predetermined stride to the one or more capsule maps (for example, F_i^cap or F_i^pose). Each capsule patch may comprise h×w capsules, i.e., P_i = {P_(i,1), P_(i,2), ..., P_(i,h×w)}, where h < H and/or w < W.
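A sliding-window patch extraction consistent with this alternative might look as follows, operating on the pose tensor produced by the CapsuleGeneration sketch above; the window size h×w and the stride are assumptions.

```python
import torch

def capsule_patches(pose, h=2, w=2, stride=2):
    # Slide an (h x w) window with the given stride over each capsule map.
    # pose: (B, C1, l1, H, W) -> returns (B, C1, H', W', h*w, l1),
    # i.e. h*w capsules P_(i,1)..P_(i,h*w) per patch of map i.
    p = pose.unfold(3, h, stride).unfold(4, w, stride)   # (B, C1, l1, H', W', h, w)
    B, C1, l1, Hp, Wp, _, _ = p.shape
    return p.permute(0, 1, 3, 4, 5, 6, 2).reshape(B, C1, Hp, Wp, h * w, l1)
```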
At block 530, the computing device obtains weighted patches based on the one or more capsule patches, for example, using Equation (3):

P′_i = Σ_k c_k · P_(i,k)          (3)

where P_(i,k) stands for the k-th capsule in the i-th patch, c_k represents the k-th weight used for voting on the k-th capsule in each patch, and P′_i represents the i-th weighted patch, or in other words, a voting result of the patch P_i. It can be observed that the weighted patch P′_i comprises only one capsule, i.e., it has a smaller size compared with the patch P_i. In this way, subsampling is realized and, at the same time, the position information is kept. Note that the weights may be shared for all of the capsule patches in the same capsule map.
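One possible realisation of the voting in Equation (3) is sketched below; normalising the shared weights with a softmax is an assumption, since the exact "routing by agreement" update is only referenced later in this disclosure.

```python
import torch
import torch.nn.functional as F

def vote(patches, logits):
    # Eq. (3): one weight c_k per in-patch position k, shared across all patches
    # of the same capsule map. patches: (B, C1, H', W', K, l1); logits: (C1, K).
    c = F.softmax(logits, dim=-1)                            # voting weights c_k
    return torch.einsum('bchwkl,ck->bchwl', patches, c)      # one capsule P'_i per patch
```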
At block 540, the computing device constructs an updated capsule map by combining the weighted patches associated with the capsule map. Since the weighted patch P′_i has a smaller size compared with the patch P_i, the updated capsule map is smaller than the original capsule map.
At block 550, the computing device determines an activation for the updated capsule maps. Let S_R denote the updated capsule map; then the activation operation may be represented as:

S_A = A(S_R) = ||S_R||          (4)

where || · || denotes obtaining a length or module of each capsule of the updated capsule map, and S_A stands for the activation value obtained for the input capsule maps.
At block 560, the computing device outputs the activation S_A as the selected features.
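Blocks 540-560 then reduce to taking capsule lengths, for example as sketched below; the weighted patches from the previous sketch are already arranged as the updated capsule maps S_R, so no extra combination step is needed in this particular layout.

```python
import torch

def capsule_activation(updated_maps):
    # Eq. (4): the activation S_A is the length (L2 norm) of each capsule in S_R.
    # updated_maps: (B, C1, H', W', l1) -> returns (B, C1, H', W').
    return updated_maps.norm(dim=-1)
```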
As illustration rather than limitation, FIG. 6 shows an example process of property extraction and feature selection using the capsule layer (e.g., capsule layer 302 in FIG. 3). The example process may be implemented at block 420 of FIG. 4 or through blocks 510-560 in FIG. 5. The example process shown in FIG. 6 comprises four steps, denoted as the G-step, W-step, R-step and A-step, respectively.
The G-step is used for generating capsules based on the feature maps F 601 produced by, e.g., the convolution block 301 in FIG. 3. The G-step may be executed by conducting convolution operations to obtain C_1 primary capsule maps 602, for example using Equation (1), and the generated i-th capsule map may be represented as F_i^cap, i = 1, 2, ..., C_1. As analyzed in Hinton's paper titled “Dynamic routing between capsules”, published in Advances in Neural Information Processing Systems, 2017, at pages 3859-3869, each capsule may contain various properties (e.g., shapes, directions, positions, and so on) of an object (e.g., the objects 321-323 in FIG. 3) by l values, while a traditional feature map only contains one value. Therefore, both the input features F and the generated capsule maps F_i^cap contain position information of the objects of interest.
The W-step in FIG. 6 is for pose transformation, or in other words, extracting pose information of objects. In some embodiments, for each capsule map, the computing device may perform the pose transformation via Equation (2) above to obtain another output capsule map F_i^pose, denoted as 603.
The R-step in FIG. 6 is for selecting features, e.g., by routing and voting. As an example rather than limitation, by applying a sliding window with a predetermined stride to the capsule map F_i^pose, different capsule patches may be obtained for routing and voting. For a patch P_i included in the capsule map F_i^pose, a principle of ‘routing by agreement’ may be adopted for determining a voting weight. Results of the voting may be determined, for example, using Equation (3) above. Updated capsule maps 604 are constructed based on the capsule patches.
The A-step in FIG. 6 is for determining the activation of the capsules. For illustration rather than limitation, a length or module of a capsule vector may be utilized to represent the activation value of the capsule. That is, for each capsule, during the A-step, the computing device may obtain the length or module of the capsule vector using Equation (4) above, and use the activation S_A 605, which includes the one or more selected features, as the output of the capsule layer (e.g., capsule layer 302 in FIG. 3).
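Putting the G-, W-, R- and A-steps together, one possible capsule layer assembled from the sketches above could read as follows; every size, the zero-initialised voting logits and the softmax-based voting are assumptions rather than the prescribed implementation.

```python
import torch
import torch.nn as nn

class CapsuleLayer(nn.Module):
    # End-to-end sketch of the G/W/R/A steps in FIG. 6, built from the helper
    # sketches above (CapsuleGeneration, capsule_patches, vote, capsule_activation).
    def __init__(self, in_ch, num_caps=32, l=8, l1=16, h=2, w=2, stride=2):
        super().__init__()
        self.gen = CapsuleGeneration(in_ch, num_caps, l, l1)
        self.logits = nn.Parameter(torch.zeros(num_caps, h * w))   # shared voting weights
        self.h, self.w, self.stride = h, w, stride

    def forward(self, F_feat):
        pose = self.gen(F_feat)                                        # G-step + W-step
        patches = capsule_patches(pose, self.h, self.w, self.stride)   # R-step: patching
        updated = vote(patches, self.logits)                           # R-step: voting, Eq. (3)
        return capsule_activation(updated)                             # A-step, Eq. (4): (B, C1, H', W')
```

With a stride of 2 the spatial resolution is halved, as with a pooling layer, but the reduction comes from voting over whole capsules rather than discarding values, which is how the position information is retained.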
Now referring back to FIG. 4. At block 430, the computing device generates a detection result of the image based on the selected one or more features. However, it should be appreciated that, in some embodiments, the computing device may not generate the detection result directly based on the selected features; instead, it may perform further processing of the selected features to obtain one or more abstract features, and then generate the detection result based on the one or more abstract features.
As shown in FIG. 3, in some embodiments, the proposed CNN may comprise a plurality of convolutional blocks and a plurality of capsule layers, and in such a case, at block 430, the computing device may not generate the detection result directly based on the output of the first capsule layer, but may obtain one or more abstract features from the selected features via one or more of a further convolutional block (e.g., convolutional block 303 or 305 in FIG. 3) and a further capsule layer (e.g., capsule layer 304 in FIG. 3), and generate the detection result of the image based on the one or more abstract features.
For instance, the computing device may extract deeper hierarchical features by the convolution block 303 in FIG. 3. The convolution block 303 may include several convolutional layers and execute similar convolution operations as those of the convolution block 301.
Alternatively or in addition, the computing device may perform further feature selection using the capsule layer 304 in FIG. 3, in a similar way as that described with reference to FIGs. 5 and 6. The capsule layer 304 helps capture information on the position, shape and/or various other properties of the object. That is, position information can be maintained, which results in better performance than a traditional subsampling method. At the same time, by the routing and voting mechanism described with reference to Equation (3), features are fused and more representative features can be selected.
In some embodiments, the computing device may further extract features using the convolutional block 305 in FIG. 3. The convolution block 305 may include a stack of several convolution layers followed by an activation function for deeper hierarchical feature extraction. It is apparent that the features extracted by the convolution block 305 may be more abstract and may contain higher semantic information which is more suitable for the final detection.
Then, in some embodiments, at block 430 of FIG. 4, the computing device may obtain the detection results based on the more abstract features by, for example, an RPN and a detection module.
As in a PL-CNN based object detection method like Faster RCNN or R-FCN, the RPN in the PS-CNN proposed herein may be used for generating object proposals and regressing final bounding boxes with the detection module. In some embodiments, the computing device may output the bounding boxes (e.g., bounding boxes 331-333 in FIG. 3) which contain the objects.
Alternatively or in addition, in some embodiments, at block 430, the computing device may output a coordinate of an object as the detection result. In a further embodiment, the detection result may further comprise a corresponding category of the object (e.g., car, animal, or man), and/or a confidence value for the determined position of the object. FIG. 7 shows an example of the detection result, where bounding boxes of three objects in an image, categories of the objects and a confidence of the detection for each object are output.
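As a purely illustrative picture of such an output, a detection result per image could be serialised as below; the field names, labels and values are hypothetical.

```python
# Hypothetical detection output: one entry per detected object, in the style of FIG. 7.
detections = [
    {"bbox": [112.0, 80.5, 215.0, 190.0], "category": "car",    "confidence": 0.93},
    {"bbox": [310.2, 95.0, 360.8, 180.4], "category": "person", "confidence": 0.88},
]
```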
For illustration purposes, an example process for object detection with 3 convolutional blocks and 2 capsule layers is schematically shown in FIG. 8. In this example, based on an input image, the computing device extracts (810) features via a first convolutional block; then, via a first capsule layer, the computing device selects (820) one or more features. Note that during feature selection, position information of objects in the image is kept. The capsule layer may include operations as shown in FIG. 6. The selected features are further processed (830, 840) via a second convolutional block and a second capsule layer, where similar operations as those of the first convolutional block and the first capsule layer may be performed. More abstract hierarchical features are extracted (850) via a third convolutional block. Then object proposals are generated (860) via the RPN, and detection results are obtained (870) via a detection module.
Note that, depending on requirements of a target application, the architecture of the PS-CNN proposed herein may be adjusted, for example, to include more or fewer convolutional blocks and/or capsule layers. Furthermore, proper parameters for image processing using the proposed CNN may be obtained in advance via computer simulation, or via training.
For instance, parameters for implementing the method 400 or 500 or the PS-CNN according to an embodiment of the present disclosure may include: the convolutional filters W_i in Equation (1), the transformation matrices W′_i in Equation (2), and the voting weights c_k in Equation (3). In some embodiments, the parameters may further include, but are not limited to: H, W, l, C_1, h, w and l_1 in Equations (1)-(4).
The parameters may be learned iteratively to obtain an optimal solution for object detection. For illustration rather than limitation, the convolutional filters and transformation matrices may be updated as in the traditional PL-CNN by using a stochastic gradient descent algorithm. For example, the voting weights c_k may be updated using a routing mechanism proposed in the paper titled “Dynamic routing between capsules”, published in Advances in Neural Information Processing Systems, 2017, at pages 3859-3869, by S. Sabour, N. Frosst, and G. Hinton.
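A minimal optimiser setup for these parameters, assuming the sketched modules above and plain stochastic gradient descent for all of them (the cited routing mechanism could instead drive the voting weights), might be:

```python
import torch.optim as optim

model = PSCNNBackbone(CapsuleLayer)   # hypothetical assembly of the earlier sketches
# model.parameters() covers the convolutional filters W_i, the transformation
# matrices W'_i and the voting logits behind the weights c_k.
optimizer = optim.SGD(model.parameters(), lr=0.01, momentum=0.9)
```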
For illustration rather than limitation, in some embodiments, designing the architecture of a PS-CNN may include determining a configuration for each convolution block and capsule layer, as well as for the RPN and the detection module. For instance, one of the Inception Block, Densenet Block, VGG Block, and ResNet Block may be adopted as a basic convolution block (e.g., convolutional blocks 301, 303, and 305) of the PS-CNN. However, it should be appreciated that embodiments are not limited thereto, and other convolutional blocks may be used based on needs.
In some embodiments, the backbone network, without the RPN and detection module but with fully connected layers, may be pre-trained, e.g., on ImageNet for an image classification task. Then the pre-trained parameters are used for initializing the PS-CNN for object detection and/or semantic segmentation.
In some embodiments, a set of training images and their labels are predetermined and are input to the PS-CNN for training. The PS-CNN may then be trained by forward propagation (e.g., by performing method 400) and backward propagation for determining gradients of the parameters. Such a process may be performed iteratively until the results converge, to obtain optimized parameters. Note that embodiments are not limited to any specific way of training.
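For illustration rather than limitation, one training iteration matching this description is sketched below, continuing from the optimiser setup above; train_loader, detection_head, detection_loss and num_epochs are hypothetical stand-ins that the disclosure does not define.

```python
# Iterative training sketch: forward propagation through the sketched PS-CNN,
# a placeholder detection loss, then backward propagation, repeated until convergence.
for epoch in range(num_epochs):
    for images, labels in train_loader:
        features = model(images)                                  # forward pass (method 400)
        loss = detection_loss(detection_head(features), labels)   # RPN/detection loss (placeholder)
        optimizer.zero_grad()
        loss.backward()                                           # gradients of W_i, W'_i, c_k
        optimizer.step()
```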
For a fair comparison, a VGG-16 detection backbone network as well as an RPN and detection module the same as those in Faster RCNN are used in the evaluation, but the subsampling layers are replaced with the capsule layers proposed in the present disclosure. Results of the evaluation show that the proposed image processing solution using the PS-CNN achieves 46.3% mAP for object detection, which is much better than the performance of Faster RCNN (which only achieves 42.7% mAP). In addition, when compared with an R-FCN with a similar backbone network, a 3.2% mAP improvement is obtained by the proposed PS-CNN. This notable improvement demonstrates the effectiveness of the solution proposed herein. In other words, the evaluation results show that the proposed method based on the PS-CNN achieves better detection performance than conventional PL-CNNs (e.g., R-FCN and Faster RCNN) due to more precise position information and better selected features obtained by the capsule layer(s).
As described above, a task of object detection is to detect objects in a scene image or video. Forward computation as described with reference to FIGs. 4-8 may be conducted based on a well-trained PS-CNN, with the scene image or a frame of the video as an input. Then, the PS-CNN outputs coordinates of the objects, and optionally categories and confidences. In addition, for visualization, each object may be precisely surrounded by a rectangular box, which is also called a bounding box herein.
Object detection with the proposed PS-CNN may be widely used in practice, e.g., in ADAS, autonomous vehicles, and so on. However, it should be appreciated that object detection is just an instance of position sensitive tasks. The proposed solution is also suitable for other position sensitive tasks, such as semantic segmentation, with adjustments to the final RPN and detection module if necessary. Therefore, embodiments of the present disclosure may be applied to a broad range of application scenarios.
In some embodiments, an apparatus which may be implemented in/as a computing device is provided. The computing device may include, but is not limited to, a camera device, a vehicle installed with the camera device, a vehicle with ADAS, a vehicle with an autonomous driving system, a drone installed with the camera device, and an industrial robot with the camera device. The apparatus may be used for image processing and comprises: means for extracting a plurality of features of an image via a convolutional block of a CNN, wherein the convolutional block includes a plurality of convolutional layers, and the plurality of features represent position information of an object in the image; means for selecting one or more features from the plurality of features via a capsule layer of the CNN, wherein the one or more features are sufficient to maintain the position information; and means for generating a detection result of the image based on the selected one or more features. For instance, and without limitation, in some embodiments, the means may comprise at least one processor; and at least one memory including computer program code, the at least one memory and computer program code configured to, with the at least one processor, cause the performance of the apparatus.
FIG. 9 illustrates a simplified block diagram of another apparatus 900 that may be embodied in/as a computing device or an apparatus which may include, but is not limited to, a camera device, a vehicle installed with the camera device, a drone installed with the camera device, an industrial robot with the camera device, etc.
As shown by the example of FIG. 9, apparatus 900 comprises a processor 910 which controls operations and functions of apparatus 900. For example, in some embodiments, the processor 910 may implement various operations by means of instructions 930 stored in a memory 920 coupled thereto. The memory 920 may be any suitable type  adapted to local technical environment and may be implemented using any suitable data storage technology, such as semiconductor based memory devices, magnetic memory devices and systems, optical memory devices and systems, fixed memory and removable memory, as non-limiting examples. In some example embodiments the memory 920 can be a non-transitory computer readable medium. Though only one memory unit is shown in FIG. 9, a plurality of physically different memory units may exist in apparatus 900.
The processor 910 may be any proper type adapted to local technical environment, and may include one or more of general purpose computers, special purpose computers, microprocessors, digital signal processors (DSPs) , central processing units (CPUs) , field-programmable gate arrays (FPGA) , application specific circuits (ASIC) , GPUs (Graphics Processing Unit) , NPUs (Neural Network Processing Unit) , AI (Artificial Intelligence) accelerators and processors based on multicore processor architecture, as non-limiting examples. The apparatus 900 may also comprise a plurality of processors 910 in any combination thereof.
The processor 910 may also be coupled with one or more radio transceivers 940 which enable reception and transmission of information over wireless communication means. In some embodiments, the radio transceiver(s) 940 may comprise wireless communication means (e.g. wireless networking means, wireless telecommunication means, means for communicating according to Long Term Evolution (LTE), the fifth generation (5G) communication, Narrow Band Internet of Things (NB-IoT), Long Range Wide Area Network (LoRaWAN), Dedicated short-range communications (DSRC), and/or Wireless Local Area Network (WLAN) communication standards, etc.) which allow the apparatus 900 to communicate with other devices/apparatuses, for example, in vehicle-to-vehicle (V2V), vehicle-to-anything (V2X), peer-to-peer (P2P), etc. manners, and send and receive image detection related information.
In some embodiments, the processor 910 and the memory 920 may operate in cooperation to implement any of the  methods  400 or 500 described with reference to FIGs. 3-7. It shall be appreciated that all the features described above with reference to FIGs. 3-8 also apply to apparatus 900, and therefore will not be detailed here.
Various embodiments of the present disclosure may be implemented by a computer program or a computer program product executable by one or more of the processors (for example processor 910 in FIG. 9), software, firmware, hardware or a combination thereof.
Although some embodiments are described in the context of object detection, it should not be construed as limiting the spirit and scope of the present disclosure. The principle and concept of the present disclosure may be more generally applicable to semantic segmentation, and other position sensitive application scenarios.
In addition, the present disclosure may also provide a carrier containing the computer program as mentioned above (e.g., computer instructions/program 930 in FIG. 9). The carrier includes a computer readable storage medium. The computer readable storage medium may include, for example, an optical compact disk or an electronic memory device like a RAM (random access memory), a ROM (read only memory), Flash memory, magnetic tape, CD-ROM, DVD, Blu-ray disc and the like.
FIG. 10 depicts an example of a system 1000 including a machine learning model according to an embodiment of the present disclosure. The system 1000 may be mounted in a vehicle 1090, such as a car or truck, although the system 1000 may be used without the vehicles 1090 as well. The vehicle 1090 may be considered as an example of an apparatus according to an embodiment of the present disclosure, and may be used, for example, in an ADS application illustrated in FIG. 2.
As shown in FIG. 10, the example system 1000 includes one or more sensors 1005 (e.g., a camera) and a CNN 1010 or any other machine learning algorithm or any combination thereof, in accordance with some example embodiments. In some embodiments, the CNN 1010 may include one or more convolutional blocks and one or more capsule layers, as shown in FIG. 3.
The system 1000 may also include one or more radio frequency transceivers 1015. In some embodiments, the radio frequency transceiver 1015 may include wireless communication means (e.g. wireless networking means, wireless telecommunication means, means for communicating according to LTE, 5G, NB-IoT, LoRaWAN, DSRC, and/or WLAN standards, etc. ) which allows the system 1000 or the vehicle 1090 to communicate with other one or more devices, apparatus or vehicles or any combination thereof for example in V2V, V2X, P2P, etc. manners, and send and receive image detection related information.
The sensor 1005 may comprise at least one image sensor configured to provide image data, such as image frames, video, pictures, and/or the like. In the case of advanced driver assistance systems and/or autonomous vehicles for example, the sensor 1005 may comprise a camera, a Lidar (light detection and ranging) sensor, a millimeter wave radar, an infrared camera, and/or other types of sensors.
In some example embodiments, the system 1000 may include (but is not limited to) a location detection and determination system, such as a Global Navigation Satellite (GNSS) System with its subsystems, for example, Global Position System (GPS) , GLONASS, BeiDou Navigation Satellite System (BDS) and Galileo Navigation Satellite System etc.
Alternatively or in addition, in some example embodiments, the system 1000 may be trained to detect objects, such as people, animals, other vehicles, traffic signs, road hazards, and/or the like according to, for example, method 400. For instance, with the system 1000, the vehicle 1090 may detect objects 203-205 in FIG. 2 and their relative and/or absolute locations (e.g., longitude, latitude, and altitude/elevation, and/or coordinate) .
In the advanced driver assistance system (ADAS), when an object is detected, such as a vehicle/person, an output such as a warning sound, haptic feedback, indication of a recognized object, or other indication may be generated to, for example, warn or notify a driver. In the case of an autonomous vehicle including system 1000, the detected objects may signal control circuitry to take additional action in the vehicle (e.g., initiate braking, acceleration/deceleration, steering and/or some other action). Moreover, the indication may be transmitted to other vehicles, IoT devices or a cloud, mobile edge computing (MEC) platform and/or the like via the radio transceiver 1015.
For illustration rather than limitation, the CNN 1010 may be implemented in at least one CNN circuitry, in accordance with some example embodiments. The CNN circuitry may represent dedicated CNN circuitry configured with a neighbor-based activation function, g, taking into account neighbors. The dedicated CNN circuitry may provide a deep CNN. Alternatively or additionally, the CNN 1010 or the CNN circuitry may be implemented in other ways such as, using at least one memory including program code which when executed by at least one processor provides the CNN 1010. In some embodiments, the CNN circuitry may implement one or more embodiments for image detection described with reference to FIGs. 3-8.
In some example embodiments, the system 1000 may have a training phase within the system 1000. The training phase may configure the CNN 1010 to learn to detect and/or classify one or more objects of interest. Referring to the previous example, the CNN circuitry may be trained with images including objects such as people, other vehicles, road hazards, and/or the like. Once trained, when an image includes the object(s), the trained CNN 1010 may detect the object(s) and provide an indication of the detection/classification of the object(s). In the training phase, the CNN 1010 may learn its configuration (e.g., parameters, weights, and/or the like). Once trained, the configured CNN can be used in a test or operational phase to detect and/or classify patches or portions of an unknown input image and thus determine whether that input image includes an object of interest or just background (i.e., not having an object of interest). In some other example embodiments, the training phase can be executed outside the system 1000, for example in a cloud system, wherein the system and the cloud are connected over wired and/or wireless network communication means. In some other alternative embodiments, the training phase can be divided between the system 1000 and the cloud system.
The techniques described herein may be implemented by various means so that an apparatus implementing one or more functions of a corresponding apparatus described with an embodiment comprises not only prior art means, but also means for implementing the one or more functions of the corresponding apparatus and it may comprise separate means for each separate function, or means that may be configured to perform two or more functions. For example, these techniques may be implemented in hardware (e.g., circuit or a processor) , firmware, software, or combinations thereof. For a firmware or software, implementation may be made through modules (e.g., procedures, functions, and so on) that perform the functions described herein.
Some example embodiments herein have been described above with reference to block diagrams and flowchart illustrations of methods and apparatuses. It will be appreciated that each block of the block diagrams and flowchart illustrations, and combinations of blocks in the block diagrams and flowchart illustrations, respectively, may be implemented by various means including computer program instructions. These computer program instructions may be loaded onto a general purpose computer, special purpose computer, or other programmable data processing apparatus to produce a machine, such that the instructions which execute on the computer or other programmable data processing apparatus create means for implementing the functions specified in the flowchart block or blocks.
While this specification contains many specific implementation details, these should not be construed as limitations on the scope of any implementation or of what may be claimed, but rather as descriptions of features that may be specific to particular embodiments of particular implementations. Certain features that are described in this specification in the context of separate embodiments may also be implemented in combination in a single embodiment. Conversely, various features that are described in the context of a single embodiment can also be implemented in multiple embodiments separately or in any suitable sub-combination. Moreover, although features may be described above as acting in certain  combinations and even initially claimed as such, one or more features from a claimed combination can in some cases be excised from the combination, and the claimed combination may be directed to a sub-combination or variation of a sub-combination.
It will be obvious to a person skilled in the art that, as the technology advances, the inventive concept may be implemented in various ways. The above described embodiments are given for describing rather than limiting the disclosure, and it is to be understood that modifications and variations may be resorted to without departing from the spirit and scope of the disclosure as those skilled in the art readily understand. Such modifications and variations are considered to be within the scope of the disclosure and the appended claims. The protection scope of the disclosure is defined by the accompanying claims.

Claims (28)

  1. An apparatus for image processing, comprising:
    at least one processor; and
    at least one memory including computer program codes;
    the at least one memory and the computer program codes are configured to, with the at least one processor, cause the apparatus at least to:
    extract a plurality of features of an image via a convolutional block of a convolutional neural network, CNN, the convolutional block including a plurality of convolutional layers, and the plurality of features including position information of an object in the image;
    select one or more features from the plurality of features via a capsule layer of the CNN, the one or more features being sufficient to maintain the position information; and
    generate a detection result of the image based on the selected one or more features.
  2. The apparatus of Claim 1, wherein the at least one memory and the computer program codes are configured to, with the at least one processor, cause the apparatus to
    generate the detection result of the image, further comprising:
    obtain one or more abstract features from the selected one or more features via at least one of:
    one or more further convolutional blocks, and
    one or more further capsule layers; and
    generate the detection result of the image based on the one or more abstract features.
  3. The apparatus of Claim 1, wherein the at least one memory and the computer program codes are configured to, with the at least one processor, cause the apparatus to
    select the one or more features from the plurality of features, further comprising:
    generate one or more capsule maps based on the plurality of features, each of the capsule maps including a set of capsules containing one or more features of the object;
    obtain one or more capsule patches of the one or more capsule maps;
    obtain weighted patches based on the one or more capsule patches;
    construct updated capsule maps by combining the weighted patches;
    determine an activation for the updated capsule maps; and
    output the activation as the selected features.
  4. The apparatus of Claim 3, wherein the at least one memory and the computer program codes are configured to, with the at least one processor, cause the apparatus to
    generate the one or more capsule maps, further comprising:
    generate each of the capsule maps by conducting a convolutional operation of the extracted plurality of features and a convolutional kernel.
  5. The apparatus of Claim 4, wherein the at least one memory and the computer program codes are configured to, with the at least one processor, cause the apparatus to
    generate each of the one or more capsule maps, further comprising:
    obtain a capsule map by applying a transformation matrix to a result of the convolutional operation.
  6. The apparatus of Claim 3, wherein the at least one memory and the computer program codes are configured to, with the at least one processor, cause the apparatus to obtain the one or more capsule patches further comprising:
    apply a sliding window with a predetermined stride to the one or more capsule maps.
  7. The apparatus of Claim 3, wherein the at least one memory and the computer program codes are configured to, with the at least one processor, cause the apparatus to determine the activation for the updated capsule maps further comprising:
    determine a length of each capsule included in the capsule maps.
  8. The apparatus of Claim 2, wherein the at least one memory and the computer program codes are configured to, with the at least one processor, cause the apparatus to generate the detection result of the image, further comprising:
    determine a position of the object in the image based on the selected one or more features; and
    output a coordinate or a bounding box for the object.
  9. The apparatus of Claim 1, wherein the detection result further comprises one or more of:
    a category of the object, and
    a confidence value for the determined position.
  10. The apparatus of Claim 9, wherein the at least one memory and the computer program codes are configured to, with the at least one processor, further to:
    determine parameters for the one or more convolutional blocks and the one or more capsule layers via training.
  11. An apparatus for image processing, comprising:
    means for extracting a plurality of features of an image via a convolutional block of a convolutional neural network, CNN, the convolutional block including a plurality of convolutional layers, and the plurality of features including position information of an object in the image;
    means for selecting one or more features from the plurality of features via a capsule layer of the CNN, the one or more features selected being sufficient to maintain the position information; and
    means for generating a detection result of the image based on the selected one or more features.
  12. The apparatus of claim 11, wherein the means comprises
    at least one processor; and
    at least one memory including computer program code, the at least one memory and computer program code configured to, with the at least one processor, cause the performance of the apparatus.
  13. A method of image processing, comprising:
    extracting a plurality of features of an image via a convolutional block of a convolutional neural network, CNN, the convolutional block including a plurality of convolutional layers and the plurality of features including position information of an object in the image;
    selecting one or more features from the plurality of features via a capsule layer of the CNN, the one or more features selected being sufficient to maintain the position information; and
    generating a detection result of the image based on the selected one or more features.
  14. The method of Claim 13, wherein generating the detection result of the image based on the selected one or more features comprises:
    obtaining one or more abstract features from the selected one or more features via at least one of:
    one or more further convolutional blocks, and
    one or more further capsule layers; and
    generating a detection result of the image based on the one or more abstract features.
  15. The method of Claim 13, wherein selecting the one or more features comprises:
    generating one or more capsule maps based on the plurality of features, each of the capsule maps including a set of capsules containing one or more features of the object;
    obtaining one or more capsule patches of the one or more capsule maps;
    obtaining weighted patches based on the one or more capsule patches;
    constructing updated capsule maps by combining the weighted patches;
    determining an activation for the updated capsule maps; and
    outputting the activation as the selected features.
  16. The method of Claim 15, wherein generating the one or more capsule maps comprises:
    generating each of the one or more capsule maps by conducting a convolutional operation of the extracted plurality of features and a convolutional kernel.
  17. The method of Claim 16, wherein generating each of the one or more capsule maps further comprises:
    obtaining a capsule map by applying a transformation matrix to a result of the convolutional operation.
  18. The method of Claim 15, wherein obtaining the one or more capsule patches of the one or more capsule maps comprises:
    obtaining the one or more capsule patches by applying a sliding window with a predetermined stride to the one or more capsule maps.
  19. The method of Claim 15, wherein determining the activation for the updated capsule maps comprises:
    determining a length of each capsule included in the capsule maps.
  20. The method of Claim 13, wherein generating the detection result of the image based on the selected one or more features comprises:
    determining a position of the object in the image based on the selected one or more features; and
    outputting a coordinate or a bounding box for the object as the detection result.
  21. The method of Claim 20, wherein the detection result further comprises one or more of:
    a category of the object, and
    a confidence value for the determined position.
  22. The method of Claim 13, further comprising:
    determining parameters for the one or more convolutional blocks and the one or more capsule layers via training.
  23. A computer readable medium having a computer program stored thereon which, when executed by at least one processor of a device, causes the device to carry out the method of any of claims 13-22.
  24. A non-transitory computer readable medium comprising program instructions for causing an apparatus to perform at least the following:
    extracting a plurality of features of an image via a convolutional block of a convolutional neural network, CNN, the convolutional block including a plurality of convolutional layers and the plurality of features including position information of an object in the image;
    selecting one or more features from the plurality of features via a capsule layer of the CNN, the one or more features selected being sufficient to maintain the position information; and
    generating a detection result of the image based on the selected one or more features.
  25. A computing device, comprising an apparatus according to any of Claims 1 to 12.
  26. The computing device of Claim 25, wherein the computing device includes one of:
    a camera device,
    a vehicle installed with the camera device,
    a drone installed with the camera device, and
    a robot with the camera device.
  27. The apparatus for image processing of Claim 1, wherein the apparatus includes one of:
    a camera device,
    a vehicle installed with the camera device,
    a drone installed with the camera device, and
    a robot with the camera device.
  28. An apparatus for image processing, comprising:
    at least one processor; and
    at least one memory including computer program codes;
    the at least one memory and the computer program codes are configured to, with the at least one processor, cause the apparatus at least to:
    extract a plurality of features of an image via one or more convolutional blocks of a convolutional neural network (CNN) , each convolutional block including a plurality of convolutional layers, and the plurality of features including position information of an object in the image;
    select one or more features from the plurality of features via one or more capsule layers of the CNN, the one or more features being sufficient to maintain the position information; and
    generate a detection result of the image based on the selected one or more features.
Non-Patent Citations (2)
SARA SABOUR ET AL.: "Dynamic Routing Between Capsules", 31st Conference on Neural Information Processing Systems, 2017, pages 3856-3866.
YEQUAN WANG ET AL.: "Sentiment Analysis by Capsules", Proceedings of the 2018 World Wide Web Conference, 27 April 2018, pages 1165-1174.
