WO2019110824A1 - Using silhouette for fast object recognition - Google Patents

Using silhouette for fast object recognition

Info

Publication number
WO2019110824A1
WO2019110824A1 · PCT/EP2018/084035 · EP2018084035W
Authority
WO
WIPO (PCT)
Prior art keywords
image
object recognition
silhouette
objects
separated
Prior art date
Application number
PCT/EP2018/084035
Other languages
French (fr)
Inventor
Sylvain Bougnoux
Original Assignee
Imra Europe S.A.S.
Priority date
Filing date
Publication date
Application filed by Imra Europe S.A.S. filed Critical Imra Europe S.A.S.
Priority to JP2020528326A priority Critical patent/JP7317009B2/en
Publication of WO2019110824A1 publication Critical patent/WO2019110824A1/en

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00Scenes; Scene-specific elements
    • G06V20/50Context or environment of the image
    • G06V20/56Context or environment of the image exterior to a vehicle by using sensors mounted on the vehicle
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/23Clustering techniques
    • G06F18/232Non-hierarchical techniques
    • G06F18/2323Non-hierarchical techniques based on graph theory, e.g. minimum spanning trees [MST] or graph cuts
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/25Fusion techniques
    • G06F18/251Fusion techniques of input or preprocessed data

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Theoretical Computer Science (AREA)
  • Image Analysis (AREA)

Abstract

An object recognition method comprising the steps of obtaining an image from an image sensor and a 3D point cloud from a depth sensor; synchronizing the image and the 3D point cloud; 3D point clustering to separate objects from the 3D point cloud; extracting silhouettes by segmentation of the image using the 3D point clustering, and contour detection of the separated objects into the segmented image; recognizing silhouette by transforming each detected contour into a silhouette descriptor, and classifying these silhouette descriptors into recognized objects using a trained neural network for object recognition.

Description

USING SILHOUETTE FOR FAST OBJECT RECOGNITION
FIELD OF THE INVENTION
The present invention generally relates to machine learning techniques using neural networks. In particular, the present invention relates to an object recognition method using silhouette detection.
Such a method is especially useful in the field of human-assisted or autonomous vehicles using sensors for obstacle detection and avoidance, to navigate safely through their environment.
BACKGROUND OF THE INVENTION
According to Wikipedia, autonomous cars are being developed with deep learning, i.e. neural networks. The neural network depends on an extensive amount of data extracted from real-life driving scenarios. The neural network is activated and “learns” to perform the best course of action. In addition, sensors such as the LIDAR sensors already used in self-driving cars, cameras to detect the environment, and precise GPS navigation will be used in autonomous cars.
Despite all the recent improvements made to these new technologies for autonomous cars, several drawbacks remain, such as detection and behavior in the case of rare or unseen driving situations, as well as the necessary compromise between the increasing need for computing power on the one hand and the absolute need for high-speed processing of all the information collected by the vehicle sensors on the other.
SUMMARY OF THE INVENTION
The present invention aims to recognize objects using silhouette detection, i.e. the ability to recognize and group occlusion contours, which is a key component of human vision for avoiding dangers. The aim is to reproduce this key component and offer this ability to computer vision for many applications such as driver’s assistance, automatic driving or robotics in general. According to a first aspect, the invention relates to an object recognition method comprising the steps of:
- obtaining an image from an image sensor and a 3D point cloud from a depth sensor;
- synchronizing the image and the 3D point cloud;
- clustering the 3D points to separate objects from the 3D point cloud;
- extracting silhouettes by
o segmentation of the image using the 3D point clustering, and
o contour detection of the separated objects into the segmented image;
- recognizing silhouette by
o transforming each detected contour into a silhouette descriptor, and
o classifying these silhouette descriptors into recognized objects using a trained neural network for object recognition.
This method presents several advantages, among which robustness thanks to the combined use of the information contained in the image taken by the image sensor and the information contained in the 3D point cloud obtained by the depth sensor, even in bad conditions such as low light. This approach is also generic, as object recognition through silhouettes may be applied to all object categories (humans, poles, trees, animals, vehicles, etc.). Computational costs are low, and computation is fast: pixel distribution analysis requires a lot of processing, while silhouette recognition requires much less. Although silhouette processing does not provide a full description of a scene, it nevertheless provides a fundamental cue and a core technology for fast detection, with good performance for detecting potential danger within a scene.
Advantageously, the image taken by the image sensor is made of a plurality of pixels and the segmentation of the image step comprises the sub-steps of
- graph-cutting each separated object from the 3D point clustering step by
o projecting on the image all 3D points from the 3D point clustering step corresponding to the separated object under consideration;
o assessing the projected 3D points as belonging either to the separated object under consideration or to a background;
o assessing each pixel of the image as belonging either to the separated object under consideration, to a background, or to an unknown state, using a pixel weight based on color difference and/or distance between two neighboring pixels;
o adjusting the pixel weight for each pixel belonging to the unknown state based on its distance to the pixels belonging to the separated objects and its distance to the pixels belonging to the background;
- outputting for each separated object a black and white mask of pixels representative of the background and the separated object under consideration in the form of one or several blobs.
The extraction of the silhouette is done by using graph-cut technology to perform the segmentation of the image with the collaboration of the 3D point clouds. The introduction of 3D information proves to be a major advantage, as complex associations can be made in the image, a cluttered scene can be separated, and poles can be seen even in completely saturated locations.
Advantageously, the contour detection step comprises the sub-steps of:
- assessing for each blob a distance based on the 3D point clustering of the corresponding separated object;
- combining all the blobs in a single image by drawing them from furthest to closest and assigning them a different label for further identification, resulting in a superimposed blobs image;
- extracting the contour from the superimposed blobs image corresponding to separated objects;
- determining fake contour portions for each pixel of the contour assessed with a distance belonging to a closer blob.
Such contour detection using both 2D and 3D information makes it easy to separate the real contours from the fake ones due to occlusion, which is of major importance for object recognition.
Advantageously, the silhouette descriptor is a 1D descriptor using a constant description length, and preferably the descriptor has a reduced length of between 100 and 300 float numbers for an image of more than 1 million pixels. Using a constant and reduced length for the descriptor ensures fast recognition and allows the number of hidden layers of the neural network to be reduced, for instance to two layers with respectively 800 and 600 units.
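By way of illustration only, and without limiting the claimed method, a possible embodiment of such a constant-length descriptor and of the small classification network can be sketched in Python as follows; the arc-length resampling, the Fourier-magnitude signature, the length of 256 floats and the use of scikit-learn are assumptions rather than features disclosed above.

```python
import numpy as np
from sklearn.neural_network import MLPClassifier

def silhouette_descriptor(contour_xy, length=256):
    """Constant-length 1D descriptor of a closed contour (illustrative choice:
    magnitudes of the Fourier coefficients of the arc-length-resampled contour)."""
    pts = np.asarray(contour_xy, dtype=np.float64)        # (N, 2) pixel coordinates
    seg = np.linalg.norm(np.diff(pts, axis=0, append=pts[:1]), axis=1)
    perim = float(seg.sum())
    s = np.concatenate(([0.0], np.cumsum(seg)[:-1]))      # arc length at each vertex
    t = np.linspace(0.0, perim, length, endpoint=False)   # resample to a fixed size
    x = np.interp(t, s, pts[:, 0], period=perim)
    y = np.interp(t, s, pts[:, 1], period=perim)
    z = (x - x.mean()) + 1j * (y - y.mean())              # translation invariance
    mag = np.abs(np.fft.fft(z))
    return mag / (mag[1] + 1e-9)                          # rough scale invariance

# Hypothetical classifier mirroring the "two hidden layers of 800 and 600 units" above.
classifier = MLPClassifier(hidden_layer_sizes=(800, 600), max_iter=500)
# classifier.fit(np.stack([silhouette_descriptor(c) for c in training_contours]), labels)
# predicted = classifier.predict(np.stack([silhouette_descriptor(c) for c in contours]))
```

A 256-float signature stays within the 100 to 300 float range quoted above while being invariant to translation and, after normalization by its first harmonic, roughly invariant to scale.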
Advantageously, the method further comprises the step of combining the object recognition neural network with at least one other trained neural network for object prediction within the image, so as to form an end-to-end neural network for object recognition and prediction.
Using silhouette recognition gives flexibility for extending the method to at least one other neural network. As a core technology, silhouettes may be used for more elaborate tasks such as danger perception.
According to another aspect, the invention relates to an assisted or autonomous vehicle comprising:
- an image sensor unit arranged to capture an image;
- a depth sensor arranged to obtain a 3D point cloud;
- a synchronising unit to (temporally and/or spatially) synchronize the image and the 3D point cloud;
- a processing unit arranged to recognize objects within the image according to the object recognition method of any of claims 1 to 6;
- a control unit arranged to control the vehicle based on the recognized objects.
Advantageously, the assisted or autonomous vehicle further comprises a display unit arranged to display information related to the recognized objects and/or an assisted or autonomous driving unit arranged to plan a safe path depending on the recognized objects; and wherein the control unit is arranged to activate at least one of the display unit and the assisted or autonomous driving unit.
BRIEF DESCRIPTION OF THE DRAWINGS
Other features and advantages of the present invention will appear more clearly from the following detailed description of particular non-limitative examples of the invention, illustrated by the appended drawings where:
- Figure 1 represents an object recognition method according to a first embodiment of the invention;
- Figure 2A represents a preferred embodiment for the image segmentation step;
- Figure 2B represents a preferred embodiment for the contour detection step;
- Figure 3 represents a vehicle equipped with the necessary units to implement the method according to the invention.
DETAILED DESCRIPTION OF THE INVENTION
Before describing in more detail the different embodiments of the present invention, here is a reminder of the definition of a silhouette, a term that will often be used, as well as some general considerations about the interest of using silhouettes in computer vision for autonomous cars or the like.
A silhouette is a contour turning around a set of pixels; it is the border running in between pixels. The silhouette quantum (smallest element, i.e. 1 pixel long) can only have 2 orientations (vertical or horizontal) and 4 directions (up, right, down, left). Note that a single silhouette can be made of many blobs, e.g. if split by an occlusion. We define the interior and the exterior by turning clockwise, i.e. the interior is on the right side of the run.
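Purely as an illustrative sketch of this definition, a silhouette made of such quanta can be stored as a start pixel followed by a clockwise sequence of 4-directional unit moves; the Python encoding below is a hypothetical choice, not a representation disclosed above.

```python
from typing import List, Tuple

# One silhouette quantum = a 1-pixel move in one of the 4 directions (image coordinates, y pointing down).
MOVES = {"up": (0, -1), "right": (1, 0), "down": (0, 1), "left": (-1, 0)}

def chain_to_points(start: Tuple[int, int], chain: List[str]) -> List[Tuple[int, int]]:
    """Replay a clockwise 4-directional chain code into the border points it visits."""
    points = [start]
    x, y = start
    for move in chain:
        dx, dy = MOVES[move]
        x, y = x + dx, y + dy
        points.append((x, y))
    return points

# A small square run clockwise: with this convention the interior lies on the right of the run.
square = chain_to_points((0, 0), ["right", "right", "down", "down", "left", "left", "up", "up"])
```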
The main advantage of silhouettes is the global enhancement of perception in general. They are quite accurate in both 2D and 3D vision and, being generic, allow all objects to be seen as identified clusters and most of them to be recognized. Perceiving silhouettes, i.e. the ability to recognize and group occlusion contours, is a key component of human vision, giving the ability to avoid dangers as fast as possible. It is done through separation and recognition of the dangers. Our aim is to mimic this key component and offer this ability to computer vision for many applications (driver’s assistance, automatic driving, and robotics in general) instead of pixel distribution analysis (the classical approach), which is possible but requires a lot of processing time and power. Moreover, silhouette technology can also be the starting point of other functions (such as action prediction), because the pose of humans is much easier to understand as input to action recognition.
Using 2D and 3D image/depth sensors in collaboration provides robustness: while the camera (2D vision) is limited by some external conditions such as low light, the depth sensor (3D vision) is not able, for instance, to distinguish the status of a traffic light.
Figure 1 represents an object recognition method according to a first embodiment of the invention. The object recognition method comprises the steps of:
S1 : obtaining an image from an image sensor and a 3D point cloud from a depth sensor;
S2: synchronizing the image and the 3D point cloud;
S3: clustering 3D points to separate objects from the 3D point cloud;
S4: extracting silhouettes by segmentation of the image using the 3D point clustering (S41), and contour detection of the separated objects into the segmented image (S42);
S5: recognizing silhouette by transforming each detected contour into a silhouette descriptor (S51), and classifying these silhouette descriptors into recognized objects using a trained neural network for object recognition (S52).
The object separation task is made possible with the help of 3D information given by a depth sensor, such as a laser light scanning unit (LIDAR) taking a continuous series of 3D point clouds. Namely, silhouette extraction is done by using the 3D information to separate objects and by extracting their contours in an image. Then silhouette recognition is done via a descriptor and a classifier.
The sensors collaborate in the sense that the 3D information is used to indicate where to create and look for objects; further information is then taken from the image, with its dense pixel information. For instance, a usual depth sensor such as a 64-plane Lidar allows a pedestrian to be perceived with enough confidence up to 25 m, even without the image. When going farther, or when using a Lidar with fewer planes, it is the role of the image to take over.
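By way of illustration, the following minimal Python sketch shows how the clustered 3D points can be brought into the image for this collaboration, assuming a calibrated pinhole camera; the intrinsic matrix K and the Lidar-to-camera extrinsics (R, t) are assumed to come from the (spatial) synchronization step and are not values disclosed above.

```python
import numpy as np

def project_points(points_lidar: np.ndarray, K: np.ndarray,
                   R: np.ndarray, t: np.ndarray) -> np.ndarray:
    """Project the (N, 3) points of one clustered object into pixel coordinates (u, v).

    K is the 3x3 camera intrinsic matrix; (R, t) are the Lidar-to-camera
    rotation and translation. Only points in front of the camera are returned,
    as an (M, 2) integer array.
    """
    pts_cam = points_lidar @ R.T + t        # move the points into the camera frame
    pts_cam = pts_cam[pts_cam[:, 2] > 0.0]  # keep points in front of the image plane
    uvw = pts_cam @ K.T                      # pinhole projection
    return np.round(uvw[:, :2] / uvw[:, 2:3]).astype(int)
```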
For the 3D point clustering step (S3), one can use any of the known solutions, among which (see the sketch after this list):
- the point cloud library (www.pointclouds.org), which is a standalone, large scale, open project for 2D/3D image and point cloud processing;
- “On the Segmentation of 3D LIDAR Point Clouds”, by Douillard et al., presenting, in part III, segmentation algorithms with 3D clustering methods;
- “Shape-based recognition of 3D point clouds in urban environments”, by Golovinskiy et al., presenting a system for recognizing objects in 3D point clouds of urban environments.
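As one illustrative option consistent with the solutions cited above, and not the particular clustering prescribed by the method, a simple Euclidean clustering of the Lidar points can be sketched with DBSCAN; the eps and min_samples values are assumptions, and ground removal is left out.

```python
import numpy as np
from sklearn.cluster import DBSCAN

def cluster_objects(points: np.ndarray, eps: float = 0.5, min_samples: int = 10):
    """Separate an (N, 3) point cloud into candidate objects by Euclidean proximity.

    Returns one (M_i, 3) array per cluster; DBSCAN noise points (label -1) are dropped.
    Ground points would normally be removed beforehand (not shown).
    """
    labels = DBSCAN(eps=eps, min_samples=min_samples).fit_predict(points)
    return [points[labels == k] for k in np.unique(labels) if k != -1]
```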
For the silhouette extraction step (S4), one can use several existing solutions, among which:
- “Shape feature encoding via Fisher Vector for efficient fall detection in depth-videos”, by Adrian et al., presenting the use of Fisher Vectors for feature extraction.
Figure 2A represents a preferred embodiment for the image segmentation step. Such a segmentation step comprises the sub-steps of:
S411: graph-cutting each separated object from the 3D point clustering step by
S4111: projecting on the image all 3D points from the 3D point clustering step corresponding to the separated object under consideration;
S4112: assessing the projected 3D points as belonging either to the separated object under consideration, or to a background;
S4113: assessing each pixel of the image as belonging either to the separated object under consideration, to a background, or to an unknown state, using a pixel weight based on color difference and/or distance between two neighboring pixels;
S4114: adjusting the pixel weight for each pixel belonging to the unknown state based on its distance to the pixels belonging to the separated objects and its distance to the pixels belonging to the background; and
S412: outputting for each separated object, a black and white mask of pixels representative of the background and of the separated object under consideration in the form of one or several blobs.
The most important points in the use of this graph-cut technology are its efficiency in extracting complex shapes, its genericity (for any objects, but also in other contexts such as robotics), and its rapidity, due to a limited number of uncertain pixels. More specifically, this section describes the algorithm for extracting the silhouette of an object within an image. We start from the clustering of the objects from the 3D point clouds. The extraction is done using graph-cuts.
A cut is a segmentation of the image assigning each pixel either to the foreground (i.e. the object under consideration) or to the background (i.e. anything other than the object under consideration). To perform the segmentation, the edges of the graph (classically the n-links, i.e. the segments between pixels) are given weights according to the affinity of a pixel for a given label, or for neighboring pixels to be given the same label.
The specificity of the graph-cut here is to add some information from the 3D points. Concretely, a weight is added to each segment. For that purpose, all the 3D points of an object are selected (thanks to the clustering step). Then these 3D points are converted into a connex (connected) 2D area by projecting the 3D points onto the image.
As shown in Figure 4, the idea is to use 3 sets of pixels: the set of foreground pixels (i.e. known as belonging to the object - in green below), the set of background pixels (i.e. known as belonging to the background - in red), and the unknown set (i.e. pixels that can be either foreground or background - in yellow). We start from the 3D points ordered in lines. Basically, each line of 3D points constitutes a line in the image (in green), which we surround by a margin or, where the 3D points explicitly belong to another object, by red pixels. To make the yellow pixels, we interpolate between the found extremities, and we extrapolate on the top and on the bottom of the object. Overall, we call these 2D pixels (the 3 categories) the preselected pixels. To deal with synchronization issues, the red and green sets are slightly shrunk (minored).
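By way of illustration, a simplified construction of these three sets of preselected pixels can be sketched as follows, assuming OpenCV and pixel coordinates obtained, for instance, with the project_points helper sketched earlier; the fixed margin, the dilation and the label encoding are simplifying assumptions standing in for the per-line margins and the top/bottom extrapolation described above.

```python
import cv2
import numpy as np

FG, BG, UNKNOWN = 1, 0, 2   # "green", "red" and "yellow" seed labels (illustrative encoding)

def build_seed_mask(shape, uv_object, uv_others, margin=7):
    """Label each pixel as foreground, background or unknown from projected 3D points.

    shape: (H, W) of the image; uv_object: (N, 2) pixels of the object's projected points;
    uv_others: (M, 2) pixels of points known to belong to other objects.
    All pixel coordinates are assumed to already fall inside the image.
    """
    h, w = shape
    mask = np.full((h, w), UNKNOWN, dtype=np.uint8)
    # Certain foreground: the projected points of the object, dilated into a thin band.
    fg = np.zeros((h, w), dtype=np.uint8)
    fg[uv_object[:, 1], uv_object[:, 0]] = 1
    fg = cv2.dilate(fg, np.ones((3, 3), np.uint8))
    # Certain background: points of other objects, plus everything far outside the object.
    bg = np.zeros((h, w), dtype=np.uint8)
    if len(uv_others):
        bg[uv_others[:, 1], uv_others[:, 0]] = 1
    x0, y0 = np.maximum(uv_object.min(axis=0) - margin, 0)
    x1, y1 = uv_object.max(axis=0) + margin
    outside = np.ones((h, w), dtype=np.uint8)
    outside[y0:y1 + 1, x0:x1 + 1] = 0
    mask[(bg == 1) | (outside == 1)] = BG
    mask[fg == 1] = FG
    return mask
```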
Hereafter we refine the description of our weight model. The graph-cut separates 2 models, the background and the foreground. Classically, each pixel is a vertex of the graph. Neighboring pixels are linked by n-links, weighted by a distance on their respective colors; most classical distances can be used. Then, more importantly, each pixel is linked to the two terminal vertices by t-links. The weight of each t-link is composed of a term from the distance between the color of the pixel and a color model of the foreground, respectively of the background (classically a Gaussian Mixture Model - GMM - of each respective model), and a term from the distance to the closest pixel belonging to the foreground (the green pixels), respectively to the closest pixel belonging to the background (the red pixels). For the 1st terminal we directly take this distance (pixel, foreground) as a weight, whereas for the 2nd terminal we take a distance to the background (pixel, background). We can avoid computing the distance to the background by taking the inverse of the distance to the foreground instead, or the reverse. As a refinement for this 2nd term, the distance along each image direction (horizontal and vertical axis) is weighted by a factor resulting from the 3D point distribution. For instance for Lidar, the horizontal distribution is much denser than the vertical one; therefore the vertical distance is given less weight (minored) compared to the horizontal one.
Then the graph-cut is computed by a max-flow/min-cut algorithm.
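As an illustrative stand-in for this computation, the three-label seed mask sketched above can be passed to OpenCV's mask-initialised GrabCut, which is a GMM-based graph-cut solved by max-flow/min-cut; the anisotropic distance terms towards the green and red pixels described above are not reproduced here and would require building the t-links explicitly with a dedicated max-flow solver (e.g. the Boykov-Kolmogorov algorithm).

```python
import cv2
import numpy as np

FG, BG, UNKNOWN = 1, 0, 2   # same seed encoding as in build_seed_mask above

def graph_cut_object(image_bgr: np.ndarray, seed_mask: np.ndarray) -> np.ndarray:
    """Black-and-white mask of one object obtained by graph-cut from a 3-label seed mask.

    image_bgr is an 8-bit color image; seed_mask has the same height and width.
    """
    gc_mask = np.full(seed_mask.shape, cv2.GC_PR_BGD, dtype=np.uint8)
    gc_mask[seed_mask == UNKNOWN] = cv2.GC_PR_FGD   # yellow pixels: decided by the cut
    gc_mask[seed_mask == FG] = cv2.GC_FGD           # green pixels: certain foreground
    gc_mask[seed_mask == BG] = cv2.GC_BGD           # red pixels: certain background
    bgd_model = np.zeros((1, 65), np.float64)        # GMM parameters, filled in by OpenCV
    fgd_model = np.zeros((1, 65), np.float64)
    cv2.grabCut(image_bgr, gc_mask, None, bgd_model, fgd_model, 5, cv2.GC_INIT_WITH_MASK)
    obj = (gc_mask == cv2.GC_FGD) | (gc_mask == cv2.GC_PR_FGD)
    return obj.astype(np.uint8) * 255                # full resolution, white on the object
```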
The output of the graph-cut is a black and white mask, i.e. an image in full resolution, all black except for the pixels presumably assessed as belonging to the object. The mask has no reason to be connex (i.e. made of a single blob); indeed, in many situations a shape is made of several blobs. Now we have to turn this mask into a contour representation.
Figure 2B represents a preferred embodiment for the contour detection step. The contour detection step comprises the sub-steps of:
S421: assessing for each blob a distance based on the 3D point clustering of the corresponding separated object;
S422: combining all the blobs in a single image by drawing them from furthest to closest and assigning them a different label for further identification, resulting in a superimposed blobs image;
S423: extracting the contour from the superimposed blobs image corresponding to separated objects;
S424: determining fake contour portions for each pixel of the contour assessed with a distance belonging to a closer blob.
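By way of illustration, these sub-steps (S421 to S424) can be sketched as follows, assuming OpenCV 4 and one mask per object from the graph-cut above; taking the median cluster range as the blob distance and testing the 4-neighbours of each contour pixel are illustrative assumptions.

```python
import cv2
import numpy as np

def detect_contours_and_occlusions(masks, depths):
    """masks: one black-and-white mask per separated object (from the graph-cut);
    depths: matching list of blob distances taken from the 3D clustering (e.g. median range).

    Returns, per blob, its contour pixels and a per-pixel flag marking fake
    (occlusion) contour portions, i.e. borders shared with a closer blob.
    """
    h, w = masks[0].shape
    label_img = np.zeros((h, w), dtype=np.int32)              # 0 means background
    depth_img = np.full((h, w), np.inf)
    for obj_id in np.argsort(depths)[::-1]:                   # draw from furthest to closest
        on = masks[obj_id] > 0
        label_img[on] = obj_id + 1
        depth_img[on] = depths[obj_id]
    results = []
    for obj_id, depth in enumerate(depths):
        contours, _ = cv2.findContours((label_img == obj_id + 1).astype(np.uint8),
                                       cv2.RETR_LIST, cv2.CHAIN_APPROX_NONE)
        for cnt in contours:
            pts = cnt[:, 0, :]                                # (N, 2) contour pixels as (x, y)
            fake = np.zeros(len(pts), dtype=bool)
            for i, (x, y) in enumerate(pts):
                # If a 4-neighbour outside this blob belongs to a closer blob,
                # the corresponding contour portion is an occlusion (fake) border.
                for dx, dy in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                    nx, ny = x + dx, y + dy
                    if (0 <= nx < w and 0 <= ny < h
                            and label_img[ny, nx] not in (0, obj_id + 1)
                            and depth_img[ny, nx] < depth):
                        fake[i] = True
                        break
            results.append((obj_id, pts, fake))
    return results
```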
Indeed, it is really important to distinguish fake contours from real ones. Fake contours are artificial ones due to occlusions. In our method, this task becomes simple because each blob is assessed a distance based on the 3D points it holds. When extracting the contour, we also run along the exterior of the contour; when the exterior pixel is assessed as belonging to a closer blob, the corresponding frontier is marked as fake.
Figure 3 represents a vehicle 100 equipped with at least one camera 200 pointing at the road ahead or at the environment of the vehicle to take a video or a continuous series of images, and with a 360° scanning unit 210, such as a laser light scanning unit (LIDAR), to take a continuous series of 3D point clouds. The vehicle 100 also comprises a processing unit and an electronic control unit (300), a display unit and an autonomous driving unit (400, 410).
The electronic control unit 300 is connected to the autonomous driving unit, which comprises a steering unit 400 arranged to steer the vehicle and a movement control unit 410 comprising a power unit arranged to maintain or increase the vehicle speed and a braking unit arranged to stop the vehicle or to decrease the vehicle speed, so that the vehicle 100 may be driven with the method according to the present invention.
It will be understood that various modifications and/or improvements evident to those skilled in the art can be brought to the different embodiments of the invention described in the present description without departing from the scope of the invention defined by the accompanying claims.

Claims

CLAIMS
1. An object recognition method comprising the steps of:
- obtaining an image from an image sensor and a 3D point cloud from a depth sensor (S1);
- synchronizing the image and the 3D point cloud (S2);
- clustering 3D points to separate objects from the 3D point cloud (S3);
- extracting silhouettes (S4) by
o segmentation of the image using the 3D point clustering (S41), and
o contour detection of the separated objects into the segmented image (S42);
- recognizing silhouette (S5) by
o transforming each detected contour into a silhouette descriptor (S51), and
o classifying these silhouette descriptors into recognized objects using a trained neural network for object recognition (S52).
2. The object recognition method according to claim 1, wherein the image is made of a plurality of pixels and wherein the segmentation of the image step comprises the sub-steps of:
- graph-cutting (S411) each separated object from the 3D point clustering step by
o projecting on the image all 3D points from the 3D point clustering step corresponding to the separated object under consideration (S4111);
o assessing the projected 3D points as belonging either to the separated object under consideration, or to a background (S4112);
o assessing each pixel of the image as belonging either to the separated object under consideration, to a background, or to an unknown state, using a pixel weight based on color difference and/or distance between two neighboring pixels (S4113);
o adjusting the pixel weight for each pixel belonging to the unknown state based on its distance to the pixels belonging to the separated objects and its distance to the pixels belonging to the background (S4114);
- outputting for each separated object, a black and white mask of pixels representative of the background and of the separated object under consideration (S412) in the form of one or several blobs.
3. The object recognition method according to claim 2, wherein the contour detection step comprises the sub-steps of:
- assessing for each blob a distance based on the 3D point clustering of the corresponding separated object (S421);
- combining all the blobs in a single image by drawing them from furthest to closest and assigning them a different label for further identification, resulting in a superimposed blobs image (S422);
- extracting the contour from the superimposed blobs image corresponding to separated objects (S423);
- determining fake contour portions for each pixel of the contour assessed with a distance belonging to a closer blob (S424).
4. The object recognition method according to any of claims 1 to 3, wherein the silhouette descriptor is a 1D descriptor using a constant description length.
5. The object recognition method according to claim 4, wherein the silhouette descriptor has a reduced length.
6. The object recognition method according to any of claims 1 to 5, further comprising the step of:
- combining the object recognition neural network with at least another trained neural network for object prediction within the image so as to form an end-to-end neural network for object recognition and prediction.
7. An assisted or autonomous vehicle (100) comprising:
- an image sensor unit (200) arranged to capture an image;
- a depth sensor (210) arranged to obtain a 3D point cloud;
- a processing unit (300) arranged:
o to synchronize the image and the 3D point cloud;
o to recognize objects within the image according to the object recognition method of any of claims 1 to 6;
- a control unit arranged to control the vehicle (100) based on the recognized objects.
8. The assisted or autonomous vehicle (100) according to claim 7, further comprising:
- a display unit arranged to display an information related to the recognized objects; and / or
- an assisted or autonomous driving unit (400, 410) arranged to plan a safe path depending on recognized objects; and
wherein the control unit is arranged to activate at least one of the display unit and the assisted or autonomous driving unit.
PCT/EP2018/084035 2017-12-07 2018-12-07 Using silhouette for fast object recognition WO2019110824A1 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
JP2020528326A JP7317009B2 (en) 2017-12-07 2018-12-07 Using Silhouettes for Fast Object Recognition

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
FR1761804 2017-12-07
FR1761804A FR3074941B1 (en) 2017-12-07 2017-12-07 USE OF SILHOUETTES FOR FAST RECOGNITION OF OBJECTS

Publications (1)

Publication Number Publication Date
WO2019110824A1 true WO2019110824A1 (en) 2019-06-13

Family

ID=61655891

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/EP2018/084035 WO2019110824A1 (en) 2017-12-07 2018-12-07 Using silhouette for fast object recognition

Country Status (3)

Country Link
JP (1) JP7317009B2 (en)
FR (1) FR3074941B1 (en)
WO (1) WO2019110824A1 (en)

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110570442A (en) * 2019-09-19 2019-12-13 厦门市美亚柏科信息股份有限公司 Contour detection method under complex background, terminal device and storage medium
CN111008560A (en) * 2019-10-31 2020-04-14 重庆小雨点小额贷款有限公司 Livestock weight determination method, device, terminal and computer storage medium
CN111198563A (en) * 2019-12-30 2020-05-26 广东省智能制造研究所 Terrain recognition method and system for dynamic motion of foot type robot
CN112445215A (en) * 2019-08-29 2021-03-05 阿里巴巴集团控股有限公司 Automatic guided vehicle driving control method, device and computer system
CN112561836A (en) * 2019-09-25 2021-03-26 北京地平线机器人技术研发有限公司 Method and device for acquiring point cloud set of target object
CN112975957A (en) * 2021-02-07 2021-06-18 深圳市广宁股份有限公司 Target extraction method, system, robot and storage medium

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9036902B2 (en) * 2007-01-29 2015-05-19 Intellivision Technologies Corporation Detector for chemical, biological and/or radiological attacks
JP2017129543A (en) * 2016-01-22 2017-07-27 京セラ株式会社 Stereo camera device and vehicle

Non-Patent Citations (7)

* Cited by examiner, † Cited by third party
Title
CHUNZHAO GUO ET AL: "Hierarchical road understanding for intelligent vehicles based on sensor fusion", INTELLIGENT TRANSPORTATION SYSTEMS (ITSC), 2011 14TH INTERNATIONAL IEEE CONFERENCE ON, IEEE, 5 October 2011 (2011-10-05), pages 1672 - 1679, XP032023391, ISBN: 978-1-4577-2198-4, DOI: 10.1109/ITSC.2011.6082996 *
DARAEI M HOSSEIN ET AL: "Region segmentation using LiDAR and camera", 2017 IEEE 20TH INTERNATIONAL CONFERENCE ON INTELLIGENT TRANSPORTATION SYSTEMS (ITSC), IEEE, 16 October 2017 (2017-10-16), pages 1 - 6, XP033330515, DOI: 10.1109/ITSC.2017.8317861 *
FREDRIK LARSSON ET AL: "Using Fourier Descriptors and Spatial Models for Traffic Sign Recognition - Poster", IMAGE ANALYSIS, 1 January 2011 (2011-01-01), pages 238, XP055496942 *
JELMER DE VRIES: "Object Recognition: A Shape-Based Approach using Artificial Neural Networks", MASTER THESIS UNIV. UTRECHT, 1 January 2006 (2006-01-01), XP055497147, Retrieved from the Internet <URL:http://www.ai.rug.nl/~mwiering/ObjectRecognition.pdf> [retrieved on 20180803] *
KESER TOMISLAV ET AL: "Traffic signs shape recognition based on contour descriptor analysis", 2016 INTERNATIONAL CONFERENCE ON SMART SYSTEMS AND TECHNOLOGIES (SST), IEEE, 12 October 2016 (2016-10-12), pages 199 - 204, XP033016417, ISBN: 978-1-5090-3718-6, [retrieved on 20161202], DOI: 10.1109/SST.2016.7765659 *
LARSSON FREDRIK ET AL: "Using Fourier Descriptors and Spatial Models for Traffic Sign Recognition", 2011, MEDICAL IMAGE COMPUTING AND COMPUTER-ASSISTED INTERVENTION - MICCAI 2015 : 18TH INTERNATIONAL CONFERENCE, MUNICH, GERMANY, OCTOBER 5-9, 2015; PROCEEDINGS; [LECTURE NOTES IN COMPUTER SCIENCE; LECT.NOTES COMPUTER], SPRINGER INTERNATIONAL PUBLISHING, CH, ISBN: 978-3-642-16065-3, ISSN: 0302-9743, XP047469530 *
WENQI HUANG ET AL: "Fusion Based Holistic Road Scene Understanding", 29 June 2014 (2014-06-29), XP055496875, Retrieved from the Internet <URL:https://arxiv.org/pdf/1406.7525.pdf> *

Cited By (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112445215A (en) * 2019-08-29 2021-03-05 阿里巴巴集团控股有限公司 Automatic guided vehicle driving control method, device and computer system
CN110570442A (en) * 2019-09-19 2019-12-13 厦门市美亚柏科信息股份有限公司 Contour detection method under complex background, terminal device and storage medium
CN112561836A (en) * 2019-09-25 2021-03-26 北京地平线机器人技术研发有限公司 Method and device for acquiring point cloud set of target object
CN112561836B (en) * 2019-09-25 2024-04-16 北京地平线机器人技术研发有限公司 Method and device for acquiring point cloud set of target object
CN111008560A (en) * 2019-10-31 2020-04-14 重庆小雨点小额贷款有限公司 Livestock weight determination method, device, terminal and computer storage medium
CN111198563A (en) * 2019-12-30 2020-05-26 广东省智能制造研究所 Terrain recognition method and system for dynamic motion of foot type robot
CN111198563B (en) * 2019-12-30 2022-07-29 广东省智能制造研究所 Terrain identification method and system for dynamic motion of foot type robot
CN112975957A (en) * 2021-02-07 2021-06-18 深圳市广宁股份有限公司 Target extraction method, system, robot and storage medium

Also Published As

Publication number Publication date
JP7317009B2 (en) 2023-07-28
JP2021511556A (en) 2021-05-06
FR3074941A1 (en) 2019-06-14
FR3074941B1 (en) 2021-01-15

Similar Documents

Publication Publication Date Title
WO2019110824A1 (en) Using silhouette for fast object recognition
Wang et al. Appearance-based brake-lights recognition using deep learning and vehicle detection
Zhao et al. Stereo-and neural network-based pedestrian detection
Yahiaoui et al. Fisheyemodnet: Moving object detection on surround-view cameras for autonomous driving
EP3627446B1 (en) System, method and medium for generating a geometric model
Yoneyama et al. Robust vehicle and traffic information extraction for highway surveillance
Jebamikyous et al. Autonomous vehicles perception (avp) using deep learning: Modeling, assessment, and challenges
US11024042B2 (en) Moving object detection apparatus and moving object detection method
Deepika et al. Obstacle classification and detection for vision based navigation for autonomous driving
Premachandra et al. Detection and tracking of moving objects at road intersections using a 360-degree camera for driver assistance and automated driving
Elhousni et al. Automatic building and labeling of hd maps with deep learning
JP2019053625A (en) Moving object detection device, and moving object detection method
Poostchi et al. Semantic depth map fusion for moving vehicle detection in aerial video
US9558410B2 (en) Road environment recognizing apparatus
Rashed et al. Bev-modnet: Monocular camera based bird's eye view moving object detection for autonomous driving
Huu et al. Proposing Lane and Obstacle Detection Algorithm Using YOLO to Control Self‐Driving Cars on Advanced Networks
Sirbu et al. Real-time line matching based speed bump detection algorithm
Kazerouni et al. An intelligent modular real-time vision-based system for environment perception
Zhang et al. Night time vehicle detection and tracking by fusing sensor cues from autonomous vehicles
Omar et al. Detection and localization of traffic lights using YOLOv3 and Stereo Vision
US20230394680A1 (en) Method for determining a motion model of an object in the surroundings of a motor vehicle, computer program product, computer-readable storage medium, as well as assistance system
Iftikhar et al. Traffic Light Detection: A cost effective approach
Choe et al. HazardNet: Road Debris Detection by Augmentation of Synthetic Models
Silar et al. Objects Detection and Tracking on the Level Crossing
Lee et al. Dense disparity map-based pedestrian detection for intelligent vehicle

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 18814607

Country of ref document: EP

Kind code of ref document: A1

ENP Entry into the national phase

Ref document number: 2020528326

Country of ref document: JP

Kind code of ref document: A

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 18814607

Country of ref document: EP

Kind code of ref document: A1