EP3555802A1 - Object recognition system based on an adaptive 3d generic model - Google Patents

Object recognition system based on an adaptive 3d generic model

Info

Publication number
EP3555802A1
Authority
EP
European Patent Office
Prior art keywords
objects
generic
model
images
class
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Withdrawn
Application number
EP17811644.8A
Other languages
German (de)
French (fr)
Inventor
Loïc LECERF
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Marelli Europe SpA
Original Assignee
Magneti Marelli SpA
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Magneti Marelli SpA
Publication of EP3555802A1
Legal status: Withdrawn

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 Scenes; Scene-specific elements
    • G06V20/60 Type of objects
    • G06V20/64 Three-dimensional objects
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G06F18/2148 Generating training patterns; Bootstrap methods, e.g. bagging or boosting characterised by the process organisation or structure, e.g. boosting cascade
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/28 Determining representative reference patterns, e.g. by averaging or distorting; Generating dictionaries
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00 Machine learning
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T15/00 3D [Three Dimensional] image rendering
    • G06T15/04 Texture mapping
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T17/00 Three dimensional [3D] modelling, e.g. data description of 3D objects
    • G06T17/20 Finite element generation, e.g. wire-frame surface description, tesselation
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77 Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/772 Determining representative reference patterns, e.g. averaging or distorting patterns; Generating dictionaries

Definitions

  • the invention relates to mobile object recognition systems, including systems based on machine learning.
  • the isolation and tracking of a moving object in a sequence of images can be performed by relatively unsophisticated generic algorithms based, for example, on background subtraction.
  • it is more difficult to classify the objects thus isolated into the categories one wishes to detect, that is to say, to recognize whether the object is a person, a car, a bicycle, an animal, etc.
  • the objects can have a great variety of morphologies in the images of the sequence (position, size, orientation, distortion, texture, configuration of possible appendages and articulated elements, etc.).
  • the morphologies also depend on the viewing angle and the lens of the camera filming the scene to be monitored. Sometimes one also wishes to recognize subclasses (car model, gender of a person).
  • to classify and detect objects, a machine learning system is generally used. The classification is then based on a knowledge base or a data set developed by learning.
  • An initial data set is generally generated during a so-called supervised learning phase, where an operator views sequences of images produced in context and manually annotates the image areas corresponding to the objects to be recognized. This phase is generally long and tedious, because one seeks ideally to capture all the possible variants of the morphology of the objects of the class, at least enough variants to obtain a satisfactory recognition rate.
  • a characteristic of these techniques is that they generate many synthesized images that, although they conform to the parameters and constraints of the 3D models, have improbable morphologies. This clutters the dataset with unnecessary images and can slow recognition.
  • a method of automatically configuring a recognition system for a class of objects of variable morphology, comprising the steps of: providing a machine learning system with an initial data set sufficient to recognize instances of objects of the class in a sequence of images of a target scene; providing a generic three-dimensional model specific to the class of objects, the morphology of which can be defined by a set of parameters; acquiring a sequence of images of the scene using a camera; recognizing image instances of objects of the class in the acquired image sequence using the initial data set; conforming the generic three-dimensional model to recognized image instances; recording ranges of variation of the parameters resulting from the conformations of the generic model; synthesizing multiple three-dimensional objects from the generic model by varying the parameters within the recorded ranges of variation; and completing the data set of the learning system with projections of the synthesized objects in the plane of the images.
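The core of the loop above, recording per-parameter ranges of variation from conformed models and synthesizing new objects only within those ranges, can be sketched as follows; the function names, the dict representation of model parameters, and the uniform sampling are illustrative assumptions, not the patent's implementation:

```python
import random

def record_ranges(conformed_params):
    """For each parameter, record the range of variation observed across
    all conformations of the generic model (one dict per recognized instance)."""
    keys = conformed_params[0].keys()
    return {k: (min(p[k] for p in conformed_params),
                max(p[k] for p in conformed_params)) for k in keys}

def synthesize(ranges, n, rng=random):
    """Synthesize n parameter sets by varying each parameter only inside
    its recorded range, avoiding improbable morphologies."""
    return [{k: rng.uniform(lo, hi) for k, (lo, hi) in ranges.items()}
            for _ in range(n)]

# Example: lengths and widths (metres) measured from three conformed models.
models = [{"length": 3.9, "width": 1.7},
          {"length": 4.6, "width": 1.8},
          {"length": 4.2, "width": 1.75}]
ranges = record_ranges(models)
synthetic = synthesize(ranges, 100)
```

Projections of such synthesized parameter sets into the image plane would then be added, self-annotated, to the data set.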
  • the method may comprise the following steps: defining parameters of the generic three-dimensional model by the relative positions of landmarks of a mesh of the model, the positions of the other nodes of the mesh being bound to the landmarks by constraints; and performing conformations of the generic three-dimensional model by positioning landmarks of a projection of the model in the plane of the images.
  • the method may further include the steps of: recording textures from areas of the recognized image instances; and mapping onto each synthesized object a texture among the recorded textures.
  • the initial data set of the learning system can be obtained by supervised learning involving at least two objects of the class whose morphologies are at opposite ends of a domain of observed variation of the morphologies.
  • Figure 1 shows a schematic three-dimensional generic model of an object, projected in different positions of a scene seen by a camera
  • FIG. 2 schematically illustrates a configuration phase of a machine learning system for recognizing objects according to the generic model of FIG. 1.
  • Figure 1 illustrates a simplified generic model of an object, for example a car, projected onto an image in different positions of an example of a scene monitored by a fixed camera.
  • the scene here is, for simplicity's sake, a street crossing the field of view of the camera horizontally.
  • in the background, the model is projected in three aligned positions: in the center and near the left and right edges of the image. In the foreground, the model is projected in a position slightly to the left. All these projections come from the same model in terms of dimensions and show the morphological variations of the projections in the image as a function of position in the scene. In a more complex scene, for example a curved street, variations of morphology depending on the orientation of the model would also be seen.
  • the variations of morphology as a function of the position are defined by the projection of the plane on which the objects evolve, here the street.
  • the projection of the plane of evolution is defined by equations that depend on the characteristics of the camera (angle of view, focal length and distortion of the lens). Edges perpendicular to the camera axis change size homothetically as a function of distance from the camera, and edges parallel to the camera axis follow vanishing lines.
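The homothetic size change of an edge perpendicular to the camera axis follows directly from the pinhole projection model; a minimal sketch, where the focal length and object sizes are arbitrary illustrative values:

```python
def projected_size(real_size_m, distance_m, focal_px):
    """Pinhole model: an edge perpendicular to the camera axis projects
    to a pixel size inversely proportional to its distance from the
    camera (homothetic scaling)."""
    return focal_px * real_size_m / distance_m

h10 = projected_size(1.5, 10.0, 800.0)  # a 1.5 m high car at 10 m
h20 = projected_size(1.5, 20.0, 800.0)  # the same car at 20 m
```

Doubling the distance halves the projected size; the lens distortion mentioned above would add a further, non-linear correction.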
  • the projections of the same object in the image at different positions or orientations have a variable morphology, even if the real object has a fixed morphology.
  • real objects can also have a variable morphology, whether from one object to another (between two cars of different models), or during the movement of the same object (pedestrian).
  • Learning systems are well suited to this situation when they have been configured with enough data to represent the range of most likely projected morphologies.
  • the envisaged three-dimensional generic model, for example of the Point Distribution Model (PDM) type, may comprise a mesh of nodes linked to each other by constraints, that is to say parameters that establish the relative displacements between adjacent nodes or the deformations of the mesh caused by the displacements of certain nodes, known as landmarks.
  • the landmarks are chosen so that their displacements make it possible to reach all the desired morphologies of the model within the defined constraints.
  • a simplified generic car model may include, for the bodywork, a mesh of 16 nodes defining 10 rectangular surfaces and having 10 landmarks.
  • eight landmarks K0 to K7 define one of the side faces of the car, and the two remaining landmarks K8 and K9, on the other side face, define the width of the car.
  • a single landmark would be enough to define the width of the car, but the presence of two or more landmarks makes it possible to conform the model to a projection of a real object while taking into account the deformations of the projection.
  • the wheels are a characteristic element of a car and can be assigned a specific set of landmarks, not shown, defining the wheelbase, the diameter and the points of contact with the road.
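A minimal data structure for such a landmark-driven mesh might look as follows; the coordinates and the mirror constraint are hypothetical simplifications (the patent's two landmarks K8 and K9 on the far side face correspond here to two of the mirrored nodes):

```python
# Side-face landmarks K0..K7 of a box-like car profile, as (x, y, z)
# in metres, with the near side face lying in the plane y = 0.
side = {
    "K0": (0.0, 0.0, 0.0), "K1": (0.0, 0.0, 0.7), "K2": (1.2, 0.0, 0.9),
    "K3": (1.8, 0.0, 1.4), "K4": (3.3, 0.0, 1.4), "K5": (4.1, 0.0, 0.9),
    "K6": (4.3, 0.0, 0.7), "K7": (4.3, 0.0, 0.0),
}

def build_mesh(side_landmarks, width):
    """Derive the full 16-node mesh: the non-landmark nodes of the far
    side face are bound to the landmarks by a mirror constraint, i.e. a
    translation by the car's width along y."""
    mesh = dict(side_landmarks)
    for name, (x, y, z) in side_landmarks.items():
        mesh[name + "'"] = (x, y + width, z)
    return mesh

mesh = build_mesh(side, width=1.75)
```

Moving a single landmark (say K6, toward the rear) then deforms the model while the constrained nodes follow.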
  • the illustrated generic 3D model is simplistic, to clarify the presentation.
  • in practice, the model will include a finer mesh to define edges and curved surfaces.
  • FIG. 2 schematically illustrates a configuration phase of a machine learning system for recognizing cars according to the generic model of FIG. 1, by way of example.
  • the learning system comprises a data set 10 associated with a camera 12 installed for filming a scene to be monitored, for example that of FIG. 1.
  • the configuration phase can start from an existing dataset, which may be rudimentary and offer only a low recognition rate.
  • this existing dataset may have been produced by quick and uncomplicated supervised learning. The following steps are used to complete the dataset to achieve a satisfactory recognition rate.
  • the recognition system is started and begins recognizing and tracking cars in successive images captured by the camera.
  • an image instance of each recognized car is extracted at 14.
  • the camera typically produces multiple images that each contain an instance of the same car at different positions. One can choose the largest instance, which will have the best resolution for subsequent operations.
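Selecting the best instance of a tracked car can be as simple as keeping the largest bounding box; a sketch, with boxes given as hypothetical (x0, y0, x1, y1) pixel coordinates:

```python
def largest_instance(instances):
    """Among the image instances of one tracked car, keep the one with
    the largest bounding-box area: it offers the best resolution for
    the subsequent conformation and texture sampling."""
    return max(instances, key=lambda b: (b[2] - b[0]) * (b[3] - b[1]))

# The same car tracked over three frames.
track = [(10, 40, 50, 60), (100, 30, 180, 80), (200, 45, 230, 62)]
best = largest_instance(track)
```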
  • the generic 3D model is conformed to each image instance thus extracted.
  • this can be done by a conventional conformation ("fitting") algorithm which seeks, for example, the best matches between the image and the landmarks of the model as projected in the plane of the image.
  • algorithms based on landmark detection can be used, as described, for example, in ["One Millisecond Face Alignment with an Ensemble of Regression Trees", Vahid Kazemi et al., IEEE CVPR 2014].
  • it is preferable that other faces of the cars be visible in the instances so that the model can be defined completely.
  • the conformation operations produce 3D models that are meant to be at the scale of the real objects.
  • the conformation operations can use the equations of the projection of the plane of evolution of the object. These equations can be determined manually from the camera characteristics and the configuration of the scene, or estimated by the system in a calibration phase using, if necessary, adapted tools such as depth cameras. Knowing that objects evolve on a plane, the equations can be deduced from the variation of the size according to the position of the instances of a tracked object.
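As an illustration of the last point, if one assumes (as a simplification) that the apparent size of instances grows linearly with the image row of their base, the law can be fitted from a tracked object by closed-form least squares; the data below are synthetic:

```python
def fit_line(xs, ys):
    """Closed-form least squares for s = a*x + b, used here to estimate
    how apparent size varies with image row for objects moving on the
    ground plane."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    a = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
         / sum((x - mx) ** 2 for x in xs))
    return a, my - a * mx

# Image row of each instance's base vs its pixel width, one tracked car.
rows = [120, 200, 280, 360]
sizes = [30.0, 50.0, 70.0, 90.0]
a, b = fit_line(rows, sizes)  # size = a * row + b
```

A real calibration would use the full projection equations rather than a linear fit, but the principle of deducing them from tracked instances is the same.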
  • the result is a 3D model 16 representing the recognized car at scale.
  • the models are illustrated in two dimensions, in correspondence with the extracted lateral faces 14. (Note that the generic 3D model used is conformable to cars as well as vans or even buses, so the system here is rather intended to recognize any four-wheeled vehicle.)
  • during the conformation step, it is also possible to sample the image zones corresponding to the different faces of the car, and to store these image zones in the form of textures at 18. After a certain acquisition period, the system will have collected, without supervision, a multitude of 3D models representing different cars, as well as their textures. If the recognition rate offered by the initial dataset is low, it suffices to extend the acquisition time to reach a collection with a satisfactory number of models.
  • the models are compared with each other at 20 and a range of variation is established for each landmark.
  • an example of a range of variation for landmark K6 is illustrated.
  • the ranges of variation can define relative variations affecting the shape of the model itself, or absolute variations such as the position and orientation of the model.
  • one of the landmarks, for example K0, can serve as an absolute reference. It can be assigned ranges of absolute variation that determine the possible positions and orientations of the car in the image. These ranges are in fact not directly deducible from the recorded models, since a recorded model may come from a single instance chosen among a multitude of instances produced during the movement of a car. The variations of position and orientation can be estimated by deducing them from the multiple instances of a tracked car, without having to carry out a complete conformation of the generic model on each instance.
  • for the landmark diametrically opposed to K0, a range of variation relative to landmark K0 can be established, which determines the length of the car.
  • for landmark K8, a range of variation relative to landmark K0 can be established, which determines the width of the car.
  • the range of variation of each of the other landmarks can be established relative to one of its adjacent landmarks.
  • the variations of the landmarks can be random, incremental, or a combination of both.
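The two variation strategies can be sketched as follows; the grid step count and the jitter amplitude are arbitrary illustrative choices:

```python
import itertools
import random

def incremental(ranges, steps):
    """Sweep each landmark parameter through evenly spaced values inside
    its recorded range (incremental variation)."""
    axes = {k: [lo + (hi - lo) * i / (steps - 1) for i in range(steps)]
            for k, (lo, hi) in ranges.items()}
    keys = sorted(axes)
    return [dict(zip(keys, combo))
            for combo in itertools.product(*(axes[k] for k in keys))]

def jitter(params, ranges, rel=0.05, rng=random):
    """Add a small random perturbation, clamped to the recorded range
    (random variation, combinable with the grid above)."""
    out = {}
    for k, v in params.items():
        lo, hi = ranges[k]
        out[k] = min(hi, max(lo, v + rng.uniform(-rel, rel) * (hi - lo)))
    return out

ranges = {"length": (3.9, 4.6), "width": (1.7, 1.8)}
grid = incremental(ranges, steps=3)         # 3 x 3 = 9 combinations
varied = [jitter(p, ranges) for p in grid]  # combined strategy
```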
  • each synthesized car is projected in the camera image plane to form a self-annotated image instance for completing the data set 10 of the training system.
  • these projections also use the equations of the projection of the plane of evolution of the cars.
  • the same synthesized car can be projected in several different positions and orientations, according to the absolute ranges of variation previously determined. In general, the orientations are correlated with the positions, so the two parameters will not be varied independently, unless abnormal situations are to be detected, such as a car across the road.
  • the dataset completed by this procedure could still have gaps preventing the detection of certain car models. In this case, the automatic configuration phase can be reiterated starting from the completed dataset.
  • this dataset normally offers a recognition rate higher than the initial set, which will lead to the constitution of a more varied collection of models 16, making it possible to refine the parameter variation ranges and to synthesize models 22 that are both more accurate and more varied to feed the dataset 10 again.
  • the initial dataset can be produced by simple and fast supervised learning.
  • an operator views images of the filmed scene and, using a graphical interface, annotates the image areas corresponding to instances of the objects to be recognized. Since the subsequent configuration procedure is based on the morphological variations of the generic model, the operator may wish to annotate the objects exhibiting the largest variations. The operator can thus annotate at least two objects whose morphologies are at opposite ends of a range of variation that he or she has visually observed.
  • the interface can be designed to establish the equations of the projection of the plane of evolution with the assistance of the operator. The interface can then propose that the operator manually conform the generic model to image areas, providing both an annotation and the creation of the first models in the collection 16.
  • this annotation phase is brief and rough, the objective being to obtain a restricted initial dataset allowing the start of the automatic configuration phase that will complete the dataset.

Abstract

The invention relates to a method for automatically configuring a system for recognizing a class of objects of variable morphology, comprising the following steps: providing a machine learning system with an initial data set (10) sufficient to recognize instances of objects of the class in a sequence of images of a target scene; providing a generic three-dimensional model specific to the class of objects, whose morphology is definable by a set of parameters; acquiring a sequence of images of the scene with the aid of a camera (12); recognizing image instances (14) of objects of the class in the acquired sequence of images using the initial data set; mapping the generic three-dimensional model (16) with recognized image instances (14); recording ranges of variation of the parameters (20) resulting from the mappings of the generic model; synthesizing multiple three-dimensional objects (22) on the basis of the generic model by varying the parameters in the recorded ranges of variation; and completing the data set (10) of the learning system by projections of the synthesized objects (24) in the plane of the images.

Description

OBJECT RECOGNITION SYSTEM BASED ON AN ADAPTIVE 3D GENERIC MODEL
Technical field
The invention relates to mobile object recognition systems, and in particular to systems based on machine learning.
Background
The isolation and tracking of a moving object in a sequence of images can be performed by relatively unsophisticated generic algorithms based, for example, on background subtraction. On the other hand, it is more difficult to classify the objects thus isolated into the categories one wishes to detect, that is to say, to recognize whether the object is a person, a car, a bicycle, an animal, etc. Indeed, the objects can have a great variety of morphologies in the images of the sequence (position, size, orientation, distortion, texture, configuration of possible appendages and articulated elements, etc.). The morphologies also depend on the viewing angle and the lens of the camera filming the scene to be monitored. Sometimes one also wishes to recognize subclasses (car model, gender of a person).
To classify and detect objects, a machine learning system is generally used. The classification is then based on a knowledge base or a data set developed by learning. An initial data set is generally generated during a so-called supervised learning phase, where an operator views sequences of images produced in context and manually annotates the image areas corresponding to the objects to be recognized. This phase is generally long and tedious, because one ideally seeks to capture all the possible variants of the morphology of the objects of the class, or at least enough variants to obtain a satisfactory recognition rate.
To lighten this initial supervised learning task, machine learning techniques have been proposed where, rather than feeding the dataset with annotated real images, it is fed with self-annotated synthesized images generated from three-dimensional models of the objects to be recognized. Such a technique is described for configuring a pedestrian detector in the article ["Learning Scene-Specific Pedestrian Detectors without Real Data", Hironori Hattori et al., 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR)]. A similar technique is described for configuring a car detector in the article ["Teaching 3D Geometry to Deformable Part Models", Bojan Pepik et al., 2012 IEEE Conference on Computer Vision and Pattern Recognition (CVPR)].
A characteristic of these techniques is that they generate many synthesized images that, although they conform to the parameters and constraints of the 3D models, have improbable morphologies. This clutters the dataset with unnecessary images and can slow recognition.
In addition, some objects have such a variable morphology that it is difficult to realistically reproduce all possibilities with a 3D model having a manageable number of parameters and constraints. This results in gaps in the data set and failures to detect certain objects.
Summary
There is generally provided a method of automatically configuring a recognition system for a class of objects of variable morphology, comprising the steps of: providing a machine learning system with an initial data set sufficient to recognize instances of objects of the class in a sequence of images of a target scene; providing a generic three-dimensional model specific to the class of objects, the morphology of which can be defined by a set of parameters; acquiring a sequence of images of the scene using a camera; recognizing image instances of objects of the class in the acquired image sequence using the initial data set; conforming the generic three-dimensional model to recognized image instances; recording ranges of variation of the parameters resulting from the conformations of the generic model; synthesizing multiple three-dimensional objects from the generic model by varying the parameters within the recorded ranges of variation; and completing the data set of the learning system with projections of the synthesized objects in the plane of the images.
The method may comprise the following steps: defining parameters of the generic three-dimensional model by the relative positions of landmarks of a mesh of the model, the positions of the other nodes of the mesh being bound to the landmarks by constraints; and performing conformations of the generic three-dimensional model by positioning landmarks of a projection of the model in the plane of the images. The method may further include the steps of: recording textures from areas of the recognized image instances; and mapping onto each synthesized object a texture among the recorded textures.
The initial data set of the learning system can be obtained by supervised learning involving at least two objects of the class whose morphologies are at opposite ends of an observed domain of variation of the morphologies.
Brief description of the drawings
Embodiments will be set forth in the following description, given by way of non-limiting example in relation to the appended figures, among which:
• Figure 1 shows a schematic three-dimensional generic model of an object, projected in different positions of a scene seen by a camera; and
• Figure 2 schematically illustrates a configuration phase of a machine learning system for recognizing objects according to the generic model of Figure 1.
Description of embodiments
To simplify the initial configuration phase of an object detector, it is proposed, as in the aforementioned article by Hironori Hattori, to configure a machine learning system using images synthesized and self-annotated from three-dimensional models. However, to improve the recognition rate, the three-dimensional models are obtained from a parameterizable generic model that has previously been conformed to images of real objects filmed in context.
Figure 1 illustrates a simplified generic model of an object, for example a car, projected onto an image in different positions of an example of a scene monitored by a fixed camera. The scene here is, for simplicity's sake, a street crossing the field of view of the camera horizontally.
À l'arrière-plan, le modèle est projeté en trois positions alignées, au centre et près des bords gauche et droit de l'image. Au premier plan, le modèle est projeté dans une position légèrement à gauche. Toutes ces projections sont issues d'un même modèle du point de vue des dimensions et montrent les variations de morphologie des projections dans l'image en fonction de la position dans la scène. Dans une scène plus complexe, par exemple une rue courbe, on verrait également des variations de morphologie en fonction de l'orientation du modèle. In the background, the model is projected at three aligned positions: in the center and near the left and right edges of the image. In the foreground, the model is projected at a position slightly to the left. All these projections come from the same model as far as dimensions are concerned, and show how the morphology of the projections in the image varies with the position in the scene. In a more complex scene, for example a curved street, morphological variations would also be seen as a function of the orientation of the model.
Les variations de morphologie en fonction de la position sont définies par la projection du plan sur lequel évoluent les objets, ici la rue. La projection du plan d'évolution est définie par des équations qui dépendent des caractéristiques de la caméra (angle de vue, focale et distorsion de l'objectif). Les arêtes perpendiculaires à l'axe de la caméra changent de taille homothétiquement en fonction de la distance à la caméra, et les arêtes parallèles à l'axe de la caméra suivent des lignes de fuite. Il en résulte que, lors d'un déplacement latéral d'une voiture dans la vue illustrée, la face avant de la voiture, initialement visible, est cachée à partir du centre de l'image, tandis que la face arrière, initialement cachée, devient visible à partir du centre de l'image. La face supérieure de la voiture, toujours visible dans cet exemple, se déforme en cisaillement selon des lignes de fuite. The variations of morphology as a function of position are defined by the projection of the plane on which the objects move, here the street. The projection of this plane of evolution is defined by equations that depend on the characteristics of the camera (angle of view, focal length and lens distortion). Edges perpendicular to the camera axis scale homothetically with the distance to the camera, and edges parallel to the camera axis follow vanishing lines. As a result, during a lateral displacement of a car in the illustrated view, the front face of the car, initially visible, is hidden from the center of the image onward, while the rear face, initially hidden, becomes visible from the center of the image onward. The upper face of the car, always visible in this example, deforms in shear along vanishing lines.
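The homothetic scaling described above follows directly from the pinhole projection model. Here is a minimal sketch; the focal length and principal point are illustrative assumptions, not values from the patent:

```python
import numpy as np

def project_point(p_world, f=800.0, cx=320.0, cy=240.0):
    """Project a 3D point (x, y, z), z > 0, through a pinhole camera.

    f is the focal length in pixels and (cx, cy) the principal point;
    both are purely illustrative values.
    """
    x, y, z = p_world
    return np.array([f * x / z + cx, f * y / z + cy])

# An edge perpendicular to the optical axis scales as 1/z: the same
# 1 m lateral offset appears half as wide when seen twice as far away.
near = project_point((1.0, 0.0, 10.0))
far = project_point((1.0, 0.0, 20.0))
```

Doubling the depth halves the pixel offset from the principal point, which is exactly the size variation the projections of Figure 1 exhibit across the scene.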
En résumé, les projections d'un même objet dans l'image à des positions ou orientations différentes ont une morphologie variable, même si l'objet réel a une morphologie fixe. Bien entendu, les objets réels peuvent également avoir une morphologie variable, que ce soit d'un objet à l'autre (entre deux voitures de modèles différents), ou au cours du déplacement d'un même objet (piéton). Les systèmes à apprentissage sont bien adaptés à cette situation lorsqu'ils ont été configurés avec suffisamment de données pour représenter l'éventail de morphologies projetées les plus probables. In summary, the projections of the same object in the image at different positions or orientations have a variable morphology, even if the real object has a fixed morphology. Of course, real objects can also have a variable morphology, whether from one object to another (between two cars of different models), or during the movement of the same object (pedestrian). Learning systems are well suited to this situation when they have been configured with enough data to represent the range of most likely projected morphologies.
Le modèle générique tridimensionnel envisagé, par exemple de type PDM (« Point Distribution Model » ou modèle à distribution de points), peut comporter un maillage de noeuds liés les uns aux autres par des contraintes, c'est-à-dire des paramètres qui établissent les déplacements relatifs entre des noeuds adjacents ou les déformations du maillage que provoquent les déplacements de certains noeuds, dits amers (« landmarks »). Les amers sont choisis pour que leurs déplacements permettent d'atteindre toutes les morphologies souhaitées du modèle à l'intérieur des contraintes définies. The envisaged three-dimensional generic model, for example of the Point Distribution Model (PDM) type, may comprise a mesh of nodes linked to each other by constraints, that is to say parameters that establish the relative displacements between adjacent nodes, or the deformations of the mesh caused by the displacements of certain nodes, known as landmarks. The landmarks are chosen so that their displacements make it possible to reach all the desired morphologies of the model within the defined constraints.
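A PDM-style generic model of this kind can be sketched as follows. This is a simplified illustration under assumed data structures, not the patent's implementation: a set of 3D nodes, a list of landmark indices, and constraints expressed as callables that adjust the remaining nodes after the landmarks move.

```python
import numpy as np

class GenericModel3D:
    """Minimal PDM-like mesh: landmark nodes drive the morphology,
    constraints propagate their displacements to the other nodes."""

    def __init__(self, nodes, landmark_ids, constraints=()):
        self.nodes = np.asarray(nodes, dtype=float)  # (N, 3) node positions
        self.landmark_ids = list(landmark_ids)       # indices of landmark nodes
        self.constraints = list(constraints)         # e.g. keep faces parallel

    def set_landmarks(self, positions):
        """Move the landmarks, then enforce the mesh constraints."""
        for i, p in zip(self.landmark_ids, positions):
            self.nodes[i] = p
        for constrain in self.constraints:
            constrain(self.nodes)

# Toy example: two nodes, node 0 is a landmark and node 1 must mirror
# it across the x = 0 plane (a crude "keep side faces symmetric" rule).
def mirror(nodes):
    nodes[1] = nodes[0] * np.array([-1.0, 1.0, 1.0])

model = GenericModel3D([[1, 0, 0], [-1, 0, 0]], landmark_ids=[0],
                       constraints=[mirror])
model.set_landmarks([[2.0, 1.0, 0.0]])
```

Moving the single landmark here repositions the dependent node through the constraint, which is the mechanism the text describes at full mesh scale.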
Comme le représente la figure 1, dans le premier plan, un modèle générique simplifié de voiture peut comporter, pour la carrosserie, un maillage de 16 noeuds définissant 10 surfaces rectangulaires et comportant 10 amers. Huit amers K0 à K7 définissent l'une des faces latérales de la voiture, et les deux amers restants K8 et K9, situés sur l'autre face latérale, définissent la largeur de la voiture. As shown in Figure 1, in the foreground, a simplified generic car model may comprise, for the bodywork, a mesh of 16 nodes defining 10 rectangular surfaces and including 10 landmarks. Eight landmarks K0 to K7 define one of the side faces of the car, and the two remaining landmarks K8 and K9, located on the other side face, define the width of the car.
Un seul amer suffirait à définir la largeur de la voiture, mais la présence de deux amers ou plus permettra de conformer le modèle à une projection d'un objet réel en tenant compte des déformations de la projection. Les roues sont un élément caractéristique d'une voiture et on peut leur attribuer un jeu d'amers spécifique, non représenté, définissant l'entre-axe, le diamètre et les points de contact avec la route. Diverses contraintes peuvent être affectées à ces amers pour que le modèle soit conforme à la gamme de voitures à détecter, par exemple, maintenir le parallélisme entre les deux faces latérales ; maintenir le parallélisme entre les faces avant et arrière ; maintenir le parallélisme entre les arêtes perpendiculaires aux faces latérales ; assurer que les noeuds K3 et K5 soient au-dessus des noeuds K1 et K6 ; assurer que les noeuds K3 et K4 soient au-dessus des noeuds K2 et K5 ; assurer que les noeuds K2 et K3 soient à droite des noeuds K1 et K2 ; assurer que les noeuds K4 et K5 soient à gauche des noeuds K5 et K6, etc. A single landmark would be enough to define the width of the car, but the presence of two or more landmarks makes it possible to conform the model to a projection of a real object while taking the deformations of the projection into account. The wheels are a characteristic element of a car and can be assigned a specific set of landmarks, not shown, defining the wheelbase, the diameter and the points of contact with the road. Various constraints can be assigned to these landmarks so that the model conforms to the range of cars to be detected, for example: maintain the parallelism between the two side faces; maintain the parallelism between the front and rear faces; maintain the parallelism between the edges perpendicular to the side faces; ensure that nodes K3 and K5 are above nodes K1 and K6; ensure that nodes K3 and K4 are above nodes K2 and K5; ensure that nodes K2 and K3 are to the right of nodes K1 and K2; ensure that nodes K4 and K5 are to the left of nodes K5 and K6, etc.
Comme on l'a précédemment indiqué, le modèle 3D générique illustré est simpliste, cela pour clarifier l'exposé. Dans la pratique, le modèle comprendra un maillage plus fin et permettant de définir des arêtes et des surfaces courbes. As previously indicated, the generic 3D model illustrated is simplistic, for clarity of presentation. In practice, the model will comprise a finer mesh, making it possible to define curved edges and surfaces.
La figure 2 illustre schématiquement une phase de configuration d'un système d'apprentissage machine pour reconnaître des voitures selon le modèle générique de la figure 1, à titre d'exemple. Le système d'apprentissage comprend un jeu de données 10 associé à une caméra 12 installée pour filmer une scène à surveiller, par exemple celle de la figure 1. FIG. 2 schematically illustrates a configuration phase of a machine learning system for recognizing cars according to the generic model of FIG. 1, by way of example. The learning system comprises a data set 10 associated with a camera 12 installed to film a scene to be monitored, for example that of FIG. 1.
La phase de configuration peut démarrer à partir d'un jeu de données 10 existant, qui peut être sommaire et n'offrir qu'un faible taux de reconnaissance. Ce jeu de données existant peut avoir été produit par un apprentissage supervisé rapide et peu contraignant. Les étapes qui suivent servent à compléter le jeu de données pour atteindre un taux de reconnaissance satisfaisant. The configuration phase can start from an existing data set 10, which may be rudimentary and offer only a low recognition rate. This existing data set may have been produced by quick, low-effort supervised learning. The following steps serve to complete the data set so as to reach a satisfactory recognition rate.
Le système de reconnaissance est mis en marche et se met à reconnaître et suivre des voitures dans les images successives capturées par la caméra. Une instance d'image de chaque voiture reconnue est extraite en 14. Pour simplifier l'exposé, seule une face latérale des voitures est illustrée dans les instances - en réalité chaque instance est une projection en perspective dans laquelle d'autres faces sont le plus souvent visibles. La caméra produit généralement plusieurs images qui contiennent chacune une instance de la même voiture, à des positions différentes. On peut choisir l'instance la plus grande, qui aura la meilleure résolution pour les opérations ultérieures. The recognition system is started and begins to recognize and track cars in the successive images captured by the camera. An image instance of each recognized car is extracted at 14. To simplify the presentation, only one side face of the cars is shown in the instances - in reality each instance is a perspective projection in which other faces are most often visible. The camera generally produces several images, each containing an instance of the same car at different positions. The largest instance, which will have the best resolution for subsequent operations, can be chosen.
Ensuite, le modèle 3D générique est conformé à chaque instance d'image ainsi extraite. Cela peut être effectué par un algorithme classique de conformation (« fitting ») qui cherche, par exemple, les meilleures correspondances entre l'image et les amers du modèle tel que projeté dans le plan de l'image. On peut également utiliser des algorithmes basés sur la détection d'amers, comme cela est décrit, par exemple, dans ["One Millisecond Face Alignment with an Ensemble of Regression Trees", Vahid Kazemi et al., IEEE CVPR 2014]. Bien entendu, il est préférable que d'autres faces des voitures soient visibles dans les instances pour que le modèle puisse être défini de manière complète. Then, the generic 3D model is conformed to each image instance thus extracted. This can be done by a conventional fitting algorithm which seeks, for example, the best matches between the image and the landmarks of the model as projected into the image plane. Algorithms based on landmark detection can also be used, as described, for example, in ["One Millisecond Face Alignment with an Ensemble of Regression Trees", Vahid Kazemi et al., IEEE CVPR 2014]. Of course, it is preferable that other faces of the cars be visible in the instances so that the model can be defined completely.
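The fitting step can be pictured, in a drastically reduced form, as finding a transform that best aligns the projected model landmarks with points detected in the image. The sketch below fits only a scale and a translation by least squares; the actual conformation described above adjusts the full 3D model, so this is an assumption-laden stand-in, not the patented method.

```python
import numpy as np

def fit_scale_offset(model_2d, detected_2d):
    """Least-squares scale s and translation t such that
    s * model + t best approximates the detected landmark positions."""
    m = np.asarray(model_2d, dtype=float)
    d = np.asarray(detected_2d, dtype=float)
    mc = m - m.mean(axis=0)                # center both point sets
    dc = d - d.mean(axis=0)
    s = (mc * dc).sum() / (mc * mc).sum()  # optimal common scale
    t = d.mean(axis=0) - s * m.mean(axis=0)
    return s, t

# Projected generic-model landmarks vs. points "detected" in an image
# that is an exact scaled-and-shifted copy of the model.
model_pts = np.array([[0.0, 0.0], [4.0, 0.0], [4.0, 2.0], [0.0, 2.0]])
detected = 2.0 * model_pts + np.array([3.0, 4.0])
s, t = fit_scale_offset(model_pts, detected)
```

On this synthetic input the recovered transform is exact; with real detections the residual of the fit measures how well the generic model conforms to the instance.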
Ces opérations de conformation produisent des modèles 3D censés être à l'échelle des objets réels. Pour cela, les opérations de conformation peuvent utiliser les équations de la projection du plan d'évolution de l'objet. Ces équations peuvent être déterminées manuellement à partir des caractéristiques de la caméra et de la configuration de la scène, ou bien estimées par le système dans une phase d'étalonnage utilisant, le cas échéant, des outils adaptés comme des caméras de profondeur. En sachant que les objets évoluent sur un plan, les équations peuvent être déduites à partir de la variation de la taille en fonction de la position des instances d'un objet suivi. These conformation operations produce 3D models that are meant to be at the scale of the real objects. To that end, the conformation operations can use the equations of the projection of the object's plane of evolution. These equations can be determined manually from the camera characteristics and the configuration of the scene, or estimated by the system in a calibration phase using, where appropriate, suitable tools such as depth cameras. Knowing that the objects move on a plane, the equations can be deduced from the variation of size as a function of the position of the instances of a tracked object.
A l'issue de chaque conformation, on produit en 16 un modèle 3D représentant la voiture reconnue à l'échelle. Les modèles sont illustrés en deux dimensions, en correspondance avec les faces latérales extraites 14. (On remarque que le modèle 3D générique utilisé est conformable aussi bien à des voitures qu'à des fourgonnettes, voire à des autobus. Ainsi, le système est ici plutôt prévu pour reconnaître tout véhicule à quatre roues.) At the end of each conformation, a 3D model representing the recognized car at scale is produced at 16. The models are illustrated in two dimensions, in correspondence with the extracted side faces 14. (Note that the generic 3D model used is conformable to cars as well as to vans or even buses. The system here is thus rather intended to recognize any four-wheeled vehicle.)
Pendant l'étape de conformation, on peut également échantillonner les zones d'image correspondant aux différentes faces de la voiture, et stocker ces zones d'image sous forme de textures en 18. Au bout d'une certaine durée d'acquisition, on aura collectionné sans supervision une multitude de modèles 3D représentant des voitures différentes, ainsi que leurs textures. Si le taux de reconnaissance offert par le jeu de données initial est faible, il suffit de prolonger la durée d'acquisition pour atteindre une collection ayant un nombre satisfaisant de modèles. During the conformation step, the image areas corresponding to the different faces of the car can also be sampled, and these image areas stored as textures at 18. After a certain acquisition period, a multitude of 3D models representing different cars, together with their textures, will have been collected without supervision. If the recognition rate offered by the initial data set is low, it suffices to extend the acquisition period to reach a collection with a satisfactory number of models.
Lorsque la collection de modèles 16 est jugée satisfaisante, on compare les modèles entre eux en 20 et on relève pour chaque amer une plage de variation. On a illustré un exemple de plage de variation pour l'amer K6. When the collection of models 16 is judged satisfactory, the models are compared with each other at 20 and a range of variation is recorded for each landmark. An example of a range of variation is illustrated for landmark K6.
Les plages de variation peuvent définir des variations relatives affectant la forme du modèle lui-même, ou des variations absolues comme la position et l'orientation du modèle. L'un des amers, par exemple K0, peut servir de référence absolue. On peut lui attribuer des plages de variation absolue qui déterminent les positions et orientations possibles de la voiture dans l'image. Ces plages ne sont en fait pas directement déductibles des modèles enregistrés, puisqu'un modèle enregistré peut être issu d'une seule instance choisie sur une multitude d'instances produites au cours du déplacement d'une voiture. On peut estimer les variations de position et d'orientation en les déduisant des multiples instances d'une voiture suivie, sans pour cela devoir effectuer une conformation complète du modèle générique à chaque instance. The ranges of variation can define relative variations affecting the shape of the model itself, or absolute variations such as the position and orientation of the model. One of the landmarks, for example K0, can serve as an absolute reference. It can be assigned absolute variation ranges that determine the possible positions and orientations of the car in the image. These ranges are in fact not directly deducible from the recorded models, since a recorded model may come from a single instance chosen among a multitude of instances produced while a car moves. The variations in position and orientation can be estimated by deducing them from the multiple instances of a tracked car, without having to carry out a complete conformation of the generic model on each instance.
Pour un amer diamétralement opposé à K0, par exemple K6, on pourra établir une plage de variation relativement à l'amer K0, qui détermine la longueur de la voiture. Pour un autre amer diamétralement opposé, par exemple K8, on pourra établir une plage de variation relativement à l'amer K0, qui détermine la largeur de la voiture. La plage de variation de chacun des autres amers peut être établie relativement à l'un de ses amers adjacents. For a landmark diametrically opposite K0, for example K6, a range of variation relative to landmark K0 can be established, which determines the length of the car. For another diametrically opposite landmark, for example K8, a range of variation relative to landmark K0 can be established, which determines the width of the car. The range of variation of each of the other landmarks can be established relative to one of its adjacent landmarks.
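Recording the ranges of variation over the collected models can be sketched as a per-landmark minimum and maximum over the collection. The function name and array layout below are assumptions for illustration:

```python
import numpy as np

def landmark_ranges(models):
    """models: array-like of shape (M, K, 3) — M fitted models, each
    described by K landmark positions in 3D.
    Returns per-landmark lower and upper bounds over the collection."""
    arr = np.asarray(models, dtype=float)
    return arr.min(axis=0), arr.max(axis=0)

# Two fitted models with two landmarks each (say K0 and K6): the second
# landmark's x-coordinate encodes the car's length, 4.0 m vs. 5.2 m.
collection = [
    [[0.0, 0.0, 0.0], [4.0, 0.0, 0.0]],
    [[0.0, 0.0, 0.0], [5.2, 0.0, 0.0]],
]
lo, hi = landmark_ranges(collection)
```

Here the range recorded for the length-defining landmark spans 4.0 to 5.2 m, exactly the kind of per-landmark interval used in the synthesis step.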
Une fois que les plages de variation sont établies, on synthétise en 22 une multitude de voitures 3D virtuelles à partir du modèle générique en faisant varier les amers dans leurs plages de variation respectives. Sur chaque voiture virtuelle on peut en outre plaquer l'une des textures 18. Les variations des amers peuvent être aléatoires, incrémentales, ou une combinaison des deux. Once the ranges of variation are established, a multitude of virtual 3D cars is synthesized at 22 from the generic model by varying the landmarks within their respective ranges of variation. One of the textures 18 can furthermore be mapped onto each virtual car. The variations of the landmarks can be random, incremental, or a combination of both.
En 24, chaque voiture synthétisée est projetée dans le plan de l'image de la caméra pour former une instance d'image auto-annotée servant à compléter le jeu de données 10 du système d'apprentissage. Ces projections utilisent également les équations de la projection du plan d'évolution des voitures. Une même voiture synthétisée peut être projetée en plusieurs positions et orientations différentes, selon les plages de variation absolues précédemment déterminées. En général, les orientations sont corrélées aux positions, de sorte qu'on ne fera pas varier les deux paramètres de façon indépendante, sauf si on souhaite détecter des situations anormales, comme une voiture en travers de la route. At 24, each synthesized car is projected into the camera image plane to form a self-annotated image instance used to complete the data set 10 of the learning system. These projections also use the equations of the projection of the cars' plane of evolution. The same synthesized car can be projected at several different positions and orientations, according to the absolute ranges of variation previously determined. In general, the orientations are correlated with the positions, so the two parameters will not be varied independently, unless one wishes to detect abnormal situations, such as a car lying across the road.
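The synthesis and auto-annotation steps can be sketched together: sample virtual landmark sets within the recorded ranges, project them into the image plane, and take the bounding box of the projection as the annotation. The sampling strategy, projection helper and camera parameters below are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)

def synthesize(lo, hi, n):
    """Draw n virtual landmark sets uniformly within [lo, hi]
    (lo and hi have shape (K, 3), the per-landmark variation bounds)."""
    lo = np.asarray(lo, dtype=float)
    hi = np.asarray(hi, dtype=float)
    return lo + rng.random((n,) + lo.shape) * (hi - lo)

def auto_annotate(points_3d, f=800.0, cx=320.0, cy=240.0):
    """Project 3D points (z > 0) through an assumed pinhole camera and
    return the 2D bounding box: a self-annotated training instance."""
    p = np.asarray(points_3d, dtype=float)
    uv = np.stack([f * p[:, 0] / p[:, 2] + cx,
                   f * p[:, 1] / p[:, 2] + cy], axis=1)
    return uv.min(axis=0), uv.max(axis=0)

# Bounds for two landmarks of a virtual object sitting 10-12 m away.
lo = np.array([[-1.0, 0.0, 10.0], [1.0, 0.0, 10.0]])
hi = np.array([[-1.0, 1.0, 12.0], [1.0, 1.0, 12.0]])
virtual = synthesize(lo, hi, n=5)
box_min, box_max = auto_annotate(virtual[0])
```

Each (bounding box, class) pair produced this way needs no human labeling, which is what makes the completed data set "auto-annotated".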
Si le jeu de données initial était insuffisant, le jeu de données complété par cette procédure pourrait encore comporter des lacunes empêchant la détection de certains modèles de voiture. Dans ce cas, on peut réitérer une phase de configuration automatique en partant du jeu de données complété. Ce jeu de données offre normalement un taux de reconnaissance supérieur au jeu initial, ce qui conduira à la constitution d'une collection de modèles 16 plus variée, permettant d'affiner les plages de variation de paramètres et de synthétiser des modèles 22 à la fois plus précis et variés pour alimenter de nouveau le jeu de données 10. If the initial data set was insufficient, the data set completed by this procedure could still have gaps preventing the detection of certain car models. In that case, an automatic configuration phase can be repeated starting from the completed data set. This data set normally offers a recognition rate higher than the initial one, which will lead to the constitution of a more varied collection of models 16, making it possible to refine the ranges of variation of the parameters and to synthesize models 22 that are both more accurate and more varied, to feed the data set 10 again.
Comme on l'a précédemment indiqué, le jeu de données initial peut être produit par un apprentissage supervisé simple et rapide. Dans une telle procédure, un opérateur visionne des images de la scène filmée et, à l'aide d'une interface graphique, annote les zones d'image correspondant à des instances des objets à reconnaître. Comme la procédure de configuration ultérieure est basée sur les variations morphologiques du modèle générique, l'opérateur peut avoir intérêt à annoter les objets exhibant les variations les plus importantes. Il peut ainsi annoter au moins deux objets dont les morphologies sont à des extrêmes opposés d'un domaine de variation qu'il aurait constaté visuellement. L'interface peut être conçue pour établir les équations de la projection du plan d'évolution avec l'assistance de l'opérateur. L'interface peut ensuite proposer à l'opérateur de conformer manuellement le modèle générique à des zones d'image, offrant à la fois une annotation et la création des premiers modèles dans la collection 16. As previously indicated, the initial data set can be produced by simple and fast supervised learning. In such a procedure, an operator views images of the filmed scene and, using a graphical interface, annotates the image areas corresponding to instances of the objects to be recognized. Since the subsequent configuration procedure is based on the morphological variations of the generic model, the operator may find it advantageous to annotate the objects exhibiting the largest variations. He can thus annotate at least two objects whose morphologies are at opposite extremes of a range of variation he has observed visually. The interface can be designed to establish the equations of the projection of the plane of evolution with the assistance of the operator. The interface can then propose that the operator manually conform the generic model to image areas, providing both an annotation and the creation of the first models in the collection 16.
Cette phase d'annotation est sommaire et rapide, l'objectif étant d'obtenir un jeu de données initial restreint permettant le démarrage de la phase de configuration automatique qui complétera le jeu de données. This annotation phase is brief and quick, the objective being to obtain a small initial data set allowing the start of the automatic configuration phase, which will complete the data set.
De nombreuses variantes et modifications des modes de réalisation décrits ici apparaîtront à l'homme du métier. Bien que ces modes de réalisation concernent essentiellement la détection de voitures, la voiture n'est présentée qu'à titre d'exemple d'objet que l'on souhaite reconnaître. Les principes décrits sont applicables à tout objet que l'on peut modéliser de façon générique, notamment des objets déformables, comme des animaux ou des humains. Many variations and modifications of the embodiments described herein will be apparent to those skilled in the art. Although these embodiments relate essentially to the detection of cars, the car is presented only as an example of an object that one wishes to recognize. The principles described are applicable to any object that can be modeled generically, including deformable objects, such as animals or humans.

Claims

Revendications claims
1. Procédé de configuration automatique d'un système de reconnaissance d'une classe d'objets de morphologie variable, comprenant les étapes suivantes :  A method of automatically configuring a system for recognizing a class of objects of variable morphology, comprising the following steps:
• prévoir un système d'apprentissage machine avec un jeu de données initial (10) suffisant pour reconnaître des instances d'objets de la classe dans une séquence d'images d'une scène cible ; Providing a machine learning system with an initial data set (10) sufficient to recognize instances of objects of the class in a sequence of images of a target scene;
• prévoir un modèle tridimensionnel générique spécifique à la classe d'objets, dont la morphologie est définissable par un jeu de paramètres ; • provide a generic three-dimensional model specific to the class of objects, the morphology of which can be defined by a set of parameters;
• acquérir une séquence d'images de la scène à l'aide d'une caméra (12) ; • acquire a sequence of images of the scene using a camera (12);
• reconnaître des instances d'image (14) d'objets de la classe dans la séquence d'images acquise en utilisant le jeu de données initial ; • recognize image instances (14) of objects of the class in the image sequence acquired using the initial dataset;
• conformer le modèle tridimensionnel générique (16) à des instances d'image reconnues (14) ; Conforming the generic three-dimensional model (16) to recognized image instances (14);
• enregistrer des plages de variation des paramètres (20) résultant des conformations du modèle générique ; • record ranges of variation of the parameters (20) resulting from the conformations of the generic model;
• synthétiser de multiples objets tridimensionnels (22) à partir du modèle générique en faisant varier les paramètres dans les plages de variation enregistrées ; et • synthesizing multiple three-dimensional objects (22) from the generic model by varying the parameters in the recorded ranges of variation; and
• compléter le jeu de données (10) du système d'apprentissage par des projections des objets synthétisés (24) dans le plan des images. • complete the data set (10) of the learning system by projections of the synthesized objects (24) in the plane of the images.
2. Procédé selon la revendication 1 , comprenant les étapes suivantes : The method of claim 1, comprising the steps of:
• définir des paramètres du modèle tridimensionnel générique par les positions relatives d'amers (K0-K9) d'un maillage du modèle, les positions des autres noeuds du maillage étant liées aux amers par des contraintes ; et • defining parameters of the generic three-dimensional model by the relative positions of landmarks (K0-K9) of a mesh of the model, the positions of the other nodes of the mesh being linked to the landmarks by constraints; and
• opérer des conformations du modèle tridimensionnel générique en positionnant des amers d'une projection du modèle dans le plan des images. • carrying out conformations of the generic three-dimensional model by positioning landmarks of a projection of the model in the plane of the images.
3. Procédé selon la revendication 1 , comprenant les étapes suivantes : The method of claim 1, comprising the steps of:
• enregistrer des textures (18) à partir de zones des instances d'image reconnues ; et Registering textures (18) from areas of the recognized image instances; and
• plaquer sur chaque objet synthétisé (22) une texture parmi les textures enregistrées. • mapping onto each synthesized object (22) one of the recorded textures.
4. Procédé selon la revendication 1 , dans lequel le jeu de données initial du système d'apprentissage est obtenu par un apprentissage supervisé impliquant au moins deux objets de la classe dont les morphologies sont à des extrêmes opposés d'un domaine de variation constaté des morphologies. The method of claim 1, wherein the initial set of data of the learning system is obtained by supervised learning involving at least two objects of the class whose morphologies are at opposite ends of a domain of variation found in morphologies.
EP17811644.8A 2016-12-14 2017-11-21 Object recognition system based on an adaptive 3d generic model Withdrawn EP3555802A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
FR1662455A FR3060170B1 (en) 2016-12-14 2016-12-14 OBJECT RECOGNITION SYSTEM BASED ON AN ADAPTIVE 3D GENERIC MODEL
PCT/FR2017/053191 WO2018109298A1 (en) 2016-12-14 2017-11-21 Object recognition system based on an adaptive 3d generic model

Publications (1)

Publication Number Publication Date
EP3555802A1 true EP3555802A1 (en) 2019-10-23

Family

ID=58501514

Family Applications (1)

Application Number Title Priority Date Filing Date
EP17811644.8A Withdrawn EP3555802A1 (en) 2016-12-14 2017-11-21 Object recognition system based on an adaptive 3d generic model

Country Status (9)

Country Link
US (1) US11036963B2 (en)
EP (1) EP3555802A1 (en)
JP (1) JP7101676B2 (en)
KR (1) KR102523941B1 (en)
CN (1) CN110199293A (en)
CA (1) CA3046312A1 (en)
FR (1) FR3060170B1 (en)
IL (1) IL267181B2 (en)
WO (1) WO2018109298A1 (en)

Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
FR3060170B1 (en) * 2016-12-14 2019-05-24 Smart Me Up OBJECT RECOGNITION SYSTEM BASED ON AN ADAPTIVE 3D GENERIC MODEL
US11462023B2 (en) 2019-11-14 2022-10-04 Toyota Research Institute, Inc. Systems and methods for 3D object detection
US11736748B2 (en) * 2020-12-16 2023-08-22 Tencent America LLC Reference of neural network model for adaptation of 2D video for streaming to heterogeneous client end-points
KR20230053262A (en) 2021-10-14 2023-04-21 주식회사 인피닉 A 3D object recognition method based on a 2D real space image and a computer program recorded on a recording medium to execute the same

Family Cites Families (25)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5397985A (en) * 1993-02-09 1995-03-14 Mobil Oil Corporation Method for the imaging of casing morphology by twice integrating magnetic flux density signals
DE10252298B3 (en) * 2002-11-11 2004-08-19 Mehl, Albert, Prof. Dr. Dr. Process for the production of tooth replacement parts or tooth restorations using electronic tooth representations
US7853085B2 (en) 2003-03-06 2010-12-14 Animetrics, Inc. Viewpoint-invariant detection and identification of a three-dimensional object from two-dimensional imagery
EP1811456B1 (en) * 2004-11-12 2011-09-28 Omron Corporation Face feature point detector and feature point detector
JP4653606B2 (en) * 2005-05-23 2011-03-16 株式会社東芝 Image recognition apparatus, method and program
JP2007026400A (en) 2005-07-15 2007-02-01 Asahi Engineering Kk Object detection/recognition system at place with sharp difference in illuminance using visible light and computer program
JP4991317B2 (en) * 2006-02-06 2012-08-01 株式会社東芝 Facial feature point detection apparatus and method
JP4585471B2 (en) * 2006-03-07 2010-11-24 株式会社東芝 Feature point detection apparatus and method
JP4093273B2 (en) * 2006-03-13 2008-06-04 オムロン株式会社 Feature point detection apparatus, feature point detection method, and feature point detection program
JP4241763B2 (en) * 2006-05-29 2009-03-18 株式会社東芝 Person recognition apparatus and method
JP4829141B2 (en) * 2007-02-09 2011-12-07 株式会社東芝 Gaze detection apparatus and method
US7872653B2 (en) * 2007-06-18 2011-01-18 Microsoft Corporation Mesh puppetry
US20100123714A1 (en) * 2008-11-14 2010-05-20 General Electric Company Methods and apparatus for combined 4d presentation of quantitative regional parameters on surface rendering
JP5361524B2 (en) * 2009-05-11 2013-12-04 キヤノン株式会社 Pattern recognition system and pattern recognition method
EP2333692A1 (en) * 2009-12-11 2011-06-15 Alcatel Lucent Method and arrangement for improved image matching
KR101697184B1 (en) * 2010-04-20 2017-01-17 삼성전자주식회사 Apparatus and Method for generating mesh, and apparatus and method for processing image
KR101681538B1 (en) * 2010-10-20 2016-12-01 삼성전자주식회사 Image processing apparatus and method
JP6026119B2 (en) * 2012-03-19 2016-11-16 株式会社東芝 Biological information processing device
GB2515510B (en) * 2013-06-25 2019-12-25 Synopsys Inc Image processing method
US9299195B2 (en) * 2014-03-25 2016-03-29 Cisco Technology, Inc. Scanning and tracking dynamic objects with depth cameras
US9633483B1 (en) * 2014-03-27 2017-04-25 Hrl Laboratories, Llc System for filtering, segmenting and recognizing objects in unconstrained environments
FR3021443B1 (en) * 2014-05-20 2017-10-13 Essilor Int METHOD FOR CONSTRUCTING A MODEL OF THE FACE OF AN INDIVIDUAL, METHOD AND DEVICE FOR ANALYZING POSTURE USING SUCH A MODEL
CN104182765B (en) * 2014-08-21 2017-03-22 南京大学 Internet image driven automatic selection method of optimal view of three-dimensional model
US10559111B2 (en) * 2016-06-23 2020-02-11 LoomAi, Inc. Systems and methods for generating computer ready animation models of a human head from captured data images
FR3060170B1 (en) * 2016-12-14 2019-05-24 Smart Me Up OBJECT RECOGNITION SYSTEM BASED ON AN ADAPTIVE 3D GENERIC MODEL

Also Published As

Publication number Publication date
CN110199293A (en) 2019-09-03
IL267181B (en) 2022-11-01
IL267181A (en) 2019-08-29
FR3060170B1 (en) 2019-05-24
US20190354745A1 (en) 2019-11-21
IL267181B2 (en) 2023-03-01
KR102523941B1 (en) 2023-04-20
WO2018109298A1 (en) 2018-06-21
FR3060170A1 (en) 2018-06-15
CA3046312A1 (en) 2018-06-21
JP7101676B2 (en) 2022-07-15
KR20190095359A (en) 2019-08-14
JP2020502661A (en) 2020-01-23
US11036963B2 (en) 2021-06-15

Similar Documents

Publication Publication Date Title
EP3555802A1 (en) Object recognition system based on an adaptive 3d generic model
FR2998401A1 (en) METHOD FOR 3D RECONSTRUCTION AND PANORAMIC 3D MOSAICKING OF A SCENE
CA2701698A1 (en) Method for synchronising video streams
EP3614306B1 (en) Method for facial localisation and identification and pose determination, from a three-dimensional view
WO2012007382A1 (en) Method for detecting a target in stereoscopic images by learning and statistical classification on the basis of a probability law
WO2005010820A2 (en) Automated method and device for perception associated with determination and characterisation of borders and boundaries of an object of a space, contouring and applications
FR3025918A1 (en) METHOD AND SYSTEM FOR AUTOMATED MODELING OF A PART
FR3002673A1 (en) METHOD AND DEVICE FOR THREE-DIMENSIONAL IMAGING OF A PARTIAL REGION OF THE ENVIRONMENT OF A VEHICLE
WO2019166743A1 (en) 3D scene modelling system by multi-view photogrammetry
WO2012117210A1 (en) Method and system for estimating a similarity between two binary images
EP3145405A1 (en) Method of determining at least one behavioural parameter
FR3083352A1 (en) METHOD AND DEVICE FOR FAST DETECTION OF REPETITIVE STRUCTURES IN THE IMAGE OF A ROAD SCENE
FR3033913A1 (en) METHOD AND SYSTEM FOR RECOGNIZING OBJECTS BY ANALYZING DIGITAL IMAGE SIGNALS OF A SCENE
EP3504683B1 (en) Method for determining the placement of the head of a vehicle driver
WO2018206331A1 (en) Method for calibrating a device for monitoring a driver in a vehicle
FR3039919A1 (en) TRACKING A TARGET IN A CAMERA NETWORK
WO2018011498A1 (en) Method and system for locating and reconstructing in real time the posture of a moving object using embedded sensors
WO2018015654A1 (en) Method and device for aiding the navigation of a vehicle
WO2014053437A1 (en) Method for counting people for a stereoscopic appliance and corresponding stereoscopic appliance for counting people
FR3015099A1 (en) SYSTEM AND DEVICE FOR ASSISTING AUTOMATED DETECTION AND MONITORING OF MOVING OBJECTS OR PEOPLE
FR3138951A1 (en) System and method for aiding the navigation of a mobile system by means of a model for predicting terrain traversability by the mobile system
EP1958157A1 (en) Method of bringing stereoscopic images into correspondence
FR3127295A1 (en) METHOD FOR AUTOMATIC IDENTIFICATION OF TARGET(S) WITHIN IMAGE(S) PROVIDED BY A SYNTHETIC APERTURE RADAR
FR3112401A1 (en) Method and system for stereoscopic vision of a celestial observation scene
WO2000004503A1 (en) Method for modelling three-dimensional objects or scenes

Legal Events

Date Code Title Description
STAA Information on the status of an ep patent application or granted ep patent

Free format text: STATUS: UNKNOWN

STAA Information on the status of an ep patent application or granted ep patent

Free format text: STATUS: THE INTERNATIONAL PUBLICATION HAS BEEN MADE

PUAI Public reference made under article 153(3) epc to a published international application that has entered the european phase

Free format text: ORIGINAL CODE: 0009012

STAA Information on the status of an ep patent application or granted ep patent

Free format text: STATUS: REQUEST FOR EXAMINATION WAS MADE

17P Request for examination filed

Effective date: 20190619

AK Designated contracting states

Kind code of ref document: A1

Designated state(s): AL AT BE BG CH CY CZ DE DK EE ES FI FR GB GR HR HU IE IS IT LI LT LU LV MC MK MT NL NO PL PT RO RS SE SI SK SM TR

AX Request for extension of the european patent

Extension state: BA ME

DAV Request for validation of the european patent (deleted)
DAX Request for extension of the european patent (deleted)
STAA Information on the status of an ep patent application or granted ep patent

Free format text: STATUS: EXAMINATION IS IN PROGRESS

17Q First examination report despatched

Effective date: 20210621

STAA Information on the status of an ep patent application or granted ep patent

Free format text: STATUS: THE APPLICATION IS DEEMED TO BE WITHDRAWN

18D Application deemed to be withdrawn

Effective date: 20211103