FR3104292A1 - Method of configuring an imaging device of a motor vehicle comprising an optical image capture device

Info

Publication number: FR3104292A1
Application number: FR1913748A
Authority: FR
Inventors: Thomas Hannagan; Thibault Fouqueray
Original assignee: PSA Automobiles SA
Current assignee: PSA Automobiles SA
Priority date: 2019-12-04
Filing date: 2019-12-04
Publication date: 2021-06-11
Anticipated expiration: 2039-12-04
Also published as: FR3104292B1

Abstract

The method comprises training a neural network RNe₁ associated with the optical device C1, the training comprising, for each index i: A) supplying N images IM_{n,i} respectively to N neural networks RNe_n, including the target network RNe₁ and N-1 auxiliary networks RNe₂, …, RNe_N, the images having been previously captured by N optical devices Cn, including a reference device C1' and N-1 auxiliary devices Cn mounted offset from C1'; B) encoding of the image IM_{n,i} by each network RNe_n; C) for each of the N-1 pairs of images (IM_{1,i}, IM_{n,i}), with n ranging from 2 to N: calculating the distance between the two codes CD_{1,i} and CD_{n,i}, predicting a class by binary classification of said distance, and calculating an error between the predicted class and a true class; D) adjusting the connection weights between neurons of the networks RNe₁ and RNe_n as a function of said error. Figure to be published with the abstract: Fig. 2

Description

Method of configuring an imaging device of a motor vehicle comprising an optical image capture device

The present invention relates generally to a method of configuring an imaging device of a motor vehicle comprising an optical device C1 and an associated encoder neural network RNe₁.

PRIOR ART

In the automotive field, more and more vehicles are equipped with so-called "smart" cameras, which capture images of the environment around the vehicle and use deep neural networks to detect and recognize objects. These neural networks are "deep": they have many layers of neurons and are trained on millions of input-output pairs in order to reach the performance deemed sufficient for a desired task, for example the generic recognition of an object such as a pedestrian. Currently, generic object recognition uses so-called residual convolutional neural networks, such as ResNeXt101 (Xie, Girshick, Dollár, Tu, & He (2017) Aggregated Residual Transformations for Deep Neural Networks. IEEE Conference on Computer Vision and Pattern Recognition (CVPR)).

These neural networks must be trained beforehand, during a learning phase. Typically, neural networks for smart cameras are trained in a supervised manner.

The idea of supervised learning of a neural network is to provide many training pairs, or input-output pairs, each containing input data and known output data, and to adjust the connection weights between neurons so as to minimize the error at the output of the neural network. In supervised learning, the neural network is thus trained by supplying matched pairs of input data and output data, so that the network provides the desired output for a given input.

Supervised learning of a neural network is very powerful, but also very costly in human time spent analyzing, annotating and verifying data. It requires the production of a very large database of known input-output pairs, analyzed and labeled (or tagged) by human experts. For example, training a neural network to detect vehicles requires millions of camera images in which the vehicles have already been detected by human experts and outlined with bounding boxes. This labeling work by human experts is expensive and, in this fully supervised approach, the quantity and quality of the labels (or tags) determine the performance of the neural network.

Deep neural networks provide excellent performance in visual recognition. However, this performance is limited by supervised training, which introduces a bias towards a certain type of solution: the network seeks to identify the non-linear combinations of pixels, known as "codes" or "features", that are most useful for solving the task, without seeking to form a prior representation of the object that is independent of the task. As a result, deep neural networks are vulnerable to attacks by adversarial images that would appear harmless to the human eye, as explained in the document "Eykholt, Evtimov, Fernandes, Li, Rahmati, Xiao, Prakash, Kohno, & Song (2018) Robust Physical-World Attacks on Deep Learning Visual Classification. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR)". Deep neural networks therefore have difficulty adapting to new tasks for which they have not been specifically trained, and are generally not very versatile. In addition, supervised training also turns out to limit the performance of deep networks even on the target task of the training, as explained in the document "Arandjelovic & Zisserman (2018) Objects that sound. European Conference on Computer Vision (ECCV)". A deep network first trained in a self-supervised manner, and then trained on a visual classification task by conventional supervision, outperforms a network of the same size trained by supervision alone.

More precisely, the model described by Arandjelovic & Zisserman is composed of two networks (visual and auditory) and is trained to determine whether audio and visual input data come from the same video sequence or not. The labels (or tags) for this task are binary, indicating a match or a non-match, and are constructed automatically from videos obtained from the "YouTube" site, without requiring any human decision. This type of learning is called "self-supervised".

An aim of the invention is to configure an imaging device of a motor vehicle comprising a sensor and at least one associated encoder neural network, by training the encoder neural network so as to develop more robust codes representative of the detected scenes.

To this end, and in a first aspect, the present invention relates to a method of configuring an imaging device of a motor vehicle comprising an optical image capture device C1 and an associated encoder neural network RNe₁, comprising a training phase of the encoder neural network RNe₁ that includes an iterative process comprising, for each iteration of index i, the steps of:
A) supplying N images IM_{n,i}, n varying from 1 to N, respectively to N encoder neural networks RNe_n, including the neural network RNe₁ associated with the optical device C1, as target network, and N-1 auxiliary neural networks RNe₂, …, RNe_N, the N images having been captured beforehand by N optical image capture devices Cn mounted on an image acquisition vehicle, including a reference optical device, corresponding to the optical device C1 and having a reference field of view, and N-1 auxiliary optical devices Cn, with n ranging from 2 to N, mounted on the acquisition vehicle offset from the reference optical device so that their respective fields of view overlap the reference field of view by more than 70%;
B) encoding the supplied image IM_{n,i} into a descriptor code CD_{n,i}, by each encoder neural network RNe_n,
C) for each of the N-1 pairs of images comprising image IM_{1,i} and image IM_{n,i}, with n ranging from 2 to N,
- calculating the distance between the two corresponding descriptor codes CD_{1,i} and CD_{n,i},
- predicting a class, representative of a match or non-match between the images IM_{1,i} and IM_{n,i} of the pair, by binary classification of said distance, and
- calculating an error between the predicted class and a true class given by a previously known target label associated with said pair of images IM_{1,i} and IM_{n,i},
D) adjusting the connection weights between neurons of the encoder neural networks RNe₁ and RNe_n, as a function of said error.
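
The iterative process of steps A) to D) can be sketched as follows. This is a minimal illustration under stated assumptions, not the patented implementation: each encoder is reduced to a single linear layer (the description specifies convolutional networks), the binary classifier is a single logistic unit applied to the distance, the error is a binary cross-entropy, and the weights are adjusted by plain gradient descent. The names `encode`, `train_step`, `clf` and `lr` are hypothetical.

```python
import math

def encode(weights, image):
    """Step B: map a flattened image to a descriptor code (linear stand-in for a CNN)."""
    return [sum(w * x for w, x in zip(row, image)) for row in weights]

def train_step(encoders, clf, images, labels, lr=0.05):
    """One iteration i over the N-1 pairs (IM_1, IM_n); returns the mean error.

    encoders[0] plays the role of the target network, encoders[1:] the auxiliary ones.
    labels[n-1] is 1 if images[0] and images[n] show the same scene, else 0.
    """
    total = 0.0
    for n in range(1, len(encoders)):
        # Steps A and B: feed the pair to its two encoders.
        code_1 = encode(encoders[0], images[0])
        code_n = encode(encoders[n], images[n])
        # Step C: Euclidean distance, binary classification, error.
        diff = [a - b for a, b in zip(code_1, code_n)]
        dist = math.sqrt(sum(v * v for v in diff) + 1e-12)
        p = 1.0 / (1.0 + math.exp(-(clf["w"] * dist + clf["b"])))  # P(same scene)
        y = labels[n - 1]
        total += -(y * math.log(p + 1e-12) + (1 - y) * math.log(1 - p + 1e-12))
        # Step D: back-propagate the error into the classifier and both encoders.
        g_z = p - y                  # derivative of the error w.r.t. the logit
        g_dist = g_z * clf["w"]      # derivative w.r.t. the distance
        clf["w"] -= lr * g_z * dist
        clf["b"] -= lr * g_z
        for j, dj in enumerate(diff):
            g_code = g_dist * dj / dist
            for k, x in enumerate(images[0]):
                encoders[0][j][k] -= lr * g_code * x
            for k, x in enumerate(images[n]):
                encoders[n][j][k] += lr * g_code * x
    return total / (len(encoders) - 1)
```

Calling `train_step` repeatedly on labeled pairs pulls the codes of matching views together and pushes the codes of non-matching views apart, which is the behavior the training phase aims for.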

According to the invention, a target neural network, corresponding to the encoder neural network dedicated to the optical device of the vehicle (on board the vehicle), and one or more auxiliary neural networks, dedicated to views different from that of the optical device of the vehicle, are trained to detect the correspondence between views. The term "different views" denotes views taken by auxiliary optical devices each having a field of view that is almost the same as that of the vehicle's optical device, the fields of view overlapping by more than 70%, but with an image capture angle different from that of the vehicle's optical device. The auxiliary optical devices are offset (spatially and, advantageously, angularly) with respect to the optical device C1. The target neural network and the auxiliary neural networks are then trained to detect a match or non-match between views, that is to say, whether the views correspond to the same scene or to different scenes. During step A, the images can be supplied N at a time to the neural networks RNe₁ to RNe_N, or two at a time (one image to the network RNe₁ and one image to one of the networks RNe₂ to RNe_N, the other networks being deactivated).
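
The 70% overlap condition can be illustrated with a deliberately simplified calculation. The text does not say how the overlap rate is measured; the sketch below assumes it is the shared fraction of the reference camera's horizontal angular field of view, and ignores the spatial offset between the cameras and the vertical field of view. The function name `overlap_rate` and its parameters are hypothetical.

```python
def overlap_rate(ref_yaw_deg, ref_fov_deg, aux_yaw_deg, aux_fov_deg):
    """Fraction of the reference horizontal field of view also seen by the auxiliary device.

    Each field of view is modeled as an angular interval centered on the device's yaw.
    """
    ref_lo, ref_hi = ref_yaw_deg - ref_fov_deg / 2, ref_yaw_deg + ref_fov_deg / 2
    aux_lo, aux_hi = aux_yaw_deg - aux_fov_deg / 2, aux_yaw_deg + aux_fov_deg / 2
    shared = max(0.0, min(ref_hi, aux_hi) - max(ref_lo, aux_lo))
    return shared / ref_fov_deg

# Two 60-degree cameras whose axes differ by 10 degrees share 50 of 60 degrees.
acceptable = overlap_rate(0, 60, 10, 60) > 0.70
```

Under this simplified model, two identical 60-degree cameras stay above the 70% threshold as long as their axes differ by less than 18 degrees.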

Advantageously, during the distance calculation of step C), a Euclidean distance is calculated between the two descriptor codes CD_{1,i} and CD_{n,i}.

The distance between the two codes CD_{1,i} and CD_{n,i} can be calculated by a distance neuron, which is connected to two binary output neurons corresponding respectively to the two alternatives: same scene (match between views) and different scenes (non-match between views).
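
A distance neuron feeding two output neurons might look like the sketch below. The Euclidean distance comes from the text; the softmax normalization of the two outputs and the default weights and biases are assumptions chosen only for illustration (in practice they would be learned), and the names `distance_neuron` and `match_outputs` are hypothetical.

```python
import math

def distance_neuron(code_a, code_b):
    """Euclidean distance between two descriptor codes of equal length."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(code_a, code_b)))

def match_outputs(distance, w_same=-1.0, b_same=1.0, w_diff=1.0, b_diff=-1.0):
    """Two output neurons fed by the distance; a softmax turns them into class probabilities."""
    z_same = w_same * distance + b_same
    z_diff = w_diff * distance + b_diff
    m = max(z_same, z_diff)  # subtract the maximum for numerical stability
    e_same, e_diff = math.exp(z_same - m), math.exp(z_diff - m)
    total = e_same + e_diff
    return {"same_scene": e_same / total, "different_scenes": e_diff / total}
```

With these illustrative weights, a small distance favors the "same scene" neuron and a large distance favors the "different scenes" neuron.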

The configuration method advantageously comprises an operation of generating a training database, including a step of capturing images with said N optical devices, namely the reference optical device, corresponding to the optical device C1, and the N-1 auxiliary optical devices Cn, with n ranging from 2 to N, mounted on the image acquisition vehicle.

The generation of the training database advantageously comprises the steps of:
- generating positive training samples, each positive training sample comprising an image captured by the reference optical device and at least one image captured by one of the auxiliary optical devices C2 to CN, said images corresponding to the same scene;
- creating, for each positive training sample, an associated label indicating that the images of said training sample correspond to the same scene,
said steps of generating positive training samples and creating the associated labels being implemented using timestamp data of the images.
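
One way to build positive samples from timestamps is sketched below. The text only states that timestamp data are used; the pairing rule (closest auxiliary frame within a tolerance), the tolerance value, and the `(timestamp, image_id)` frame representation are assumptions, and `positive_samples` is a hypothetical name.

```python
from bisect import bisect_left

def positive_samples(ref_frames, aux_frames, tolerance_s=0.02):
    """Pair each reference frame with the closest-in-time auxiliary frame.

    Frames are (timestamp_seconds, image_id) tuples sorted by timestamp.
    Returns (ref_id, aux_id, label) tuples with label 1 meaning "same scene".
    """
    aux_times = [t for t, _ in aux_frames]
    samples = []
    for t_ref, ref_id in ref_frames:
        k = bisect_left(aux_times, t_ref)
        # Only the neighbors just before and just after t_ref can be the closest.
        candidates = [j for j in (k - 1, k) if 0 <= j < len(aux_frames)]
        if not candidates:
            continue
        j = min(candidates, key=lambda idx: abs(aux_times[idx] - t_ref))
        if abs(aux_times[j] - t_ref) <= tolerance_s:
            samples.append((ref_id, aux_frames[j][1], 1))
    return samples
```

Because the label follows directly from the timestamps, no human annotation is needed for these samples.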

In a particular embodiment, the generation of the training database also comprises the steps of:
- generating negative training samples, each negative training sample comprising at least one image captured by the reference optical device and one image captured by one of the auxiliary optical devices C2 to CN, said images corresponding to different scenes;
- creating, for each negative training sample, an associated label indicating that the images of said training sample correspond to different scenes,
said steps of generating negative training samples and creating the associated labels being implemented using timestamp data of the images.
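
Negative samples can be drawn in a similar way. The sketch below assumes that two frames captured at least `min_gap_s` seconds apart show different scenes; this rule and its default value are illustrative assumptions (they would fail, for instance, while the acquisition vehicle is stationary), and `negative_samples` is a hypothetical name.

```python
import random

def negative_samples(ref_frames, aux_frames, min_gap_s=5.0, per_ref=1, seed=0):
    """Pair each reference frame with auxiliary frames captured far away in time.

    Frames are (timestamp_seconds, image_id) tuples.
    Returns (ref_id, aux_id, label) tuples with label 0 meaning "different scenes".
    """
    rng = random.Random(seed)  # seeded so the database can be regenerated identically
    samples = []
    for t_ref, ref_id in ref_frames:
        far = [aux_id for t_aux, aux_id in aux_frames if abs(t_aux - t_ref) >= min_gap_s]
        for aux_id in rng.sample(far, min(per_ref, len(far))):
            samples.append((ref_id, aux_id, 0))
    return samples
```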

Advantageously, the imaging device of the motor vehicle comprises a decoder for performing a specific decoding task, and a further training phase is provided during which said decoder is trained, in a supervised manner, on descriptor codes supplied by the previously trained target encoder neural network RNe₁.

The decoder may include a neural network to perform said specific decoding task.
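
This second training phase can be sketched as a small supervised classifier trained on the codes of a fixed encoder. Everything concrete here is an assumption made for illustration: the encoder is a hand-written stand-in (mean and range of pixel values) rather than a trained RNe₁, the decoder is a logistic regression rather than a neural network, and the binary label could stand for any task-specific annotation. The names `frozen_encoder`, `train_decoder` and `decode` are hypothetical.

```python
import math

def frozen_encoder(image):
    """Stand-in for the already-trained encoder: its parameters are never updated here."""
    return [sum(image) / len(image), max(image) - min(image)]

def train_decoder(images, labels, epochs=200, lr=0.5):
    """Supervised training of a logistic-regression decoder on descriptor codes."""
    codes = [frozen_encoder(im) for im in images]  # computed once, encoder untouched
    w, b = [0.0] * len(codes[0]), 0.0
    for _ in range(epochs):
        for code, y in zip(codes, labels):
            z = sum(wi * ci for wi, ci in zip(w, code)) + b
            p = 1.0 / (1.0 + math.exp(-z))
            # Gradient step on the decoder parameters only.
            w = [wi - lr * (p - y) * ci for wi, ci in zip(w, code)]
            b -= lr * (p - y)
    return w, b

def decode(w, b, image):
    """Apply the encoder, then the trained decoder, and return the predicted class."""
    code = frozen_encoder(image)
    return 1 if sum(wi * ci for wi, ci in zip(w, code)) + b > 0 else 0
```

Only the decoder parameters `w` and `b` change during this phase, so the labeled data are spent on the task-specific part alone.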

The optical device C1 may be of one of the following types: camera, lidar or radar. The auxiliary optical devices C2 to CN may be of the same type as the optical device C1.

Another aspect of the invention relates to a system for configuring an imaging device of a motor vehicle comprising an optical image capture device C1 and an associated encoder neural network RNe₁, comprising a system for training the encoder neural network RNe₁ by implementing an iterative process which comprises, for each iteration of index i:
A) supplying N images IM_{n,i}, n varying from 1 to N, respectively to N encoder neural networks RNe_n, including the neural network RNe₁ associated with the optical device C1, as target network, and N-1 auxiliary neural networks RNe₂, …, RNe_N, the N images having been captured beforehand by N optical image capture devices Cn mounted on an image acquisition vehicle, including a reference optical device, corresponding to the optical device C1 and having a reference field of view, and N-1 auxiliary optical devices Cn, with n ranging from 2 to N, mounted on the acquisition vehicle offset from the reference optical device so that their respective fields of view overlap the reference field of view by more than 70%;
B) encoding the supplied image IM_{n,i} into a descriptor code CD_{n,i}, by each encoder neural network RNe_n,
C) for each of the N-1 pairs of images comprising image IM_{1,i} and image IM_{n,i}, with n ranging from 2 to N,
- calculating the distance between the two corresponding descriptor codes CD_{1,i} and CD_{n,i},
- predicting a class, representative of a match or non-match between the images IM_{1,i} and IM_{n,i} of the pair, by binary classification of said distance, and
- calculating an error between the predicted class and a true class given by a previously known target label associated with said pair of images IM_{1,i} and IM_{n,i},
D) adjusting the connection weights between neurons of the encoder neural networks RNe₁ and RNe_n, as a function of said error.

Where the imaging device of the motor vehicle comprises a decoder for performing a specific decoding task, said system advantageously comprises a further training system configured to train said decoder, in a supervised manner, on descriptor codes supplied by the previously trained encoder neural network RNe₁.

BRIEF DESCRIPTION OF THE FIGURES

Other characteristics and advantages of the present invention will emerge more clearly on reading the detailed description which follows, which presents various embodiments of the invention given as non-limiting examples and illustrated by the appended figures, in which:

represents an imaging device of a motor vehicle;

schematically represents a first phase of training a target encoder neural network of the imaging device of figure 1, using auxiliary encoder neural networks;

schematically represents a second phase of training a so-called "decoding" neural network intended to carry out a specific decoding task;

represents a flowchart of the first training phase of figure 2; and

represents a flowchart of an initial step of generating a training database.

DETAILED DESCRIPTION

The present invention relates to a method and a system for configuring an imaging device 1 of a motor vehicle V.

The imaging device 1, shown in figure 1, is installed on board the motor vehicle V after configuration. It comprises an optical image capture device C1 and an image processing module 20.

The optical device C1 can be an optical camera, for example a monocular optical camera. It comprises an image sensor 11 and an optical system 12. The image sensor 11 is, for example, a two-dimensional image sensor of the CCD ("Charge-Coupled Device") or CMOS ("Complementary Metal-Oxide Semiconductor") type. The optical system 12 is adapted to form an image of at least part of the external environment of the vehicle on the image sensor 11.

The processing module 20 comprises a computer 21, or a microprocessor, and a storage module 22 in which a software module for real-time processing of the images captured by the optical image capture device C1 is implemented. The processing module 20 performs visual recognition and carries out at least one target task, for example pedestrian identification. It could carry out other target tasks related to visual recognition, for example identification of vehicles, bicycles or traffic signs.

Functionally, the processing module 20 comprises a neural network, called the "encoder" and denoted "RNe₁", associated with the optical device C1. In operation, the neural network RNe₁ receives as input images captured by the optical device C1 and outputs corresponding codes, also called descriptor codes, representative of the environment of the vehicle V, or of at least part of that environment, as seen by the optical device C1. The neural network RNe₁ is here a convolutional neural network.

The processing module 20 also comprises at least one decoder DCD specific to the target task (here a pedestrian identification task). In the exemplary embodiment described here, the decoder DCD comprises a decoding neural network, denoted "RNd".

The processing module 20 can comprise several decoders specific to several target tasks, for example identifying pedestrians, vehicles, bicycles or traffic signs. Each of these specific decoders is based on a decoding neural network and uses the codes supplied by the encoder neural network RNe₁ associated with the optical device C1.

The method of configuring the imaging device 1, according to a particular embodiment, will now be described with reference to Figures 2 to 4.

With reference to Figure 5, the method of configuring the imaging device 1 comprises a preliminary operation E0 of generating a learning database, also called a training database.

The generation E0 of the training database comprises an initial step E01 of capturing images, followed by a step E02 of creating training samples.

The image capture is carried out by N optical image capture devices (N being greater than or equal to 2), including a reference optical device, denoted C1’, corresponding to the optical device C1 (that is, either the optical device C1 itself or a similar optical device) and N-1 ancillary optical image capture devices C2, …, CN, all mounted on the same motor vehicle, during one or more driving sessions. The vehicle used to acquire the images that make up the training database is not necessarily the final vehicle V. It may be a vehicle dedicated to image acquisition. During each driving session, a large number of images are captured by the reference optical device and the ancillary optical devices C2 to CN and stored with timestamp data indicating the instants at which the images were captured. Typically, several tens of thousands of images (or views), advantageously coming from several driving sessions, are captured and stored.

During this initial image capture step E01, the reference optical device C1’ is mounted on the vehicle in a reference position. This reference position advantageously corresponds to the position of the optical device C1 on the vehicle V. The N-1 ancillary optical devices C2, …, CN are mounted offset from the reference optical device. The offset of each ancillary optical device relative to the reference optical device is such that each of the N-1 respective fields of view of the N-1 ancillary optical devices C2 to CN has an overlap rate greater than 70% with the reference field of view of the reference optical device. The N-1 ancillary optical devices C2 to CN can be spatially offset from the reference position of the reference optical device and have image capture angles different from that of the reference optical device, such that the fields of view largely overlap (with an overlap rate greater than 70% between the reference field of view and that of each of the ancillary devices C2 to CN). The respective offsets of the N-1 ancillary optical devices relative to the reference optical device are advantageously different from one another. The N optical devices are advantageously of the same type, here of the camera type. As a variant, the N optical devices can be of the lidar type or of the radar type. The images taken by the N optical devices do not necessarily have the same format and/or the same resolution. More generally, the images captured by the N optical devices can be decorrelated by varying one or more image and/or image capture characteristics (for example brightness and/or format), in order to promote the production of codes with a high level of abstraction.

Step E01 of capturing and storing images is followed by step E02 of generating training samples, comprising positive samples and negative samples.

Each training sample comprises at least two images, including one image captured by the reference optical device C1’ and at least one other image captured by one of the N-1 ancillary optical devices Cn, with n ranging from 2 to N.

As a variant, each training sample could comprise N images, respectively captured by the N optical devices C1’ and C2 to CN.

The training samples generated comprise positive samples, obtained by selecting images or views of the same scene. These images are selected on the basis of the timestamp data associated with the images, two different images being considered to correspond to the same scene if they were captured at the same instant or within a reduced time interval, preferably less than 50 milliseconds.

The samples generated can also comprise negative samples, obtained by selecting images or views of different scenes, that is, images captured at different instants or not falling within the predefined reduced time interval, on the basis of the timestamp data associated with the images.

For each training sample, a label classifying this sample is created and added to, or associated with, the sample during a step E03. This label classifies the sample as positive or negative. It corresponds to correspondence or non-correspondence information for the images of the sample, also called the "sample class". It is, for example, a binary item of information with a value equal to 1 or 0. By convention, in the exemplary embodiment described here, class "1" indicates that the images of the sample are views of the same scene, and class "0" indicates that the images of the sample are views of different scenes. Class 1 is therefore assigned to positive samples, and class 0 to negative samples.
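The labelling rule of steps E02 and E03 can be illustrated with a minimal Python sketch. The `Image` record, the function names and the use of seconds for timestamps are assumptions made for illustration; only the 50 millisecond interval and the 1/0 class convention come from the description.

```python
from dataclasses import dataclass

MAX_GAP_S = 0.050  # reduced time interval from the description: 50 milliseconds


@dataclass(frozen=True)
class Image:
    camera: str       # e.g. "C1'" for the reference device, "C2" for an ancillary one
    timestamp: float  # capture instant in seconds (hypothetical unit)


def label_pair(ref: Image, other: Image, max_gap_s: float = MAX_GAP_S) -> int:
    """Return class 1 if both images are views of the same scene, else class 0."""
    return 1 if abs(ref.timestamp - other.timestamp) < max_gap_s else 0


def build_samples(ref_images, ancillary_images):
    """Pair every reference image with every ancillary image and label the pair."""
    return [
        (ref, other, label_pair(ref, other))
        for ref in ref_images
        for other in ancillary_images
    ]


refs = [Image("C1'", 0.000), Image("C1'", 1.000)]
others = [Image("C2", 0.010), Image("C2", 1.020)]
samples = build_samples(refs, others)
labels = [label for _, _, label in samples]
```

Pairing every reference image with every ancillary image is only one possible sampling strategy; in practice the ratio of negative to positive samples would probably be controlled.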

The training samples and their associated labels are stored in the training database during a step E04.

The training database generated therefore comprises positive training samples, each associated with a label indicating the real class of the sample obtained on the basis of the timestamp data of the images (here class 1), and negative training samples, each associated with a label indicating the real class of the sample obtained on the basis of the timestamp data of the images (here class 0).

With reference to Figure 4, the configuration method comprises a first training phase Ph1 of the encoder neural network RNe₁ associated with the optical device C1. The training phase Ph1 is an iterative learning process. An iteration index, initially equal to 1, is denoted "i". At each iteration of index i, steps A, B, C and D, described below, are executed by a first training system having access to the training database.

With reference to Figure 2, the first training phase Ph1 uses N encoder neural networks RNe_n, including the neural network RNe₁ associated with the optical device C1, as the target network, and N-1 ancillary neural networks RNe₂, …, RNe_N respectively associated with the N-1 ancillary optical devices C2 to CN, whose respective positions are offset from the reference optical device C1’. The N neural networks RNe_n, with n ranging from 1 to N, used during the training phase Ph1 to train the target encoder neural network RNe₁, are dedicated to processing the images from the optical devices C1 to CN respectively.

In a particular exemplary embodiment, the networks RNe₁ to RNe_N each have between five and ten processing levels. For example, the neural architecture of each of the networks RNe₁ to RNe_N is based on the "AVEnet" network described in the document "Arandjelovic & Zisserman (2018) Objects that sound. European Conference on Computer Vision (ECCV)". Following the architecture presented in that document, each processing level is itself composed of seven operations: convolution, batch normalization, linear rectification, convolution, batch normalization, linear rectification, and pooling (downsampling by a "Max" operation). In each network, the successive levels comprise an equal or increasing number of channels, starting from 64 channels for the first level and reaching 512 channels for the last level. In these successive levels, each channel implements a convolution kernel with a constant 3x3 receptive field, with an encroachment of less than 2 units across the levels. The pooling operator uses a constant radius of 2x2 units for all levels except the last, whose radius must be adapted to the number of processing levels in order to obtain channels of dimension 1x1. The upper part of each convolutional network consists of two fully connected layers with ReLU activation functions.
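The way the last pooling radius depends on the number of levels can be worked through with a short Python sketch. It assumes a square input, convolutions that preserve the spatial size, and integer halving at each 2x2 pooling; the 224-pixel input and the example channel plans are hypothetical values, not taken from the description.

```python
def last_pool_size(input_size: int, n_levels: int, pool: int = 2) -> int:
    """Spatial size reaching the last level, i.e. the pooling size that
    reduces each channel of the last level to 1x1."""
    size = input_size
    for _ in range(n_levels - 1):  # every level except the last pools by 2x2
        size //= pool
    return size


def is_valid_channel_plan(channels) -> bool:
    """Check the constraints stated above: start at 64, end at 512,
    never decrease from one level to the next."""
    return (
        channels[0] == 64
        and channels[-1] == 512
        and all(a <= b for a, b in zip(channels, channels[1:]))
    )


five_levels = last_pool_size(224, 5)  # 224 -> 112 -> 56 -> 28 -> 14
six_levels = last_pool_size(224, 6)   # one more halving: 7
plan_ok = is_valid_channel_plan([64, 128, 256, 512, 512])
plan_bad = is_valid_channel_plan([64, 256, 128, 512, 512])
```

Under these assumptions a five-level network would need a 14x14 pooling window in its last level, and a six-level network a 7x7 window.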

During step A, the training database supplies at least N images IM_{n,i}, with n varying from 1 to N, respectively to the N neural networks RNe₁ to RNe_N. These images IM_{1,i} to IM_{N,i} come from the training samples, whose images were previously captured by the N optical devices C1’, C2 to CN and then stored. Depending on whether the images come from positive or negative training samples, each of the images IM_{2,i} to IM_{N,i} from the ancillary optical devices C2 to CN is either in correspondence with the image IM_{1,i} from the reference optical device C1’ (i.e. each of the images IM_{2,i} to IM_{N,i} and the image IM_{1,i} are two views of the same scene), or in non-correspondence with the image IM_{1,i} (i.e. each of the images IM_{2,i} to IM_{N,i} and the image IM_{1,i} correspond to two different scenes).

The N images IM_{1,i} to IM_{N,i} can be supplied N at a time to the neural networks RNe₁ to RNe_N. In other words, at each iteration i, the N images are supplied in parallel to the N neural networks RNe₁ to RNe_N. As a variant, the N images can be supplied two at a time, that is, to the network RNe₁ and to one of the networks RNe₂ to RNe_N, while deactivating the unused networks, so as to limit interference during learning.

The next step B is a coding step during which each neural network RNe_n, with n ranging from 1 to N, calculates a code CD_{n,i} from the image IM_{n,i} received as input, and supplies this code CD_{n,i} as output. The code CD_{n,i} is a descriptor code that corresponds to a vector-type representation of the scene captured in the image IM_{n,i}.

Then, during step C, for each of the N-1 pairs of images comprising the image IM_{1,i} and one of the images IM_{2,i} to IM_{N,i}, the first training system:
- calculates the distance between the two corresponding codes CD_{1,i} and CD_{n,i} (n ranging from 2 to N), during a sub-step C-1;
- then, by binary classification of this calculated distance, predicts the class of this pair of images IM_{1,i} and IM_{n,i} (with n ranging from 2 to N), representative of correspondence or non-correspondence information for the two images of the pair, during a sub-step C-2.

The distance between the two codes CD_{1,i} and CD_{n,i} is, for example, the Euclidean distance between two vector-type codes. It is advantageously calculated by a distance neuron.

The binary classification consists in assigning to the pair of images (IM_{1,i}; IM_{n,i}) either class 1 or class 0, as the predicted class, according to the calculated distance. For example, if the calculated distance is equal to zero, the predicted class assigned is 1, and if the calculated distance is different from zero, the predicted class assigned is 0. To perform the classification, each distance neuron is connected to two output neurons, with a "softmax" activation function, the two output neurons corresponding respectively to class 1 and class 0 (in other words, to the two alternatives of a same scene and of two different scenes). The network thus learns to categorize the pair of images as identical or different, solely on the basis of the distance calculated between the two codes. These connections constitute a perceptron network with bias, whose weights are learned during training.
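Sub-steps C-1 and C-2 can be sketched in plain Python as a distance neuron feeding two softmax output neurons. The weights and biases below are hand-picked, hypothetical values chosen so that a small distance favours class 1; in the method described they are learned during training.

```python
import math


def euclidean_distance(code_a, code_b):
    """Sub-step C-1: Euclidean distance between two vector-type codes."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(code_a, code_b)))


def softmax(logits):
    """Numerically stable softmax over a short list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def predict_class(distance, weights, biases):
    """Sub-step C-2: two output neurons fed by the distance neuron.
    Index 0 scores class 0 (different scenes), index 1 scores class 1 (same scene)."""
    logits = [w * distance + b for w, b in zip(weights, biases)]
    probs = softmax(logits)
    return (1 if probs[1] > probs[0] else 0), probs


# Hypothetical perceptron-with-bias parameters (learned in practice).
WEIGHTS = (1.0, -1.0)
BIASES = (-0.5, 0.5)

d_same = euclidean_distance([0.2, 0.4], [0.2, 0.4])  # identical codes
d_diff = euclidean_distance([0.0, 0.0], [3.0, 4.0])  # distant codes
class_same, _ = predict_class(d_same, WEIGHTS, BIASES)
class_diff, _ = predict_class(d_diff, WEIGHTS, BIASES)
```

With these illustrative parameters the decision boundary sits at a distance of 0.5, so the classifier tolerates small non-zero distances between views of the same scene.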

During a sub-step C-3, the first training system calculates a prediction error between the predicted class and the real class given by the label associated with the training sample from which the images IM_{1,i} and IM_{n,i} originate.

The training system checks the prediction error during a sub-step C-4. If the prediction error is significant, the method proceeds to step D. If the prediction error is not significant, the method interrupts the loop (for the neural network RNe_n concerned) and then proceeds to the second training phase Ph2, described below.

The next step D is a step of adjusting, or updating, the connection weights between neurons of the encoder neural networks RNe₁ and RNe_n, as a function of the prediction error calculated during step C. Such an operation of adjusting the connection weights of neural networks is well known to those skilled in the art. The connection weights are adjusted so as to reduce the prediction error made by the neural network in its current state. A gradient descent algorithm can be used for this purpose. For example, the supervised gradient descent algorithm ADAM is used [Kingma & Ba (2015) Adam: A Method for Stochastic Optimization. Proceedings of the International Conference for Learning Representations, San Diego (ICLR)] with a weight decay parameter of 10^-5 and a learning rate parameter to be determined by grid search. A cost function, or error function, is used. For example, a function of the "binary cross-entropy" type, with logits, is used, defined by the following equation:

where
- y_i represents the class predicted during step C;
- t_i represents the real class given by the label associated with the training sample used.
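For reference, the textbook form of binary cross-entropy with logits, written with the symbols y_i and t_i defined above, can be sketched as follows. That the expression intended here matches this form exactly is an assumption; M (the number of samples considered), z_i (the raw output, or logit, for class 1) and σ (the sigmoid function) are notation introduced only for this illustration, and y_i is read as the predicted probability of class 1.

```latex
E = -\frac{1}{M} \sum_{i=1}^{M} \Bigl[ t_i \log y_i + (1 - t_i) \log (1 - y_i) \Bigr],
\qquad y_i = \sigma(z_i) = \frac{1}{1 + e^{-z_i}}
```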

The first training phase Ph1 (comprising steps A, B, C and D repeated iteratively) is carried out until the error functions no longer indicate a significant error, that is, until the prediction errors (sub-step C-3) satisfy a stopping criterion, for all the neural networks. In practice, the aim is to reduce the error as much as possible. A conventional stopping criterion may be the following: no change to x decimal places in the error function over the last n evaluations, with for example x = 4 decimal places and n = 10 iterations. The weights of the connections between neurons of these encoder networks RNe₁ to RNe_N are then frozen.
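The stopping criterion given as an example can be sketched in a few lines of Python. The function name and the use of rounding to compare values "to x decimal places" are assumptions; the defaults x = 4 and n = 10 are those quoted above.

```python
def should_stop(errors, decimals: int = 4, window: int = 10) -> bool:
    """Return True when the error function shows no change, to `decimals`
    decimal places, over the last `window` evaluations."""
    if len(errors) < window:
        return False  # not enough evaluations yet
    recent = [round(e, decimals) for e in errors[-window:]]
    return all(r == recent[0] for r in recent)


# Error history that first decreases, then plateaus around 0.1234.
history = [0.5, 0.4] + [0.12341 + k * 1e-7 for k in range(10)]
plateau_reached = should_stop(history)
too_short = should_stop([0.5, 0.4, 0.3])
still_moving = should_stop([1.0 / (k + 1) for k in range(10)])
```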

The first training phase Ph1 is followed by a second phase Ph2 for training the decoding neural network RNd of the decoder DCD. During this second training phase Ph2, the decoder neural network RNd is trained, in a supervised manner, on codes supplied by the encoder neural network RNe_1, previously trained during the first training phase Ph1, whose connections between neurons are now frozen. FIG. 3 shows the architecture used during the second training phase. The representations contained in the codes output by the encoder neural network RNe_1 are used to train, in a supervised manner, the specific decoder neural network RNd according to the target task to be performed (for example pedestrian identification). The second training phase Ph2 is implemented by a second training system.
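
The principle of the second phase, in which only the decoder is updated while the encoder stays fixed, can be sketched as follows. This is a minimal illustration under stated assumptions: the encoder is replaced by a fixed linear map, the decoder by a single logistic unit trained by plain stochastic gradient descent, and all names (frozen_encoder, train_decoder, predict) and hyperparameters are hypothetical.

```python
import math

def frozen_encoder(image, enc_weights):
    """Stand-in for the trained encoder: a fixed linear map, never updated here."""
    return [sum(w * p for w, p in zip(row, image)) for row in enc_weights]

def train_decoder(images, labels, enc_weights, lr=0.5, epochs=200):
    """Fit a logistic decoder on the codes produced by the frozen encoder."""
    codes = [frozen_encoder(img, enc_weights) for img in images]
    w = [0.0] * len(codes[0])
    b = 0.0
    for _ in range(epochs):
        for code, t in zip(codes, labels):
            z = sum(wi * ci for wi, ci in zip(w, code)) + b
            y = 1.0 / (1.0 + math.exp(-z))
            grad = y - t  # derivative of binary cross-entropy w.r.t. z
            # Only the decoder parameters w and b are adjusted.
            w = [wi - lr * grad * ci for wi, ci in zip(w, code)]
            b -= lr * grad
    return w, b

def predict(image, enc_weights, w, b):
    """Encode the image with the frozen encoder, then apply the decoder."""
    code = frozen_encoder(image, enc_weights)
    return 1 if sum(wi * ci for wi, ci in zip(w, code)) + b >= 0 else 0
```

Because the encoder weights are never written to, the labeled data is used only to fit the comparatively small decoder.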

The neural network RNd is trained, in a known manner, for example by a supervised gradient descent algorithm such as ADAM. In an exemplary embodiment, the neural network RNd is a multilayer perceptron with ReLU activation functions.
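
A forward pass through a multilayer perceptron with ReLU activations can be sketched as follows. The layer representation (a list of weight-matrix and bias-vector pairs), the linear output layer and the function names are assumptions of this sketch; the description does not specify the number or size of the layers.

```python
def relu(values):
    """Rectified linear unit applied element-wise."""
    return [max(0.0, v) for v in values]

def dense(values, weights, biases):
    """Fully connected layer; weights holds one row per output neuron."""
    return [sum(w * v for w, v in zip(row, values)) + b
            for row, b in zip(weights, biases)]

def decoder_forward(code, layers):
    """Apply ReLU hidden layers followed by a linear output layer."""
    hidden = code
    for weights, biases in layers[:-1]:
        hidden = relu(dense(hidden, weights, biases))
    weights, biases = layers[-1]
    return dense(hidden, weights, biases)
```

The input of decoder_forward would be a descriptor code output by the encoder, and its output a task-specific score.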

Alternatively, the decoder DCD can also be implemented by a support vector machine.

As previously, the decoder DCD is advantageously trained on a large volume of codes, obtained by presenting to the encoder neural network RNe_1 a large number of images captured by a reference optical device, which is either the optical device C1 or a similar optical device. These images can be acquired during a driving session with an image acquisition vehicle equipped with the reference optical device. These images must be labeled by a human expert, according to the task to be accomplished.

The architecture illustrated in FIG. 3, after training of the neural network RNe_1 and of the decoder DCD, corresponds to the final system on board the vehicle V.

The invention is advantageous compared with fully supervised methods, either by reducing the training set while obtaining equivalent performance at a lower labeling cost, or by keeping the same training set while obtaining higher performance at the same labeling cost.

The present invention also relates to a system for configuring the imaging device 1 of a motor vehicle, said imaging device comprising the optical image capture device C1, the associated encoder neural network RNe_1 and the decoder DCD specific to a target task, the system being configured to implement the configuration method described above.

This configuration system comprises a first system for training the encoder neural network RNe_1 by implementing an iterative process which comprises, for each iteration of index i:
A) the supply of N images IM_{n,i}, n varying from 1 to N, respectively to N encoder neural networks RNe_n, including the neural network RNe_1 associated with the optical device C1, as target network, and N-1 ancillary neural networks RNe_2, …, RNe_N, the N images having been previously captured by N optical image capture devices Cn, including a reference optical device corresponding to the optical device C1, having a reference field of vision, and N-1 ancillary optical devices Cn, with n ranging from 2 to N, mounted on the vehicle offset with respect to the reference optical device so as to have respective fields of vision exhibiting an overlap rate with the reference field of vision greater than 70%;
B) the encoding of the supplied image IM_{n,i} into a descriptor code CD_{n,i}, by each encoder neural network RNe_n;
C) for each of the N-1 pairs of images including the image IM_{1,i} and the image IM_{n,i}, with n ranging from 2 to N,
- the calculation of the distance between the two corresponding descriptor codes CD_{1,i} and CD_{n,i},
- the prediction of a class, representative of correspondence or non-correspondence information of the images IM_{1,i} and IM_{n,i} of the pair, by binary classification of said distance, and
- the calculation of an error between the predicted class and a real class given by a previously known target label associated with said pair of images IM_{1,i} and IM_{n,i};
D) the adjustment of the connection weights between neurons of the encoder neural networks RNe_1 and RNe_n, as a function of said error,
as previously described.
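
Step C, applied to a single pair of descriptor codes, can be sketched as follows. This is a simplified illustration under stated assumptions: a single logistic unit with hypothetical parameters w and b stands in for the binary classification of the distance, the label is encoded as 1 for "same scene" and 0 for "different scenes", and the error is a binary cross-entropy. The weight adjustment of step D is not shown.

```python
import math

def euclidean_distance(code_a, code_b):
    """Euclidean distance between two descriptor codes."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(code_a, code_b)))

def classify_pair(code_ref, code_aux, label, w, b):
    """Distance, predicted class and error for one pair of codes."""
    d = euclidean_distance(code_ref, code_aux)
    z = w * d + b                    # logit computed from the distance alone
    predicted = 1 if z >= 0 else 0   # binary classification of the distance
    # Binary cross-entropy between prediction and label, in stable form.
    error = max(z, 0.0) - z * label + math.log1p(math.exp(-abs(z)))
    return d, predicted, error
```

With a negative w, a small distance yields the class "same scene" and a large distance the class "different scenes", so the error can only be reduced by adjusting the encoders so that matching views produce nearby codes.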

The configuration system also comprises a second training system configured to train the decoder DCD, in a supervised manner, on codes supplied by the previously trained encoder neural network RNe_1.

The training of the neural networks is carried out off-board, or "offline".

Although the subject matter of the present invention has been described with reference to specific examples, various obvious modifications and/or improvements could be made to the described embodiments without departing from the spirit and scope of the invention.

The present invention also relates to a computer program comprising program code instructions for controlling the execution of the steps of the configuration method described above, when said program is executed on a computer.

Claims

Method of configuring an imaging device (1) of a motor vehicle comprising an optical image capture device C1 and an associated encoder neural network RNe_1, comprising a training phase of the encoder neural network RNe_1 comprising an iterative process which comprises, for each iteration of index i, the steps of:
A) supply of N images IM_{n,i}, n varying from 1 to N, respectively to N encoder neural networks RNe_n, including the neural network RNe_1 associated with the optical device C1, as target network, and N-1 neural networks RNe_2, …, RNe_N, the N images having been captured beforehand by N optical image capture devices Cn mounted on an image acquisition vehicle, including a reference optical device, corresponding to the optical device C1, having a reference field of vision, and N-1 ancillary optical devices Cn, with n ranging from 2 to N, mounted on the acquisition vehicle offset from the reference optical device so as to have respective fields of vision exhibiting an overlap rate with the reference field of vision greater than 70%;
B) encoding of the supplied image IM_{n,i} into a descriptor code CD_{n,i}, by each encoder neural network RNe_n;
C) for each of the N-1 pairs of images including the image IM_{1,i} and the image IM_{n,i}, with n ranging from 2 to N,
- calculation of the distance between the two corresponding descriptor codes CD_{1,i} and CD_{n,i},
- prediction of a class, representative of correspondence or non-correspondence information of the images IM_{1,i} and IM_{n,i} of the pair, by binary classification of said distance, and
- calculation of an error between the predicted class and a real class given by a previously known target label associated with said pair of images IM_{1,i} and IM_{n,i};
D) adjustment of the connection weights between neurons of the encoder neural networks RNe_1 and RNe_n, as a function of said error.

Method according to claim 1, characterized in that, during step C), a Euclidean distance is calculated between the two descriptor codes CD_{1,i} and CD_{n,i}.

Method according to claim 1 or 2, characterized in that the distance between the two codes CD_{1,i} and CD_{n,i} is calculated by a distance neuron, which is connected to two binary output neurons, corresponding respectively to the two alternatives "same scene" and "different scenes", the network learning to categorize the pair of images as identical or different on the sole basis of the distance calculated between the two codes.

Method according to one of claims 1 to 3, characterized in that it comprises the generation of a training database comprising a step of capturing images by said N optical devices Cn, including the reference optical device, corresponding to the optical device C1, and the N-1 ancillary optical devices Cn, with n ranging from 2 to N, during at least one driving session of the acquisition vehicle.

Method according to Claim 4, characterized in that it comprises the steps of:
- generation of positive training samples, each positive training sample comprising at least one image captured by the reference optical device corresponding to the optical device C1 and an image captured by one of the auxiliary optical devices C2 to CN, said images corresponding to the same scene;
- creation, for each positive training sample, of an associated label indicating that the images of said training sample correspond to the same scene,
said steps of generating positive training samples and creating associated labels being implemented using time stamp data of the images.

Method according to claim 4 or 5, characterized in that it comprises the steps of:
- generation of negative training samples, each negative training sample comprising at least one image captured by the reference optical device corresponding to the optical device C1 and an image captured by one of the auxiliary optical devices C2 to CN, said images corresponding to different scenes;
- creation, for each negative training sample, of an associated label indicating that the images of said training sample correspond to different scenes,
said steps of generating negative training samples and creating associated labels being implemented using time stamp data of the images.

Method according to one of claims 1 to 6, characterized in that the imaging device of the motor vehicle comprises a decoder for performing a specific decoding task, and another training phase is provided during which said decoder is trained, in a supervised manner, on descriptor codes supplied by the previously trained target encoder neural network RNe_1.

Method according to Claim 7, characterized in that the decoder comprises a neural network for performing said specific decoding task.

Method according to one of claims 1 to 8, characterized in that the optical device C1 is of one of the types comprising a camera, a lidar and a radar, and the auxiliary optical devices C2 to CN are of the same type as the optical device C1.

System for configuring an imaging device of a motor vehicle comprising an optical image capture device C1 and an associated encoder neural network RNe_1, comprising a system for training the encoder neural network RNe_1 by implementing an iterative process which comprises, for each iteration of index i:
A) the supply of N images IM_{n,i}, n varying from 1 to N, respectively to N encoder neural networks RNe_n, including the neural network RNe_1 associated with the optical device C1, as target network, and N-1 ancillary neural networks RNe_2, …, RNe_N, the N images having been previously captured by N optical image capture devices Cn, including a reference optical device corresponding to the optical device C1, having a reference field of vision, and N-1 ancillary optical devices Cn, with n ranging from 2 to N, mounted on the vehicle offset with respect to the reference optical device so as to have respective fields of vision exhibiting an overlap rate with the reference field of vision greater than 70%;
B) the encoding of the supplied image IM_{n,i} into a descriptor code CD_{n,i}, by each encoder neural network RNe_n;
C) for each of the N-1 pairs of images including the image IM_{1,i} and the image IM_{n,i}, with n ranging from 2 to N,
- the calculation of the distance between the two corresponding descriptor codes CD_{1,i} and CD_{n,i},
- the prediction of a class, representative of correspondence or non-correspondence information of the images IM_{1,i} and IM_{n,i} of the pair, by binary classification of said distance, and
- the calculation of an error between the predicted class and a real class given by a previously known target label associated with said pair of images IM_{1,i} and IM_{n,i};
D) the adjustment of the connection weights between neurons of the encoder neural networks RNe_1 and RNe_n, as a function of said error.