EP2062196A1

EP2062196A1 - Method of framing an object in an image and corresponding device

Info

Publication number: EP2062196A1
Application number: EP07823796A
Authority: EP
Inventors: Christophe Garcia; Stefan Duffner
Original assignee: France Telecom SA
Current assignee: Orange SA
Priority date: 2006-09-14
Filing date: 2007-09-10
Publication date: 2009-05-27
Also published as: WO2008031978A1

Abstract

The invention relates to a method of framing an object in an image (I), said object belonging to a category of objects exhibiting common distinctive characteristics, and said method using an artificial neural network previously subjected to a learning phase, characterized in that it comprises the steps of: - locating (b1) said object in said image (I), so as to obtain a first framing (C1) of said object defining a piece (T) of said image, - applying (b2) said piece (T) of the image as input to said neural network, and obtaining as output transformation coefficients making it possible to obtain a second framing (CF) of said object, said learning phase having trained said network, on the basis of pieces of learning images, to provide as output transformation coefficients allowing a reframing.

Description

Method of framing an object in an image and corresponding device

The present invention is in the field of image processing. More specifically, the invention relates to a method of framing an object in an image, using a neural network.

In systems for the automatic recognition of objects in an image, detecting and locating the objects in the image is an essential first step before the recognition phase. This step aims to extract only the parts of the image, or bounding boxes, containing the objects to be recognized. To work properly, these systems require that each extracted object image be well centered, aligned, and at the same scale in a fixed-size input window, in which the characteristic elements of the object must lie at predetermined positions relative to one another. These systems are mostly applied to face recognition. This is why the automatic framing of face images is an important issue in the field of facial analysis.

However, existing automatic face detection and location techniques produce bounding boxes in which the extracted faces are often poorly centered, in both position and scale. In addition, some extracted faces are rotated in the plane of the image with respect to the edges of their bounding boxes. This significantly degrades the performance of automatic face recognition systems that use these existing techniques.

Automatic techniques for framing objects, and in particular faces, follow two distinct approaches. The first approach is to detect, once an object has been detected in an image, the characteristic elements of that object. For example, after a face has been detected, its facial elements such as the eyes, the nose and the mouth are detected. The parameters needed to frame the face, for example translation, rotation and scale factors, are then estimated.

Most facial element detectors rely on an analysis of the chrominance of the face and of the light gradient, as described, for example, in the article by J.-G. Wang and E. Sung, entitled "Morphology-based Front-View Facial Contour Detection" and published in 2000 in Volume 4 of the proceedings of the "Institute of Electrical and Electronic Engineers (IEEE) Conference on Systems, Man, and Cybernetics", or in the article by M. Yang, D. Kriegman and N. Ahuja, entitled "Detecting Faces in Images: A Survey" and published in January 2002 in the journal "IEEE Transactions on Pattern Analysis and Machine Intelligence".

Other facial element detectors perform a correlation search using statistical models of each element, usually constructed by principal component analysis from thumbnail examples of each of the elements to be searched for, as described in the article by B. Moghaddam and A. Pentland, entitled "Probabilistic Visual Learning for Object Representation" and published in July 1997 in the journal "IEEE Transactions on Pattern Analysis and Machine Intelligence".

Still other detectors proceed in two phases: a first detection yields candidate constellations of positions for the facial elements, then the best constellation is selected using a geometric model, which may be deformable. This technique for detecting facial elements is described in the article "Detecting Faces in Images: A Survey" mentioned above.

Finally, a more recent facial element detection technique developed by the applicants seeks a simultaneous localization of the facial elements by using a neural network that has learned to transform, in one pass, a face image into saliency maps whose maxima correspond to the positions of points of interest in the face image provided as input. This technique is detailed in the article "A Connexionist Approach for Robust and Accurate Facial Feature Detection in Complex Scenes", published on the occasion of the conference "Fourth International Symposium on Image and Signal Processing and Analysis (ISPA 2005)", which took place in Zagreb, Croatia.

The second approach to framing objects is to try to locate an object directly in an image by means of a deformable model. Thus the article by D. Cristinacce and T. Cootes, entitled "A Comparison of Shape Constrained Facial Feature Detectors" and published on the occasion of the conference "6th International Conference on Automatic Face and Gesture Recognition 2004", which took place in Seoul, Korea, describes "Active Appearance Models" (AAM). The technique consists in matching an active face model to a face in an image by iterative deformations in position, shape and texture, adapting the parameters of a linear model that combines shape and texture. This active face model is learned from a set of faces on which points of interest are annotated, and from a principal component analysis of the vectors encoding the positions of the points of interest and the luminance textures of the associated faces. Once the matching error between the face model and the face present in the image has been minimized, the parameters of the geometric transformations performed iteratively during the matching phase are retained, such as translations along certain axes, an angle of rotation in the plane, and a scale factor.

These framing techniques have a number of disadvantages. Indeed, the first three types of facial element detectors, which use the chrominance of the face to be located, statistical models or geometric models, are not very robust to noise affecting the image of the face. In particular, the detectors based on chrominance analysis, that is to say those that filter on the "flesh" tint, are particularly sensitive to lighting conditions. In addition, they cannot be applied to gray-scale images. As for detection systems based on statistical or geometric models, these do not withstand extreme lighting conditions, such as over-lighting, under-lighting, or lighting from the side or from below. These systems are also sensitive to poor-quality images, for example low-resolution images from video streams, or previously compressed images.

In addition, these first types of detectors rely on independent detections of facial elements and generally fail to locate a face in an image when some of the facial elements of that face are occluded. This is the case, for example, if the face is partially masked by dark glasses, a beard, or a hand in front of the mouth, or if the image has suffered severe local damage. A failure to detect several elements, or even a single one, is usually not corrected by the subsequent use of a geometric face model. The latter is only used to choose between several candidate positions, which must have been detected in the previous step.

The facial element detector recently developed by the applicants uses a convolutional neural network, which makes it robust to the noise that can affect the images submitted to the detector, and generally makes it possible to overcome partial occlusions of faces in images. However, the face framings obtained with this detector are not entirely insensitive to partial occlusions of faces. In addition, the neural network used by the detector is designed to learn to detect points of interest in an image containing a face, which means that its learning does not focus on the framing of the face itself. The face locations returned by this detector are therefore approximate.

Methods based on active face models, which allow a global search for elements using both shape and texture information, rely on a slow and unstable optimization process that depends on hundreds of parameters, which must be determined iteratively during the search. Moreover, since these statistical models are linear, they are not very robust to global variations of the image, notably variations of lighting. They are also not very robust to partial occlusions of the face. Finally, these face models are designed for the analysis of the faces they have learned and offer little generalization capacity for unknown faces.

It is an object of the present invention to overcome the disadvantages of the prior art by providing a method and apparatus for framing an object in an image using a neural network.

To this end, the invention proposes a method of framing an object in an image, said object belonging to a category of objects having common distinguishing characteristics, and said method using an artificial neural network previously subjected to a learning phase, characterized in that it comprises the steps of:

- locating said object in said image, in order to obtain a first framing of said object defining a piece of said image,

- applying said piece of image to the input of said neural network, and obtaining at the output transformation coefficients making it possible to obtain a second framing of said object, said learning phase having trained said network, from pieces of learning images, to output transformation coefficients enabling a reframing.

Thanks to the invention, object extractions are automatically obtained that are well centered and at the same scale in the corresponding frames resulting from the framing method according to the invention. When the invention is applied to faces, this allows existing automatic face recognition systems to be used optimally. The method according to the invention also makes it possible to improve the performance of other facial analysis systems, such as a facial element detector, by applying to the input of these systems the face images resulting from the framing method according to the invention.

It is further noted that, for obtaining the second framing itself, this method does not depend on any particular method of locating objects in an image. That is why this second framing gives better results than the framing derived from the facial element detector recently developed by the applicants. In addition, the framing method according to the invention does not use a manually parameterized filter, as is frequently done in image processing, which contributes to obtaining a solution that generalizes to all types of faces, unlike techniques using active face models, for example.

According to a preferred characteristic, said neural network is a neural network with heterogeneous layers comprising at least one hidden convolution layer.

This choice of a convolutional neural network makes it possible to obtain a method that performs well and is robust to the noise that can affect the processed images, while minimizing the time required for the learning phase of the neural network. Indeed, using only an MLP ("Multi Layer Perceptron") neural network requires a very large number of connections between neurons and therefore a longer learning time.

According to another preferred feature, said neural network is a neural network with heterogeneous layers comprising two hidden convolution layers between which a sub-sampling layer is interposed.

This choice of neural network architecture improves the performance of the framing method according to the invention, compared with a heterogeneous neural network comprising a single hidden convolution layer.

According to another preferred feature, said neural network comprises six layers including four hidden layers, an input layer and an output layer. This choice of architecture of the neural network is optimal and allows, by its limited number of layers, to reduce the risk of "over-learning" or "learning by heart" of the neural network.

According to another preferred characteristic, the location step uses a heterogeneous layer neural network comprising at least one hidden convolution layer.

The use of a convolutional neural network for the localization step gives the framing method according to the invention all the advantages of a robust and efficient localization method. In particular, the framing method according to the invention is thus more robust to all the noise that can affect the image, such as poor resolution or significant variations in illumination and contrast. It is also effective for framing, for example, faces in various poses, rotated in the plane of the image or non-frontal. The method is also effective when used on faces with various facial expressions, or which contain occluding elements such as glasses or a beard.

According to another preferred characteristic, said transformation coefficients at the output of the neural network comprise:

- a translation coefficient along a first axis of said first frame,

- a translation coefficient along a second axis of said first frame,

- a coefficient of rotation with respect to the center of gravity of said first frame,

- and a scaling coefficient.

Thus the second frame obtained in the application step, taking into account all these coefficients, is a frame in which the previously located object is centered and at a predetermined scale, and is also always oriented in the same way in the plane. This facilitates the recognition of certain objects, for example the recognition of a face that is rotated in the plane of the image.

The invention also relates to a device for framing an object in an image, said object belonging to a category of objects having common distinguishing characteristics, and implementing the framing method according to the invention.

The invention also relates to a computer program comprising instructions for implementing the framing method according to the invention, when it is executed on a computer.

The device for framing an object in an image, as well as the computer program, have advantages similar to those of the method according to the invention.

Other features and advantages will appear on reading a preferred embodiment described with reference to the figures in which:

FIG. 1 represents a network of neurons used by the method according to the invention,

FIG. 2 represents different phases to which this network of neurons is subjected,

FIG. 3 represents computer equipment implementing the method according to the invention,

FIG. 4 represents various steps of a phase of use of the neural network,

FIG. 5 represents framings in an image, obtained during this phase of use,

FIG. 6 represents a piece of image resulting from one of the preceding framings,

FIG. 7 represents an enlargement of a central part of this piece of image,

FIG. 8 represents the structure of an artificial neuron,

FIG. 9 represents the different steps of a learning phase to which the neural network used by the method according to the invention is subjected.

According to a preferred embodiment of the invention, the method of framing objects in an image uses a neural network RES shown in FIG. 1. This neural network is composed of several heterogeneous layers, containing both convolutional layers and more conventional layers of the kind used in MLP neural networks. In this embodiment, the neural network RES has six layers of neurons: an input layer E, a first hidden convolution layer C₁, a second hidden sub-sampling layer S₂, a third hidden convolution layer C₃, a fourth hidden layer N₄ of MLP-type neurons, and a final output layer S. It is possible to use more hidden layers, but a large number of hidden layers makes the object framing process implemented by the neural network too complex: the neural network then risks learning noise, a problem called "over-learning".

In a variant, the neural network RES is an MLP neural network. This variant requires a longer learning phase than the preferred embodiment of the invention, since the number of connections between neurons is then much higher.

In addition, in the main embodiment, the images processed are gray-scale pixel images, because the colors of the objects to be framed are not used. This is why the input layer E has as many neurons as there are pixels in the image applied at the input of the neural network RES, bias excluded.

Alternatively, the images processed are color pixel images, coded according to the Red/Green/Blue (RGB) color coding system. In this variant, the input layer E has three times as many neurons as there are pixels in the image applied at the input of the neural network RES, bias excluded.

In another variant, the images processed have their colors encoded according to other color coding systems, for example the Hue/Saturation/Value (HSV) coding system, the chrominance systems of the International Commission on Illumination (CIE) L*a*b* and Luv, or the systems used in television standards such as YUV, YIQ and YCbCr. The number of neurons in the input layer E is then equal to the number of dimensions used by the chosen color coding system, multiplied by the number of color points contained in the image applied at the input of the neural network RES, bias excluded.

The detailed operation of each of the layers of the neural network RES will be described later.

The neural network RES is subjected to a learning phase φ1 prior to its use by the framing method according to the invention during the use phase φ2, both phases being shown in FIG. 2.

The method according to the invention is typically implemented in software in a computer ORD, represented in FIG. 3. The computer ORD implements, for example, the learning phase φ1 in a learning module MAP, and the use phase φ2 in a framing module MC. Each of these modules implements the neural network RES. The framing module MC also implements a method of locating faces in an image I.

The learning phase φ1, detailed below, trains the neural network RES so that, from a piece of image defining a first framing of an object and applied as input to the neural network, it outputs transformation coefficients making it possible to obtain a second framing of this object in the complete image associated with the piece of image. At the end of this learning phase, values of the weights W and biases B of the neural network are obtained that make it possible to produce such coefficients. The neural network RES is then ready to be used during the use phase φ2 to provide, from an image I containing a face, a framing CF of this face according to the invention.

In this embodiment, the learning phase is performed from a database BDD of learning images, these images containing faces, because the framing method is used in this embodiment to frame faces. However, the method of framing objects according to the invention can be used to frame any other type of object having common distinguishing characteristics, for example to frame cars in an image. In that other example, the learning phase φ1 must then train the neural network RES on images containing cars.

The principle of the object framing method according to the invention is now described with reference to FIGS. 4 to 7.

Once the preliminary learning phase φ1 has been performed, the neural network RES enters the use phase φ2, in which it is operational for framing faces present in gray-scale pixel images. In this use phase φ2, using the neural network RES to frame a face in an image comprises three steps b1 to b3, shown in FIG. 4.

The first step b1 is a step of locating faces in an image I. The image I is subjected to a face localization method, which gives approximate locations of the faces present in the image I in the form of bounding boxes. It is assumed here that the image I contains only one face. Thus, at the output of the localization method, a bounding box defining a first frame C1 of the face, represented in FIG. 5, is obtained.

Several localization methods can be used in this step b1, for example filtering on skin tint or a principal component analysis of the image I. In this embodiment, the localization method used is the one described in the article by C. Garcia and M. Delakis, entitled "Convolutional Face Finder: a Neural Architecture for Fast and Robust Face Detection" and published in the journal "IEEE Transactions on Pattern Analysis and Machine Intelligence" in November 2004. This localization method also uses a convolutional neural network. It robustly locates faces of at least twenty pixels by twenty pixels that are rotated in the plane by between -30 and +30 degrees relative to a face that would be vertical in the image, and that are turned partially in profile by between -60 and +60 degrees relative to a fully frontal face. This localization method is also effective in scenes with complex backgrounds and variable lighting, and on partially occluded faces. The choice of this localization method increases the robustness of the framing method according to the invention when framing faces that are rotated in the plane or in profile, partially occluded, or in scenes with unfavorable conditions. However, this localization method only yields bounding boxes that are upright in the image around the faces to be located.

The bounding box obtained by the localization method is then extracted from the image I and resized to the input size of the neural network RES, represented in FIG. 1, that is to say to a height H of 56 pixels and a width L of 46 pixels. These values are chosen so as to allow the framing method according to the invention to operate with most of the images applied at the input of the neural network RES.

The bounding box thus extracted and resized forms a piece of image T, represented in FIG. 6, applicable to the input of the neural network RES.
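By way of illustration only, the extraction and resizing of step b1 could be sketched in Python as follows. The description does not say which interpolation is used; nearest-neighbor sampling, the (x0, y0, w, h) box convention and the name crop_and_resize are assumptions made for this example.

```python
def crop_and_resize(image, box, out_h=56, out_w=46):
    """Crop `box` from a gray-scale image (list of rows) and resize it.

    box is (x0, y0, w, h) in pixels: an assumed convention.
    Nearest-neighbor sampling is an assumption, not taken from the text.
    """
    x0, y0, w, h = box
    piece = []
    for i in range(out_h):
        # Source row inside the bounding box for output row i.
        src_y = y0 + min(h - 1, int(i * h / out_h))
        row = []
        for j in range(out_w):
            # Source column inside the bounding box for output column j.
            src_x = x0 + min(w - 1, int(j * w / out_w))
            row.append(image[src_y][src_x])
        piece.append(row)
    return piece
```

The resize factors (h / 56 and w / 46 in this sketch) would have to be kept so that coefficients expressed relative to the piece T can later be brought back to the scale of the image I.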

The second step b2 is a step of applying the image piece T to the input E of the neural network RES. At the output S of the neural network, transformation coefficients Trx, Try, αr and Src are obtained, given by the four neurons of the output layer S. The values of these coefficients are reduced values between -1 and 1, and must be rescaled to the image piece T applied to the input of the neural network RES in order to obtain the corresponding non-reduced transformation coefficients Tx, Ty, α and Sc. This is done with the inverse of the formula used in the learning phase to obtain the reduced values Trx, Try, αr and Src, as detailed later in relation to that phase. It is also necessary to take into account the resizing carried out in step b1, in order to obtain transformation coefficient values at the real scale of the image I.

The last step b3 is a step of reframing the face in the image I. Assuming for simplicity that the transformation coefficients Tx, Ty, α and Sc at the output of the neural network RES are at the real scale of the image I, the following operations are performed, as shown in FIG. 5 and in the enlargement V of FIG. 7:

- a rotation r of -α degrees of the first frame C1 about the center λ of the first frame C1,

- a translation t₁ of -Tx pixels of the first frame C1 along a horizontal axis AX of the image I,

- a translation t₂ of -Ty pixels of the first frame C1 along a vertical axis AY of the image I,

- a scaling e of the first frame C1, multiplying its dimensions by a factor 1/Sc.

At the end of this last step b3, a second frame CF of center μ is obtained. For most of the tests carried out with the framing method according to the invention, the face located in step b1 in the image I is found to be better centered and scaled in the second frame than in the first frame. In addition, the second frame follows the orientation of the face, unlike the first frame.
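The four operations of step b3 can be sketched in Python as below. This is a minimal illustration under stated assumptions: a frame is represented by its center, size and in-plane angle (the Frame class and the function names are hypothetical), the rotation about the frame's own center leaves that center unchanged, the scaling is taken about the center, and the sign convention of the angle in image coordinates is assumed.

```python
import math
from dataclasses import dataclass


@dataclass
class Frame:
    cx: float               # center x, in pixels of image I
    cy: float               # center y, in pixels of image I
    width: float
    height: float
    angle_deg: float = 0.0  # in-plane rotation of the frame


def reframe(c1, tx, ty, alpha_deg, sc):
    """Derive the second frame CF from the first frame C1 (step b3)."""
    return Frame(
        cx=c1.cx - tx,                        # translation of -Tx along AX
        cy=c1.cy - ty,                        # translation of -Ty along AY
        width=c1.width / sc,                  # scaling by 1/Sc
        height=c1.height / sc,
        angle_deg=c1.angle_deg - alpha_deg,   # rotation of -alpha
    )


def corners(frame):
    """Corner points of a frame, e.g. for drawing or cropping."""
    a = math.radians(frame.angle_deg)
    cos_a, sin_a = math.cos(a), math.sin(a)
    hw, hh = frame.width / 2, frame.height / 2
    return [
        (frame.cx + dx * cos_a - dy * sin_a, frame.cy + dx * sin_a + dy * cos_a)
        for dx, dy in ((-hw, -hh), (hw, -hh), (hw, hh), (-hw, hh))
    ]
```

For instance, reframe(Frame(100, 80, 46, 56), tx=4, ty=-2, alpha_deg=10, sc=2) yields a frame centered at (96, 82), half the size of C1 and rotated by -10 degrees.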

Detailed operation of the RES neural network is now described in connection with FIGS. 1 and 8.

The input layer E of the neural network RES is designed to receive a gray-scale image of height H equal to 56 pixels and width L equal to 46 pixels. It therefore contains a matrix of 46 * 56 neurons whose input values $e_{ij}$ are defined as follows:

$$e_{ij} = \frac{p_{ij} - 128}{128}$$

where $e_{ij}$ is the input value of the neuron of the input layer E corresponding to the value $p_{ij}$ of a pixel of the image applied to the input of the neural network. This pixel is coded in gray scale on a scale of values ranging from 0 to 255. The indices i and j are respectively the row and column indices of the matrix of 46 * 56 neurons.

The input values $e_{ij}$ of the neural network RES are therefore between -1 and 1. It should be noted that the neurons of the input layer are not real neurons, in the sense that their output values are the same as their input values.
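As a small illustration, the normalization of the input layer can be written in Python as follows, assuming the mapping e = (p - 128) / 128, which sends gray levels 0 to 255 into the interval between -1 and 1. The function name is chosen for this example only.

```python
def normalise_pixels(image):
    """Map gray levels 0..255 to input values between -1 and 1.

    Assumes the mapping e = (p - 128) / 128 for each pixel p.
    `image` is a list of rows of integer gray levels.
    """
    return [[(p - 128) / 128 for p in row] for row in image]
```

With this mapping, a black pixel (0) gives -1, a mid-gray pixel (128) gives 0, and a white pixel (255) gives 127/128.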

The other neurons of the neural network RES operate in a conventional manner, as shown in FIG. 8, which shows a neuron $n_s$ of a layer of the network, connected to neurons $n_{e1}$, $n_{e2}$, $n_{e3}$ to $n_{en}$ of a previous layer, with respective output values $x_1$, $x_2$, $x_3$ to $x_n$. The neuron $n_s$ is connected to the neurons $n_{e1}$, $n_{e2}$, $n_{e3}$ to $n_{en}$ by as many links called synapses, which are associated with weights $w_1$, $w_2$, $w_3$ to $w_n$. The neuron $n_s$ also has a bias of value $b_0$. The bias and weight values are learned by the neural network during the learning phase φ1. The output value y of the neuron $n_s$ is derived from the output values of the neurons $n_{e1}$, $n_{e2}$, $n_{e3}$ to $n_{en}$ after passing through a summing function Σ and an activation function Φ, in the following way:

$$y = \Phi\left(b_0 + \sum_{p=1}^{n} w_p x_p\right)$$

where p is an index varying from 1 to n, n being the number of neurons of the layer preceding the neuron $n_s$, $w_p$ is the weight of the synapse between the neuron $n_{ep}$ and the neuron $n_s$, $x_p$ is the output value of the neuron $n_{ep}$, $b_0$ is the value of the bias associated with the neuron $n_s$, and Φ is the activation function of the neuron $n_s$.

The neurons of the layers C₁ and C₃ have a linear activation function, defined by the equation:

$$\Phi(x) = x$$

where x is the variable of the activation function Φ.

The neurons of the layers S₂, N₄ and S have a sigmoid activation function, defined by the equation:

$$\Phi(x) = \tanh(x)$$

where x is the variable of the activation function Φ, and tanh is the hyperbolic tangent function.
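The behavior of a single neuron with either activation function can be sketched in Python as follows. This is only an illustration of a weighted sum plus bias passed through an activation; the function name and argument layout are assumptions for the example.

```python
import math


def neuron_output(inputs, weights, bias, activation=math.tanh):
    """Output of one neuron: activation(bias + sum of weight * input).

    `activation` defaults to tanh (layers S2, N4 and S in the text);
    pass `lambda x: x` for the linear layers C1 and C3.
    """
    if len(inputs) != len(weights):
        raise ValueError("one weight is needed per input")
    total = bias + sum(w * x for w, x in zip(weights, inputs))
    return activation(total)
```

Because tanh is bounded, the outputs of the layers S₂, N₄ and S always lie strictly between -1 and 1, which matches the reduced range of the output coefficients.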

The first hidden convolution layer C₁ consists of 40 maps of 40 * 50 neurons, corresponding to 40 images resulting from the convolution of the image applied at the input with 40 convolution kernels, which are matrices of 7 * 7 weight values. That is to say, each neuron of the layer C₁ is connected to only 7 * 7 neurons of the layer E, and not to all the neurons of the layer E as would be the case if the layer C₁ were an MLP-type layer. Moreover, the 7 * 7 synapse weights of this connection and a single bias are shared by all the neurons of a given map of the layer C₁. For example, in FIG. 1, which for simplicity shows only four maps of 40 * 50 neurons in the layer C₁, the neuron $c_{kl}$ of the map C₁₁ is connected to all the neurons of the square W of 7 * 7 neurons of the layer E. More precisely, the output value of the neuron $c_{kl}$ of the map C₁₁ is given by the formula:

$$y_{kl} = b_{11} + \sum_{u=0}^{6} \sum_{v=0}^{6} w_{11}(u,v)\, e_{k+u,\,l+v}$$

where

- $y_{kl}$ is the output value of the neuron $c_{kl}$, the indices k and l being the row and column indices of the neuron $c_{kl}$ in the map C₁₁,

- $w_{11}(u,v)$ is the weight value located at row index u and column index v of the 7 * 7 matrix forming the convolution kernel associated with the map C₁₁, the indices u and v being integers varying from 0 to 6,

- $e_{k+u,\,l+v}$ is the input value of the neuron located at row index k+u and column index l+v of the matrix of 56 * 46 neurons of the input layer E,

- and $b_{11}$ is the bias shared by all the neurons of the map C₁₁.

The first hidden convolution layer C₁ acts as a detector of certain low-level shapes in the input map, such as corners or oriented contrast lines. The 40 neuron maps of the layer C₁ have a reduced height H₁ = H - 7 + 1 and a reduced width L₁ = L - 7 + 1, in order to prevent the edge effects of the convolution.
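The computation of one map of the layer C₁ can be sketched in plain Python as follows. This is an illustrative sketch, assuming each output is the shared bias plus the sum of kernel weight times input over a 7 * 7 window, with no kernel flip and a linear activation; the function name conv_valid is chosen for the example.

```python
def conv_valid(image, kernel, bias=0.0):
    """'Valid' convolution of a 2-D map with one shared kernel and bias.

    image and kernel are lists of rows. No padding is applied, so an
    H x L input and a K x K kernel give (H - K + 1) x (L - K + 1) outputs.
    The activation is linear, so the weighted sum is returned directly.
    """
    h, w = len(image), len(image[0])
    kh, kw = len(kernel), len(kernel[0])
    return [
        [
            bias + sum(
                kernel[u][v] * image[k + u][l + v]
                for u in range(kh)
                for v in range(kw)
            )
            for l in range(w - kw + 1)
        ]
        for k in range(h - kh + 1)
    ]
```

Applied to a 56 * 46 input (rows by columns) with a 7 * 7 kernel, this returns 50 rows of 40 values, which corresponds to the map size H₁ = 50, L₁ = 40 given in the text.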

As a variant, if the images used are coded for example according to the RGB system, the input layer E is composed of three maps of 56 * 46 neurons, each of them coding one color variable of the image applied at the input of the neural network RES. Each of these input maps is connected to the maps of the layer C₁ in the same way as in the main embodiment of the invention, in which the input layer E has only one map of 56 * 46 neurons. In other words, each of the neurons of the layer C₁ is connected to three squares of 7 * 7 neurons of the layer E. The operation of the other layers is then identical to that of the main embodiment, which is considered again from here on.

The second hidden layer S₂, a sub-sampling layer, consists of 40 maps of 20 * 25 neurons, corresponding to 40 images resulting from the sub-sampling of the 40 output images of the 40 neuron maps of the layer C₁. This sub-sampling is done by connecting each neuron of the layer S₂ to four neurons of the layer C₁. The maps of the layer S₂ therefore have a height H₂ = H₁ / 2 and a width L₂ = L₁ / 2. The weight of the synapses corresponding to these connections is identical for all the neurons of a given map of the layer S₂. All the neurons of a map of the layer S₂ also share a bias. For example, in FIG. 1, which shows only four maps of 20 * 25 neurons in the layer S₂, the neuron $s_{mn}$ of the map S₂₂ is connected to all the neurons of the square $F_{mn}$ of 2 * 2 neurons of the layer C₁. More precisely, the output value of the neuron $s_{mn}$ of the map S₂₂ is given by the formula:

$$y_{mn} = \tanh\left(b_{22} + w_{22} \sum_{u=0}^{1} \sum_{v=0}^{1} y^{12}_{2m+u,\,2n+v}\right)$$

where

- $y_{mn}$ is the output value of the neuron $s_{mn}$, the indices m and n being the row and column indices of the neuron $s_{mn}$ in the map S₂₂,

- $y^{12}_{2m+u,\,2n+v}$ is the output value of the neuron located at row index 2m+u and column index 2n+v of the map C₁₂ of 40 * 50 neurons of the layer C₁, the indices u and v being integers varying from 0 to 1 over the 2 * 2 square,

- $w_{22}$ is the value of the shared weight of the synapses connecting the neurons of the map C₁₂ with the neurons of the map S₂₂,

- tanh is the hyperbolic tangent function,

- and $b_{22}$ is the bias shared by all the neurons of the map S₂₂.
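A sub-sampling map of this kind can be sketched in Python as below. The sketch assumes that the four inputs of each 2 * 2 square are summed, multiplied by the single shared weight, offset by the shared bias and passed through tanh; whether the original sums or averages the four values is an assumption here, as is the function name.

```python
import math


def subsample(feature_map, weight, bias):
    """2 x 2 sub-sampling with one shared weight and one shared bias.

    Each output neuron covers a non-overlapping 2 x 2 square of the input
    map, so the output has half the rows and half the columns.
    """
    rows, cols = len(feature_map) // 2, len(feature_map[0]) // 2
    return [
        [
            math.tanh(
                bias
                + weight * (
                    feature_map[2 * m][2 * n]
                    + feature_map[2 * m][2 * n + 1]
                    + feature_map[2 * m + 1][2 * n]
                    + feature_map[2 * m + 1][2 * n + 1]
                )
            )
            for n in range(cols)
        ]
        for m in range(rows)
    ]
```

Applied to a map of 50 rows by 40 columns, the function returns 25 rows of 20 values, consistent with H₂ = H₁ / 2 and L₂ = L₁ / 2.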

The third hidden layer C₃, a convolution layer, consists of 39 maps of 16 * 21 neurons, corresponding to 39 images each resulting from the sum of the convolutions of two images. These two images are the outputs of two maps of 20 * 25 neurons of the S₂ layer, each convolved with its own convolution kernel formed of a matrix of 5 * 5 weight values. In other words, each neuron of the layer C₃ is connected to two squares of 5 * 5 neurons of the layer S₂. In addition, the neurons of a given map of the layer C₃ all share the same bias. For example, in FIG. 1, which for simplicity shows only three maps of 16 * 21 neurons in the layer C₃, the neuron z_qr of the map C₃₁ is connected to all the pixels of the squares G_qr and H_qr of neurons of the layer S₂. More specifically, the output value of the neuron z_qr of the map C₃₁ is given by the formula:

$$y_{qr} = \sum_{u=0}^{4} \sum_{v=0}^{4} w_{311}(u,v)\, s^{21}_{q+u,\,r+v} + \sum_{u=0}^{4} \sum_{v=0}^{4} w_{312}(u,v)\, s^{22}_{q+u,\,r+v} + b_{31}$$

where:

- y_qr is the output value of the neuron z_qr, the indices q and r being the row and column indices of the neuron z_qr in the map C₃₁,
- w₃₁₁(u, v) is the weight value located at row index u and column index v of the 5 * 5 matrix forming the convolution kernel associated with the convolution of the map C₃₁ of the layer C₃ with the map S₂₁ of the layer S₂, the indices u and v being integers varying from 0 to 4,
- w₃₁₂(u, v) is the weight value located at row index u and column index v of the 5 * 5 matrix forming the convolution kernel associated with the convolution of the map C₃₁ of the layer C₃ with the map S₂₂ of the layer S₂, the indices u and v being integers varying from 0 to 4,
- s²¹_(q+u, r+v) is the output value of the neuron located at row index q + u and column index r + v of the map S₂₁ of 20 * 25 neurons of the layer S₂,
- s²²_(q+u, r+v) is the output value of the neuron located at row index q + u and column index r + v of the map S₂₂ of 20 * 25 neurons of the layer S₂,
- and b₃₁ is the bias shared by all the neurons of the map C₃₁.
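As an illustration of how one C₃ map combines two S₂ maps, the following sketch sums two convolutions without padding and adds a shared bias. It assumes that no activation function is applied in this layer (none is named in the definitions above) and that a map of 20 * 25 neurons has 20 rows and 25 columns. The function names and the constant test values are hypothetical.

```python
def conv_valid(feature_map, kernel):
    """Convolution without padding, written as in the text:
    out[q][r] = sum over u, v of kernel[u][v] * feature_map[q+u][r+v]."""
    k = len(kernel)
    rows = len(feature_map) - k + 1
    cols = len(feature_map[0]) - k + 1
    return [[sum(kernel[u][v] * feature_map[q + u][r + v]
                 for u in range(k) for v in range(k))
             for r in range(cols)]
            for q in range(rows)]

def c3_map(s_a, s_b, kernel_a, kernel_b, bias):
    """One C3 map: sum of two convolved S2 maps plus a shared bias."""
    conv_a = conv_valid(s_a, kernel_a)
    conv_b = conv_valid(s_b, kernel_b)
    return [[a + b + bias for a, b in zip(row_a, row_b)]
            for row_a, row_b in zip(conv_a, conv_b)]

# Two constant 20x25 maps and two constant 5x5 kernels (arbitrary values).
s21 = [[1.0] * 25 for _ in range(20)]
s22 = [[2.0] * 25 for _ in range(20)]
k311 = [[0.01] * 5 for _ in range(5)]
k312 = [[0.02] * 5 for _ in range(5)]
out = c3_map(s21, s22, k311, k312, bias=0.5)
print(len(out), len(out[0]))
```

The output map has 20 - 5 + 1 rows and 25 - 5 + 1 columns, which matches the 16 * 21 maps of the layer C₃.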

The third hidden layer C₃ makes it possible to extract higher-level characteristics of the input image than the layer C₁, by combining the features extracted by the maps of the previous layers.

The 39 neuron maps of layer C₃ are of reduced height H₃ = H₂ - 5 + 1 and of reduced width L₃ = L₂ - 5 + 1, in order to avoid the edge effects of the convolution.

The fourth hidden layer N₄ corresponds to a layer of neurons of a conventional MLP (multilayer perceptron) network. It contains 39 neurons, each having its own bias and being connected to every neuron of the C₃ layer. The synapses corresponding to these connections each have their own weight.

Finally, the output layer S contains one neuron per transformation coefficient, that is, in this embodiment, four neurons giving the transformation coefficients Tᵣx, Tᵣy, αᵣ and Sᵣc. Each neuron of the S layer is connected to all the neurons of the N₄ layer and has its own bias. Likewise, the synapses corresponding to these connections each have their own weight.
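The map sizes and the number of trainable parameters implied by this description can be checked with a short calculation. The sketch below covers only the layers S₂, C₃, N₄ and S, whose connections are fully specified in this passage. The variable names are illustrative.

```python
def valid_conv_size(size, kernel):
    """Side length of a map after a convolution without padding."""
    return size - kernel + 1

# Map sizes quoted in the text (written there as "a * b").
c1_size = (40, 50)
s2_size = (c1_size[0] // 2, c1_size[1] // 2)             # 2x2 subsampling
c3_size = tuple(valid_conv_size(s, 5) for s in s2_size)  # 5x5 kernels

# Trainable parameters per layer, counted from the connections described.
params = {
    # One shared weight and one shared bias per map.
    "S2": 40 * (1 + 1),
    # Two 5x5 kernels and one shared bias per map.
    "C3": 39 * (2 * 5 * 5 + 1),
    # Each neuron sees every C3 neuron and has its own bias.
    "N4": 39 * (39 * c3_size[0] * c3_size[1] + 1),
    # Each output neuron sees every N4 neuron and has its own bias.
    "S": 4 * (39 + 1),
}
print(s2_size, c3_size, params)
```

Almost all of these parameters sit in the fully connected layer N₄, while the weight sharing of the layers S₂ and C₃ keeps their parameter counts small.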

This neural architecture therefore acts as a cascade of filters making it possible to estimate, for a face image applied at the input of the neural network RES, numerical values at the output layer corresponding to the transformation coefficients Tᵣx, Tᵣy, αᵣ and Sᵣc.

It should be noted that the numbers of neurons or neuron maps in each layer of the neural network RES, as well as the convolution kernel sizes, chosen in this embodiment correspond to optimal operation of the framing method according to the invention. However, other choices also allow the framing method according to the invention to operate satisfactorily.

The learning phase φ1 is now detailed in relation with FIG. 9.

The learning phase requires the prior creation of a database BDD of learning images that contain faces. This is done by taking greyscale photographs from image databases and manually extracting bounding boxes, each containing a well-centered, well-oriented face at the same scale within its box. The bounding boxes are then resized together with their respective source images so that each bounding box has the height H and width L of the input of the neural network RES. In this embodiment of the method according to the invention, 1,500 bounding boxes containing faces of various appearances are extracted in this way.

These resized bounding boxes are then subjected to one or more geometric transformations within their resized source images, among which:

- a horizontal translation Tx varying from 6 pixels to the left to 6 pixels to the right,
- a vertical translation Ty varying from 6 pixels upwards to 6 pixels downwards,
- a rotation about the center of the image by an angle α varying from -30 degrees to +30 degrees,
- a zoom out or zoom in by a factor Sc ranging from 90% to 110% of the face size.
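One way to produce such badly framed examples is to draw the four coefficients at random and move the corners of a well-framed bounding box accordingly. The sketch below rests on several assumptions that the text does not state: uniform sampling, scaling and rotation about the box center followed by translation, and a hypothetical box of 48 by 60 pixels. Scaling the box itself is only one possible reading of the zoom factor.

```python
import math
import random

# Ranges from the list above; Sc is expressed as a fraction.
RANGES = {"Tx": (-6.0, 6.0), "Ty": (-6.0, 6.0),
          "alpha": (-30.0, 30.0), "Sc": (0.9, 1.1)}

def sample_coefficients(rng):
    """Draw one set of coefficients (uniform sampling is an assumption)."""
    return {name: rng.uniform(low, high)
            for name, (low, high) in RANGES.items()}

def transform_box(corners, center, coeffs):
    """Scale and rotate the corners about the center, then translate.

    The order of the operations is an assumption of this sketch.
    """
    cx, cy = center
    angle = math.radians(coeffs["alpha"])
    cos_a, sin_a = math.cos(angle), math.sin(angle)
    moved = []
    for x, y in corners:
        dx = (x - cx) * coeffs["Sc"]
        dy = (y - cy) * coeffs["Sc"]
        moved.append((cx + dx * cos_a - dy * sin_a + coeffs["Tx"],
                      cy + dx * sin_a + dy * cos_a + coeffs["Ty"]))
    return moved

# Hypothetical well-framed box of 48 x 60 pixels.
box = [(0.0, 0.0), (48.0, 0.0), (48.0, 60.0), (0.0, 60.0)]
coeffs = sample_coefficients(random.Random(0))
# A learning example pairs the badly framed box with its coefficients.
annotated_example = (transform_box(box, (24.0, 30.0), coeffs), coeffs)
```

Keeping the coefficients alongside each transformed box is what later allows them to serve as desired outputs during learning.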

This gives a database BDD of learning images containing two million face images, or bounding boxes, each corresponding to a poor framing. Each face image thus obtained is annotated with the values of the transformation coefficients that were applied to obtain it.

The learning phase φ1 then consists in using this database BDD of learning images to train the neural network RES to output transformation coefficients corresponding to the transformations to be performed on a well-framed bounding box in order to obtain the badly framed bounding box corresponding to the image applied at the input of the neural network. In practice, only a subset of 30,000 images, taken at random from the two million face images previously obtained, is used.

For this, the biases and synaptic weights of the neural network are initialized randomly to small values between 0 and 1, but different from zero. In a first step a1, a subset of 1000 images is selected at random from the 30,000 learning images selected at the start. This selection serves as the basis for one iteration of the gradient backpropagation algorithm, which converges to a stable solution after about 200 iterations. An iteration consists in executing steps a1 to a4; that is, steps a1 to a4 of FIG. 9 are repeated 200 times. The gradient backpropagation algorithm used during the learning phase is known to those skilled in the art.
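The outer structure of this learning phase can be sketched as a simple loop. The function `run_learning_phase` and its callback `train_on_image`, which stands for steps a2 to a4 applied to one image, are hypothetical names introduced for illustration. Each selected image is presented once per iteration, as the description of steps a2 to a4 specifies.

```python
import random

def run_learning_phase(n_training_images, subset_size, n_iterations,
                       train_on_image, rng):
    """Repeat steps a1 to a4 for a fixed number of iterations.

    train_on_image(index) is a hypothetical callback standing for
    steps a2 to a4 (presentation, propagation, backpropagation).
    """
    for _ in range(n_iterations):
        # Step a1: draw a random subset of the learning images.
        # rng.sample returns distinct indices in random order, so each
        # image of the subset is presented exactly once.
        subset = rng.sample(range(n_training_images), subset_size)
        for index in subset:
            train_on_image(index)

# Record the presentations instead of training a real network.
seen = []
run_learning_phase(30000, 1000, 200, seen.append, random.Random(0))
print(len(seen))
```

With the figures given in the text, the network therefore sees 200 times 1000 image presentations in total.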

In a second step a2, an image drawn at random from the subset of 1000 images previously selected is presented at the input of the neural network RES. The desired values, which are the reduced values Tᵣx, Tᵣy, αᵣ and Sᵣc corresponding to the transformation coefficients Tx, Ty, α and Sc annotated on this image, are also applied at the output of the neural network. The transformation coefficients are reduced as follows, so that these desired values lie between -1 and 1:

$$D_m = 2\,\frac{P_m - P_{m,Min}}{P_{m,Max} - P_{m,Min}} - 1$$

where:

- D_m is a desired value equal to the reduced value Tᵣx, Tᵣy, αᵣ or Sᵣc of a transformation coefficient,
- P_m is the non-reduced transformation coefficient Tx, Ty, α or Sc corresponding to the desired value D_m,
- P_m,Min is the minimum allowed for the corresponding parameter P_m, i.e. -6 pixels for Tx or Ty, -30 degrees for α or 90% for Sc,
- and P_m,Max is the maximum allowed for the corresponding parameter P_m, i.e. 6 pixels for Tx or Ty, 30 degrees for α or 110% for Sc.

In a third step a3, the image applied at the input is propagated forward through the neural network RES, and the output responses of the neurons of the network RES are obtained, which makes it possible to apply the gradient backpropagation algorithm.

In a fourth step a4, backpropagation through the neural network RES is carried out, making it possible to update the synaptic weights and the biases of the network RES. For example, in this gradient backpropagation algorithm, the following parameters are used:

- a learning step of 0.003 for the neurons of the layers C₁ and S₂,
- a learning step of 0.002 for the neurons of the C₃ layer,
- a learning step of 0.0005 for the neurons of the N₄ layer,
- a learning step of 0.0001 for the neurons of the layer S,
- and a momentum of 0.2 for all the neurons of the RES network.

The purpose of this gradient backpropagation algorithm is conventionally to minimize the following objective function:

$$O = \sum_{k=1}^{1000} \sum_{m=1}^{4} \left( D_m - S_m \right)^2$$

where:

- O is the objective function to be minimized over all of the 1000 images presented at the input of the neural network RES during an iteration, each image presentation corresponding to the summation index k in this formula, and over the set of 4 transformation coefficients at the output of the neural network RES, each represented by the summation index m varying from 1 to 4,
- D_m is a desired value applied at the output of the neural network, corresponding to one of the transformation coefficients Tᵣx, Tᵣy, αᵣ and Sᵣc,
- and S_m is the corresponding value obtained at the output of the neural network after the propagation phase.

During an iteration, steps a2 to a4 are repeated cyclically over all of the 1000 images selected in step a1, with the difference that, at the second pass through step a2, an image is selected at random from the 999 images not yet applied to the input of the neural network RES; at the third pass through step a2, an image is selected at random from the 998 images not yet applied to the input of the neural network RES; and so on.
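The quantities used in steps a2 to a4 can be illustrated as follows. This is a sketch under stated assumptions rather than the exact method: the reduction to desired values is taken to be a linear rescaling of the range from minimum to maximum onto the range from -1 to 1, the objective is written as a plain sum of squared errors without any normalizing constant, and the weight update uses the classical momentum form, which the text does not spell out. All function and constant names are hypothetical.

```python
# Allowed range of each coefficient, from the text (Sc in percent).
BOUNDS = {"Tx": (-6.0, 6.0), "Ty": (-6.0, 6.0),
          "alpha": (-30.0, 30.0), "Sc": (90.0, 110.0)}
ORDER = ("Tx", "Ty", "alpha", "Sc")

# Learning steps per layer and momentum, from the list above.
LEARNING_STEP = {"C1": 0.003, "S2": 0.003, "C3": 0.002,
                 "N4": 0.0005, "S": 0.0001}
MOMENTUM = 0.2

def desired_values(coeffs):
    """Rescale each coefficient linearly from [min, max] to [-1, 1]."""
    return [2.0 * (coeffs[name] - BOUNDS[name][0])
            / (BOUNDS[name][1] - BOUNDS[name][0]) - 1.0
            for name in ORDER]

def objective(desired, outputs):
    """Sum of squared errors over images (k) and coefficients (m)."""
    return sum((d - s) ** 2
               for d_vec, s_vec in zip(desired, outputs)
               for d, s in zip(d_vec, s_vec))

def momentum_update(weight, gradient, previous_delta, layer):
    """One weight update with momentum (assumed classical form)."""
    delta = -LEARNING_STEP[layer] * gradient + MOMENTUM * previous_delta
    return weight + delta, delta

targets = desired_values({"Tx": -6.0, "Ty": 6.0, "alpha": 0.0, "Sc": 100.0})
print(targets)
```

The smaller learning steps for the layers N₄ and S mean that, for the same gradient, the fully connected layers move more slowly than the convolution and subsampling layers.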

It should be noted that other alternative embodiments of the method according to the invention can be envisaged, with transformation coefficients defined differently. For example, in a variant, the learning phase trains the neural network RES to provide, from a bounding box poorly framed on a face, transformation coefficients that make it possible to perform transformations on this bounding box in order to arrive at a bounding box well framed on this face. In this variant, the transformation coefficients obtained at the output of the neural network RES do not need to be inverted during the use phase.
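To make concrete what inverting the coefficients involves when the network has been trained on the degrading transformations, the sketch below first maps the network outputs back from the range -1 to 1 to their physical ranges, then takes the inverse of each coefficient. It is only an illustration: the linear mapping and the function names are assumptions, and an exact inverse also requires the inverse operations to be applied in the reverse order of the original transformations.

```python
# Allowed range of each coefficient (Sc expressed as a fraction).
BOUNDS = {"Tx": (-6.0, 6.0), "Ty": (-6.0, 6.0),
          "alpha": (-30.0, 30.0), "Sc": (0.9, 1.1)}

def denormalize(name, value):
    """Map a network output in [-1, 1] back to [min, max] (assumed linear)."""
    low, high = BOUNDS[name]
    return low + (value + 1.0) * (high - low) / 2.0

def inverse_coefficients(coeffs):
    """Inverse of each elementary transformation, taken separately."""
    return {"Tx": -coeffs["Tx"],        # translate back horizontally
            "Ty": -coeffs["Ty"],        # translate back vertically
            "alpha": -coeffs["alpha"],  # rotate in the opposite direction
            "Sc": 1.0 / coeffs["Sc"]}   # undo the zoom

# Hypothetical network outputs for one badly framed image.
outputs = {"Tx": 0.5, "Ty": -0.5, "alpha": 0.5, "Sc": 0.0}
estimated = {name: denormalize(name, value) for name, value in outputs.items()}
correction = inverse_coefficients(estimated)
print(estimated, correction)
```

In the variant described above, the network would output the correcting coefficients directly, so only the denormalization step would remain.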

Claims

1. A method of framing an object in an image (I), said object belonging to a category of objects having common distinguishing characteristics, and said method using a network (RES) of artificial neurons previously subjected to a learning phase (φ1), characterized in that it comprises the steps of:

- locating (b1) said object in said image (I), so as to obtain a first framing (C1) of said object defining a piece (T) of said image,
- applying (b2) said piece (T) of the image at the input of said network (RES) of neurons, and obtaining at the output transformation coefficients (αᵣ, Tᵣx, Tᵣy, Sᵣc) making it possible to obtain a second framing (CF) of said object,

said learning phase (φ1) having trained said network (RES) to provide at the output transformation coefficients (αᵣ, Tᵣx, Tᵣy, Sᵣc) allowing a reframing on the basis of pieces of learning images.

2. Method according to claim 1, characterized in that said network (RES) of neurons is a heterogeneous layer neural network comprising at least one hidden convolution layer.

3. Method according to claim 1 or 2, characterized in that said network (RES) of neurons is a heterogeneous layer neural network comprising two hidden convolution layers (C₁, C₃) between which is interposed a subsampling layer (S₂).

4. Method according to any one of claims 1 to 3, characterized in that said network (RES) of neurons comprises six layers including four hidden layers (C₁, S₂, C₃, N₄), an input layer (E) and an output layer (S).

5. Method according to any one of claims 1 to 4, characterized in that the locating step (b1) uses a heterogeneous layer neural network comprising at least one hidden convolution layer.

6. Method according to any one of claims 1 to 5, characterized in that said transformation coefficients (αᵣ, Tᵣx, Tᵣy, Sᵣc) at the output of the neural network correspond to:

- a translation coefficient (Tx) along a first axis of said first framing,
- a translation coefficient (Ty) along a second axis of said first framing,
- a rotation coefficient (α) with respect to the center of gravity of said first framing,
- and a scaling coefficient (Sc).

7. Device for framing an object in an image (I), said object belonging to a category of objects having common distinguishing characteristics, implementing the method according to any one of claims 1 to 6.

8. A computer program comprising instructions for implementing the method of any one of claims 1 to 6 when executed on a computer.