WO2008104453A1

WO2008104453A1 - Method of automatically recognizing and locating entities in digital images

Info

Publication number: WO2008104453A1
Application number: PCT/EP2008/051608
Authority: WO
Inventors: Nicolas Allezard
Original assignee: Commissariat A L'energie Atomique
Priority date: 2007-02-16
Filing date: 2008-02-11
Publication date: 2008-09-04

Abstract

The general field of the invention is that of the recognition of entities in digital images. The recognition method according to the invention comprises a first learning step and a second recognition step. The learning step consists in parameterizing a cascade composed of stages of classifiers on the basis of a series of thumbnails of the entity to be recognized and a series of thumbnails not containing said entity. The recognition step consists in passing the parts of the digital image to be searched through the cascade of classifiers, the parts of the image that have successfully passed every stage of the cascade being declared to contain the entity sought. Each of the images analyzed by the method is represented by a set of multi-variable local descriptors, each preferably consisting on the one hand of an N-component histogram of the intensity gradients in the image as a function of given directions and on the other hand of the sum of the magnitude of the gradient in the image divided by the area of said image.

Description

METHOD FOR RECOGNIZING AND AUTOMATICALLY LOCATING ENTITIES IN DIGITAL IMAGES

The field of the invention is that of the recognition and location of entities in digital images. Digital images are ubiquitous today and many computer tools have been developed to exploit them automatically. The fields of application are very numerous. As examples,

• video surveillance of installations, which notably ensures the detection of suspicious persons or vehicles,

• video assistance in road vehicles such as cars, trucks, construction machinery or dump trucks, where video surveillance warns the driver by automatically detecting the presence of pedestrians or vehicles,

• the indexing of images by automatically determining the presence of certain objects,

• the assistance of people at home, where falls are detected, for example, by recognizing the standing or sitting posture of the supervised persons.

A lot of work has been done in this area. The best-known methods, for example for the recognition of people, are based on intensity models of different attitudes. The publication by C. Curio, J. Edelbrunner et al. entitled "Walking pedestrian recognition", published in IEEE Transactions on ITS, Vol. 1, No. 3, pages 155-163, Sept. 2000, describes models of this type. In this publication, people are searched for by calculating, at different scales, a distance to the reference models. Some approaches propose, within the same framework, a hierarchy of attitudes; see the document by D. M. Gavrila and J. Giebel, "Shape-based pedestrian detection and tracking", published in Proc. IEEE Intelligent Vehicles Symposium, 2002. Others, such as the document by H. Nanda and L. Davis, "Template based pedestrian detection in infrared videos", from Proc. of IEEE Intelligent Vehicles Symposium, 2002, define a probabilistic model to describe the possible attitudes. These methods can be further enhanced by combining them with approaches based on cascades of classifiers. The document by P. Viola, M. Jones et al., "Robust real-time object detection", from the Second International Workshop on Statistical and Computational Theories of Vision, Vancouver, Canada, 2001, proposes such an approach.

In this last publication of Viola, there is described an image detection method comprising a learning phase and a recognition phase. This method is based on the use of Haar wavelet type single-variable descriptors, which are relatively simple shape descriptors (see black and white rectangles in Figure 1 on page 4 of this document). The learning algorithm uses a small number of criteria for classifying images. It scans the image and combines the descriptors to quickly eliminate the parts of the image that do not contain the object to be recognized. This document also describes the use of integral images to speed up calculations. The descriptors used give fast computing times but are too basic to capture the information. A large number of them are needed to discriminate positive images containing the desired object from negative images that do not contain the desired object.

In summary, the publication Viola et al. presents the use of a set of mono-variable descriptors in a cascade of classifiers, where a single-variable descriptor corresponds to each stage of the cascade. This method has the advantage of requiring only rudimentary calculations done on one variable at a time, but it requires calculations on a large number of pixels, and many of these calculations are useless.

To improve the efficiency of the descriptors, the document by N. Dalal and B. Triggs, "Histograms of Oriented Gradients for Human Detection", from International Conference on Computer Vision and Pattern Recognition, 2, 886-893, June 2005, describes a detection method based on a multi-variable descriptor that uses, instead of Haar wavelets, a histogram of oriented gradients, known by the acronym HOG. This histogram comprises 9 sectors oriented from 0 degrees to 180 degrees, with local normalization of contrast (see § 6 of the document by N. Dalal). This makes the method insensitive to variations in light intensity. The document describes the use of a linear classifier of the SVM type (support vector machine; in French, "Séparateur à Vaste Marge"). This classifier, combined with the HOG descriptor, reduces the false-positive rate by an order of magnitude compared with Haar wavelets. The method uses a single detector and not a cascade of classifiers. However, for some applications, it is not sufficiently discriminating.

The method according to the invention makes it possible to overcome the above disadvantages. It achieves high detection rates while maintaining a high calculation speed. It makes it possible to determine the presence of previously learned objects and to locate them in the image. The automatic understanding of the observed scenes, or the indexing of images, thus becomes possible.

The method according to the invention uses a set of multi-variable descriptors to represent an entity to be recognized. This set of multi-variable descriptors corresponds to a large number of components, typically at least an order of magnitude greater than for a set of single-variable descriptors. The use of such multi-variable descriptors therefore poses problems of calculation and of choice if a good compromise is to be obtained between the richness of the descriptors and the speed of calculation.

The method consists mainly of two steps. The first step, prior to recognition, is the learning of the entity to be searched for. This learning consists of parameterizing a cascade of classifiers from a series of thumbnails representative of the entity to be recognized and a series of thumbnails that do not contain it. The parameterization consists in defining, for each stage of the cascade, the most discriminating components among a set of local multi-variable descriptors of the image, together with the associated thresholds making it possible to recognize the entities sought. The second step, recognition, consists of passing through the cascade of classifiers the parts of the image in which the search must be carried out. The parts of the image that have successfully passed all stages of the cascade are declared to contain the desired entity.

According to a preferred embodiment of the invention, a multi-variable local descriptor of the image consists of a histogram of the orientation of the gradients and a density component of the magnitude of the gradient.

One aspect of the invention resides in the judicious parameterization of the cascade, which makes it possible to minimize the number of local multi-variable descriptors used and to optimize their use in the successive stages of the cascade. The criterion used according to the invention at each stage of the cascade for selecting a local descriptor or a subset of local descriptors is its efficiency in statistically separating the thumbnails containing an entity to be recognized from the thumbnails not containing it.

This method has many advantages over the prior art methods listed above:

• It is based on a new type of multi-variable descriptor composed on the one hand of a histogram of the gradient direction, possibly weighted by the magnitude, and on the other hand of a component related to the density of the magnitude of the gradient in the calculation area. This descriptor is calculated so as to be invariant to affine changes in brightness;

• The use of integral images allows the quick implementation of the descriptor calculation;

• The learning phase is managed so that an almost exhaustive search is made on the number and location of possible descriptors;

• A fast line-by-line scan of the image limits distant memory accesses during the recognition phase and saves significant execution time.

More specifically, the subject of the invention is a method for recognizing and automatically locating an entity in a digital image composed of pixels each having a luminance level, said pixels forming a matrix of rows and columns, the pixels being indexed (i, j) in said matrix, said method comprising a first so-called learning step and a second so-called recognition step,

The learning step consisting of parameterizing a cascade composed of stages of classifiers from a series of images of the entity to be recognized and a series of images not containing the entity to be recognized,

The recognition step consisting of passing through the cascade of classifiers the parts of the digital image in which the search is to be made, the parts of the image having successfully passed all the stages of the cascade being declared to contain the desired entity, characterized in that each of the different images analyzed by the method is represented by a set of multi-variable local descriptors, and in that the parameterization of the cascade of classifiers consists in selecting for each stage a subset of components among the set of multi-variable local descriptors according to their efficiency in statistically separating the images containing the entity to be recognized from those not containing it.

Advantageously, each multi-variable local descriptor consists on the one hand of a histogram of orientation of the intensity gradients in the image as a function of given directions comprising N components and on the other hand of the sum of the magnitude of the gradient in the image divided by the surface of said image.

Advantageously, the histogram comprises nine components whose directions vary from 0 degrees to 180 degrees.

Advantageously, the calculation of the local multi-variable descriptors comprises the following preliminary steps:

• The coordinates of the image being defined in an orthogonal coordinate system (i, j), calculation of the luminance derivatives in i and j at each pixel of the image;

• Calculation of the orientation and the magnitude of the luminance gradient at each pixel;

• Calculation of N orientation images corresponding to the N components of the histogram;

• Calculation of the N + 1 integral images corresponding to the N orientation images and to the magnitude, the integral image of a parameter at a pixel of coordinates (i, j) being equal to the sum of that parameter over the pixels of coordinates (x, y) whose x and y coordinates are less than or equal to i and j, respectively.

Advantageously, the computation of the histogram comprises a step of normalization of the luminance gradient in each pixel, said step being carried out in two successive substeps, said first substep consisting in calculating the standard deviation of the luminance in a sliding window, said second sub-step of multiplying the magnitude of the luminance gradient in each pixel by the ratio between the standard deviation calculated on the sliding window and a standard deviation of reference.

Advantageously, when the standard deviation of the luminance in a sliding window is less than or equal to a minimum value, the second sub-step is omitted.

Advantageously, the components of the histogram are normalized so that their sum is equal to unity.

Advantageously, the classification algorithm is of the AdaBoost type, the classifiers being of weak type, that is to say algorithms capable of discriminating two classes of objects at least as well as chance.

Advantageously, the training of a level of the cascade can be carried out, in a first embodiment, according to the following substeps:

• Calculation of all local multi-variable descriptors on previously selected images containing or not containing the entity to be recognized;

• Launch of the "learning by combination of decisions" software, better known as "AdaBoost", on all components of all the preceding descriptors;

• Determination of the chosen descriptors, from which the selected components are extracted;

• Second launch of the "learning by combination of decisions" software, only on the components of the chosen descriptors;

• Adaptation of the threshold of the final classifier to obtain a desired minimum detection rate.

Advantageously, the training of a level of the cascade is carried out, in a second embodiment, according to the following substeps:

• Calculation of all multi-variable local descriptors on previously selected thumbnails with or without the entity to be recognized;

• Launch of "learning by combining decisions" software on all components of all descriptors;

• Adaptation of the threshold of the final classifier in order to obtain the desired minimum detection rate.

In addition, each stage of the cascade having a first detection rate of the entity present in a thumbnail and a second detection rate of the entity absent in a thumbnail, the number of stages is such that the overall detection rate of the entity present in a thumbnail by the cascade exceeds 97 percent and that the overall detection rate of the entity absent in a thumbnail is less than one billionth.

Finally, the calculation of the descriptors of a stage as well as the associated classification score is made for a whole line of the image.

The invention will be better understood and other advantages will become apparent on reading the description which follows given by way of non-limiting example and by virtue of the appended figures among which:

FIG. 1 represents the general principle of the method implemented in a digital image;

FIG. 2 shows thumbnails of the entity to be recognized;

FIG. 3 represents the principle of the normalization of the magnitude of the gradient of the luminance of a pixel of the image;

FIG. 4 represents the principle of calculating the integral image of a pixel;

FIG. 5 represents the principle of calculating the integral image of a zone of pixels;

FIG. 6 represents the principle of calculating the histogram of intensity gradients in the image;

Figure 7 shows the operating principle of the recognition step.

The process comprises two steps. In the recognition step, a set of windows is moved across the image to search for the object. This set consists of windows of different sizes, because the possible sizes of the object in the image are not known and vary with the distance from the object to the camera. This step is illustrated in Figure 1, which represents a view of a street with silhouettes of people to be discriminated among vehicles and houses. The three rectangles outlined in white are previously discriminated shapes S. The rectangle F outlined in white, with a vertical arrow and a horizontal arrow, corresponds to the sliding window, which can move in the horizontal and vertical directions. The multi-variable local descriptors of the thumbnail contained in each window are calculated and their likelihood is tested. Overall, many windows are tested and few of them contain objects of interest. With a cascade of classifiers, more complex descriptors are gradually added at each stage of the cascade, which becomes more and more selective. The first levels of the cascade eliminate the windows whose content bears little resemblance to the object sought; the following levels calculate descriptors that are more and more complex and closer and closer to the object sought. The number of windows thus decreases very quickly through the successive stages of the cascade, allowing a fast scan of the image.

The shape of the objects to be recognized is represented by a set of local multi-variable descriptors, preferably based on the direction and the density of the gradient of the light intensity. In Figure 2, the aim is to identify a human face by means of a set of suitable local descriptors represented by the different white rectangles F. As seen in this figure, a first descriptor can correspond to the whole face and secondary descriptors to more specific areas such as the nose, the mouth or the eyes. This type of signature has already been used in pattern recognition, for example for the recognition of hand position, as described in the publication by W. T. Freeman and M. Roth, "Orientation Histograms for Hand Gesture Recognition", published in IEEE Intl. Wkshp. on Automatic Face and Gesture Recognition, Zurich, June 1995, or more recently for the detection of human forms. On this last point, see N. Dalal and B. Triggs, "Histograms of Oriented Gradients for Human Detection", published in International Conference on Computer Vision and Pattern Recognition, 2, 886-893, June 2005.

In the method according to the invention, the appearance of the objects is captured by several local multi-variable descriptors. Each of them is formed of:

• a histogram of the direction of the gradient, possibly weighted by the magnitude,

• the magnitude of the gradient divided by the area of the local zone.

The signature of the luminous intensity gradient is constructed after computing the horizontal and vertical derivatives of the image, using the recursive algorithm for computing derivatives proposed by Deriche in the reference documents entitled "Using Canny's criteria to derive a recursively implemented optimal edge detector", from The International Journal of Computer Vision, 1(2):167-187, May 1987, and "Fast algorithm for low-level vision", from IEEE Transactions on Pattern Analysis and Machine Intelligence, 1(12):78-88, January 1990.

When calculating the signature, the magnitude of the gradient at each pixel is accumulated in the component of the histogram corresponding to the direction of the gradient at that pixel. A histogram having nine components whose directions vary from 0 to 180 degrees is preferably used. Indeed, nine orientation directions are found to represent a good compromise between discrimination and speed of calculation. The histogram is then normalized to form a distribution: the sum of the components of the histogram is calculated, then each component is divided by this sum.

The orientation histogram is enriched by the sum of the magnitude of the gradient in the local area divided by the surface of the area. This additional component informs about the presence of contour in the zone regardless of its orientation. Experience shows that it is very often chosen by the classification algorithm in the first levels of the cascade. It therefore allows a rough but rapid discrimination between the background of the image and the desired patterns.

Then, a normalization of the luminance gradient is carried out, since the lighting variations of the scene directly influence the derivatives of the signal. Indeed, in the case of a linear variation of the luminance l(u, v) of the type a·l(u, v) + b, the derivative of the signal is multiplied by a. The same is true for the standard deviation of the luminance signal, which is multiplied by the same factor a. Therefore, prior to the calculation of the signatures, the standard deviation of the luminance signal is calculated in a sliding window that is moved across the image, as illustrated in FIG. 3, where the square centered on the pixel P represents the sliding window FG. The magnitude of the gradient of the pixel at the center of the sliding window is then multiplied by the ratio of the standard deviation calculated on the window to a reference standard deviation. Where the standard deviation calculated on the window is too small, the magnitude remains unchanged, because applying the normalization in this case only amplifies the noise of the image.

The computation of all the local multi-variable descriptors by the classical method used for the construction of a histogram is too slow to allow real-time use of this signature. It is preferable to use the technique of integral images to accelerate this step, as described in the publication by F. Porikli, "Integral Histogram: A fast way to extract histograms in Cartesian spaces", in Proc. IEEE Conf. on Computer Vision and Pattern Recognition (CVPR), 2005.

An integral image is defined as follows. Each pixel of coordinates (i, j) of an integral image imInt contains the sum of the pixels of the original image im from line 0 to i and from column 0 to j.

So:

$$\mathrm{imInt}(i, j) = \sum_{0 \le x \le i,\; 0 \le y \le j} \mathrm{im}(x, y)$$

The definition of an integral image is illustrated in FIG. 4. The set of points represented in dashed lines corresponds to the integral image I.I. of the pixel A. By this means, the calculation of the sum of the pixels included in a rectangular zone delimited by the points A, B, C and D is always carried out as follows:

$$\mathrm{imInt}(i_a, j_a) - \mathrm{imInt}(i_b, j_b) - \mathrm{imInt}(i_d, j_d) + \mathrm{imInt}(i_c, j_c)$$

This calculation is illustrated in FIG. 5. The calculation of this sum is therefore independent of the size of the area on which one sums. It is this property which is used later for the descriptor calculation.
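The constant-cost rectangle sum can be sketched as follows. This is a minimal illustration using NumPy, not the implementation of the method; the function names and the (top, left, bottom, right) convention for the rectangle are assumptions made for the example, since the exact placement of the points A, B, C and D is only given in the figure.

```python
import numpy as np

def integral_image(im):
    """imInt[i, j] = sum of im[x, y] for 0 <= x <= i and 0 <= y <= j."""
    return im.cumsum(axis=0).cumsum(axis=1)

def rect_sum(im_int, top, left, bottom, right):
    """Sum of the original pixels in rows top..bottom and columns left..right
    (bounds inclusive), using at most four look-ups in the integral image."""
    total = im_int[bottom, right]
    if top > 0:
        total = total - im_int[top - 1, right]
    if left > 0:
        total = total - im_int[bottom, left - 1]
    if top > 0 and left > 0:
        total = total + im_int[top - 1, left - 1]
    return total

# Small check against a direct summation.
im = np.arange(20, dtype=np.int64).reshape(4, 5)
im_int = integral_image(im)
zone_sum = rect_sum(im_int, 1, 1, 2, 3)
```

Whatever the size of the zone, the number of look-ups stays the same, which is the property exploited for the descriptor calculation.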

The component C of a local histogram is equal to the sum of the magnitudes of the pixels whose gradient direction corresponds to this component C. Thus, in the case of an N-component histogram, N orientation images are defined, into which the magnitude of the gradient is copied according to its direction.

Prior to any descriptor calculations, the following operations must be performed:

• calculation of the luminance derivatives in i and j at each pixel of the image;

• calculation of the orientation and the magnitude of the luminance gradient at each pixel;

• calculation of the N orientation images corresponding to the N components of the histogram;

• calculation of the N + 1 integral images corresponding to the N orientation images and to the magnitude.

The sequence of operations above is illustrated in FIG. 6. From the orientation information O and the magnitude information M, symbolized by diamonds, the N orientation images I.O. corresponding to the N components of the histogram are calculated, then the N integral images I.I.O. corresponding to the N orientation images, as well as the integral image I.I.M. of the magnitude. The calculation of the histogram H on any zone of the image is then carried out at fixed cost by the evaluation of N sums corresponding to the N components C. The same goes for the calculation of the magnitude of the gradient on the local zone: the integral image of the magnitude is calculated beforehand to accelerate subsequent calculations.

Thus, as illustrated in FIG. 6, the use of the descriptor containing a histogram of nine components requires the calculation of nine orientation images followed by the calculation of ten integral images, nine corresponding to the orientation images and the last corresponding to the magnitude.
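The preparation of the nine orientation images and the ten integral images might be sketched as below. This is an illustration under stated assumptions: `np.gradient` is used as a simple stand-in for the recursive Deriche filter, the luminance normalization is left out, and the function name `orientation_integral_images` is hypothetical.

```python
import numpy as np

def orientation_integral_images(lum, n_bins=9):
    """Return (integral images of the n_bins orientation images,
    integral image of the gradient magnitude)."""
    lum = lum.astype(np.float64)
    # Luminance derivatives along i (rows) and j (columns).
    # Assumption: finite differences replace the Deriche filter.
    d_i, d_j = np.gradient(lum)
    magnitude = np.hypot(d_i, d_j)
    # Unsigned gradient direction in [0, 180) degrees.
    angle = np.degrees(np.arctan2(d_i, d_j)) % 180.0
    # Index of the histogram component for each pixel; the clamp guards
    # against rounding that would yield an index equal to n_bins.
    bin_index = np.minimum((angle / (180.0 / n_bins)).astype(int), n_bins - 1)
    # Orientation image k holds the magnitude where the direction falls
    # in component k, and zero elsewhere.
    orient = np.zeros((n_bins,) + lum.shape)
    for k in range(n_bins):
        mask = bin_index == k
        orient[k][mask] = magnitude[mask]
    # N integral images of the orientation images, plus one for the magnitude.
    integ_orient = orient.cumsum(axis=1).cumsum(axis=2)
    integ_mag = magnitude.cumsum(axis=0).cumsum(axis=1)
    return integ_orient, integ_mag
```

Because every pixel contributes its magnitude to exactly one orientation image, the nine orientation images sum to the magnitude image, and the same holds for their integral images.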

It is possible to calculate a single component of the histogram independently of the other eight. Indeed, the histogram being weighted by the magnitude of the gradient, the sum of the nine components is equal to the sum of the magnitude of the gradient on the area of calculation. It is not necessary to calculate the nine components to know the sum. The calculation of a single component of the histogram is done first by evaluating the sum of the magnitude corresponding to this orientation and then by dividing this sum by the sum of the magnitude.
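The calculation of one isolated component can be illustrated as follows. The sketch assumes integral images stored with an extra leading row and column of zeros (a convenience chosen for the example, so that zones touching the border need no special case); the helper names and the random test data are hypothetical.

```python
import numpy as np

def padded_integral(img):
    """Integral image with a leading zero row and column."""
    out = np.zeros((img.shape[0] + 1, img.shape[1] + 1))
    out[1:, 1:] = img.cumsum(axis=0).cumsum(axis=1)
    return out

def region_sum(pint, top, left, bottom, right):
    """Sum over rows top..bottom and columns left..right (inclusive)."""
    return (pint[bottom + 1, right + 1] - pint[top, right + 1]
            - pint[bottom + 1, left] + pint[top, left])

def histogram_component(pint_orient_k, pint_mag, top, left, bottom, right):
    """Normalized component k: magnitude accumulated for orientation k,
    divided by the total magnitude on the zone. The other components
    are never evaluated."""
    total = region_sum(pint_mag, top, left, bottom, right)
    if total <= 0:
        return 0.0
    return region_sum(pint_orient_k, top, left, bottom, right) / total

# Illustrative data: random magnitudes and random orientation indices.
rng = np.random.default_rng(0)
magnitude = rng.random((8, 8))
bins = rng.integers(0, 9, size=(8, 8))
pint_mag = padded_integral(magnitude)
pint_k = [padded_integral(np.where(bins == k, magnitude, 0.0)) for k in range(9)]
component_3 = histogram_component(pint_k[3], pint_mag, 2, 1, 6, 5)
```

Only two integral images are read here, the one for the chosen orientation and the one for the magnitude, instead of all ten.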

To search for entities in the image, the different stages of the cascade must first have been parameterized. This parameterization, also called training, is made from a "ground truth" consisting of a set of thumbnails containing the object to be searched for, called positive examples, and a set of thumbnails not containing such an object, called negative examples. According to one embodiment, the number of local multi-variable descriptors used is arbitrarily defined for each stage. According to another embodiment, the number of local multi-variable descriptors is chosen in advance for each stage or for certain stages. For example, the first stage may include one descriptor, the second one, the third three ...

Each stage is trained to detect a very large majority of the objects sought while minimizing the false-positive rate, that is to say the proportion of windows wrongly recognized as containing the object. The overall detection rate of the cascade is equal to the product of the detection rates of all the stages. The same is true for the false-positive rate. For example, if the cascade includes 30 stages each having a detection rate of 99.9% and a false-positive rate of 50%, then the overall detection rate is equal to $0.999^{30} \approx 97\%$ and the overall false-positive rate is equal to $0.5^{30}$, i.e. about $10^{-9}$. Thus, 97% of the objects are detected for one error per billion windows explored.

Figure 7 schematically illustrates the passage of a set of windows F representing an image through the stages E of the cascade. The stages E are represented by diamonds. Each diamond has one input and two outputs, symbolized by arrows in Figure 7. The accepted windows F_A pass through the horizontal outputs; the rejected windows F_R pass through the vertical outputs. By choosing the number of descriptors for each stage, it is possible to match the complexity of the task to the number of sub-windows remaining to be tested: as shown in the two graphs at the bottom of Figure 7, the complexity of the task increases as the number of sub-windows decreases.

If we assume 50000 windows per image, then we have, on average, an error every 20000 images.
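The arithmetic behind these figures can be reproduced in a few lines. The variable names are chosen for the example; the numerical values are those quoted in the text.

```python
n_stages = 30
stage_detection_rate = 0.999
stage_false_positive_rate = 0.5
windows_per_image = 50_000

# Overall rates are the products of the per-stage rates.
overall_detection = stage_detection_rate ** n_stages
overall_false_positive = stage_false_positive_rate ** n_stages

# Expected number of wrongly accepted windows per image, and its inverse.
false_alarms_per_image = windows_per_image * overall_false_positive
images_per_false_alarm = 1.0 / false_alarms_per_image
```

With the exact value of 0.5 to the power 30 (slightly below one billionth), the result is a little over 21,000 images per error, which matches the order of magnitude of 20,000 obtained with the rounded false-positive rate.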

The number of descriptors, their relative sizes and positions are chosen according to the method described above during the learning phase. The number of descriptors selected depends on the minimum detection rate and the maximum false-positive rate allowed at each stage. This number increases as one progresses through the cascade because of the increasing complexity of the classification task to be performed.

During the learning phase, a large number of descriptors, possibly several hundred, with different sizes and positions, are calculated on the positive and negative examples. This phase is implemented using a powerful classification algorithm called "AdaBoost". "AdaBoost" is the contraction of "Adaptive Boosting", which can be translated as "learning by combination of decisions" or "doping". The originators of this algorithm are Yoav Freund and Robert Schapire. For information on this algorithm, see the publications by J. H. Friedman, T. Hastie and R. Tibshirani entitled "Additive logistic regression: a statistical view of boosting", Dept. of Statistics, Stanford University, Technical Report, 1998, and by Y. Singer and R. Schapire entitled "Improved boosting algorithms using confidence-rated predictions", from Machine Learning 37, 237-336, 1999.

This algorithm minimizes the classification errors by progressively adding "weak classifiers", in this case one-level decision trees, also known as "decision stumps".

A weak classifier is an algorithm capable of discriminating two classes of objects at least as well as chance would, that is to say, it is wrong no more than one time in two on average. The classifier provided is then weighted by the quality of its classification: the better it classifies, the more important it will be. A weight is associated with each learning example; this weight is increased if the example is misclassified and decreased otherwise. The incorrectly classified examples therefore take on more importance for the weak classifier during the successive learning phases, commonly called "boosting rounds", in order to compensate for the errors made by the previous classifiers. The final classifier thus consists of the weighted sum of the outputs of the weak classifiers. The shape under study is classified positively, that is to say as an object to be recognized, if this sum is greater than zero, and negatively otherwise.

The advantages of this method are numerous: besides the fact that it makes it possible to obtain very good results of classification, it relies on solid theoretical foundations. Moreover the algorithm is robust, fast during the detection phase and finally, parameterizable. The general procedure of the algorithm is described below:

Initialization phase:

Inputs: m examples x_i (feature vectors) and their classes y_i belonging to the set {-1, +1};

T being the number of iterations to be performed; h_t(x) being the weak classifier chosen at iteration t;

Output: H(x), the final classifier.

Initialization of the weights of the examples: $D_1(i) = 1/m$.

Iterations phase:

For t = 1 to T:

Training of the weak classifier using the distribution $D_t$. Choice of the best classifier for iteration t, $h_t : X \to \mathbb{R}$. Choice of $\alpha_t \in \mathbb{R}$.

Update of the weights of the examples:

$$D_{t+1}(i) = \frac{D_t(i)\,\exp\left(-\alpha_t\, y_i\, h_t(x_i)\right)}{Z_t}$$

where $Z_t$ is a normalization factor (chosen so that $D_{t+1}$ is a distribution).

Final phase:

The final classifier is given by:

$$H(x) = \mathrm{sign}\left(\sum_{t=1}^{T} \alpha_t\, h_t(x)\right)$$

In the approach of the method according to the invention, the weak classifiers employed are one-node decision trees, conventionally called "decision stumps". Each component of a descriptor is therefore associated with a one-node tree whose threshold must be determined, as well as the values returned. During learning, all the components of the set of descriptors are evaluated, then the one that minimizes the classification error is chosen for the current "boosting round". The process is repeated until the number of "rounds" T is reached.
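The boosting procedure with one-node trees can be sketched as below. This is a simplified illustration and not the implementation of the method: it uses the discrete variant of AdaBoost, in which each stump returns -1 or +1 and the coefficient is set to half the logarithm of (1 - error) / error, a standard choice that the text does not specify. The function names and the exhaustive threshold search are assumptions made for the example.

```python
import numpy as np

def stump_predict(X, feature, threshold, polarity):
    """One-node tree: +1 on one side of the threshold, -1 on the other."""
    return np.where(polarity * (X[:, feature] - threshold) >= 0, 1, -1)

def best_stump(X, y, D):
    """Exhaustive search of the component, threshold and polarity that
    minimize the classification error weighted by the distribution D."""
    best = (np.inf, 0, 0.0, 1)
    for feature in range(X.shape[1]):
        for threshold in np.unique(X[:, feature]):
            for polarity in (1, -1):
                pred = stump_predict(X, feature, threshold, polarity)
                err = D[pred != y].sum()
                if err < best[0]:
                    best = (err, feature, threshold, polarity)
    return best

def adaboost(X, y, T):
    m = len(y)
    D = np.full(m, 1.0 / m)                      # D_1(i) = 1/m
    stumps = []
    for _ in range(T):
        err, feature, threshold, polarity = best_stump(X, y, D)
        err = min(max(err, 1e-12), 1 - 1e-12)    # avoid log(0) and division by zero
        alpha = 0.5 * np.log((1 - err) / err)    # assumed choice of alpha_t
        pred = stump_predict(X, feature, threshold, polarity)
        D = D * np.exp(-alpha * y * pred)        # misclassified examples gain weight
        D = D / D.sum()                          # division by Z_t
        stumps.append((alpha, feature, threshold, polarity))
    return stumps

def predict(stumps, X):
    """Sign of the weighted sum of the weak classifier outputs."""
    score = sum(a * stump_predict(X, f, t, p) for a, f, t, p in stumps)
    return np.where(score > 0, 1, -1)
```

For instance, `adaboost(X, y, T=3)` followed by `predict(stumps, X)` classifies the training examples; in the method described here, the columns of `X` would be the components of the local descriptors.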

It should be noted, however, that, conventionally, the calculation of a single component of a descriptor, except for the magnitude-related component, requires a normalization that requires calculating the entire vector. If these components are not then chosen by the boosting algorithm, they will have been calculated in vain.

In order to make the descriptor calculations as profitable as possible, it is interesting to set up a learning phase consisting of two stages:

A first step where the number of "boosting rounds" is equal to the maximum number of descriptors accepted for this classification stage. Here, the algorithm evaluates all components of all descriptors.

A second step where we determine the origin of the components chosen by "AdaBoost". Then we restrict the possible components for the "boosting" to the only components from the list of previously chosen descriptors. The boosting is then continued on this subset until a satisfactory classification error is obtained.

The training of a level of the cascade is thus realized as follows:

• Calculation of all local descriptors on previously selected positive and negative examples.

• Launch of "AdaBoost" on all components of all descriptors.

• Calculation of the descriptors from which the selected components are extracted.

• Second launch of "AdaBoost" only on the components of the chosen descriptors.

• Adaptation of the threshold of the final classifier to obtain the desired minimum detection rate (typically 0.999 per stage for a false positive rate of 0.5).

In this approach, the algorithm is not left entirely free to choose the relevant components, but this choice is restricted to components from descriptors already calculated. One variation is to leave the algorithm free to select the relevant components and to use the fast calculation of an isolated component.

The training of a level of the cascade is then carried out as follows:

• Calculation of all the components of the local descriptors on the positive and negative examples previously selected.

• Launch of "AdaBoost" on all components of all descriptors.

This variant is likely to bring better results on certain types of objects to be recognized.

The levels of the cascade are trained sequentially. Negative examples for training the first stage are randomly selected in a video sequence that does not contain positive examples. This first stage is then tested on the same video, and its false positives are used as negative examples for the training of the next stage.

This process is repeated all along the cascade, the later stages being used to correct the misclassifications of the earlier stages. The training of the cascade stops when the number of false positives obtained is too low to allow satisfactory learning.
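The effect of chaining stages can be illustrated with a short calculation. The sketch below assumes that the per-stage rates multiply along the cascade, which is the usual simplifying assumption for cascades and is not stated in the text. It uses the typical per-stage values given above (0.999 and 0.5) and the overall targets of claim 10 (97 percent and one billionth). The function names are hypothetical.

```python
import math

def cascade_rates(n_stages, stage_detection=0.999, stage_false_positive=0.5):
    """Overall detection and false positive rates, assuming per-stage rates multiply."""
    return stage_detection ** n_stages, stage_false_positive ** n_stages

def stages_needed(target_false_positive, stage_false_positive=0.5):
    """Smallest number of stages whose combined false positive rate meets the target."""
    return math.ceil(math.log(target_false_positive) / math.log(stage_false_positive))
```

Under these assumptions, `stages_needed(1e-9)` gives 30 stages, for which `cascade_rates(30)` yields an overall detection rate slightly above 0.97 and an overall false positive rate slightly below one billionth.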

During the detection phase, only the local descriptors or the components chosen by the learning algorithm are actually calculated. Conventionally, the implementation is carried out as follows: the signature of the stage is evaluated and, depending on the result, either the next stage is queried or the window is moved. The next stage is queried if, at this level, the window is judged to meet the criteria; otherwise the window is dropped.

The classic detection method is summarized in the algorithm shown below:

Loop over lines
    Loop over columns
        Loop over stages
            Calculate descriptors
            Calculate classification score
            If score < 0, exit loop over stages
        End loop over stages
    End loop over columns
End loop over lines

However, to limit the load on memory and processor resources, the calculation of the descriptors of a stage and of the associated classification score is performed for a whole line of the image. Indeed, calculating the same stage along the line ensures a certain continuity in the memory accesses, which is not the case when the successive stages are calculated at a given position. The gain in computation time is of the order of 20 to 30% of the overall time. The proposed detection method is summarized in the algorithm shown below:

Loop over lines
    Loop over stages
        Loop over columns
            If Score[Stage-1][c, l] > 0
                Calculate descriptors
                Calculate classification score
                Store Score[Stage][c, l]
            End if
        End loop over columns
    End loop over stages
End loop over lines
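The two loop orders may be compared with the following sketch. It is illustrative only: each stage is represented by a hypothetical callable `stage(l, c)` that stands for "calculate descriptors, then calculate the classification score" at line l and column c, and a score of exactly zero is assumed to pass. The memory-access benefit described above depends on the real descriptor computation and is not reproduced here; the sketch only shows that both orders accept the same windows.

```python
import numpy as np

def detect_classic(stages, n_rows, n_cols):
    """Position-major order: all stages are run at one window before moving on."""
    accepted = np.zeros((n_rows, n_cols), dtype=bool)
    for l in range(n_rows):
        for c in range(n_cols):
            ok = True
            for stage in stages:
                if stage(l, c) < 0:      # window rejected: leave the stage loop
                    ok = False
                    break
            accepted[l, c] = ok
    return accepted

def detect_by_line(stages, n_rows, n_cols):
    """Line-major order: one stage is run over a whole line before the next stage."""
    accepted = np.zeros((n_rows, n_cols), dtype=bool)
    for l in range(n_rows):
        alive = np.ones(n_cols, dtype=bool)   # windows that passed all previous stages
        for stage in stages:
            for c in np.flatnonzero(alive):   # only positions still alive are evaluated
                if stage(l, c) < 0:
                    alive[c] = False
        accepted[l] = alive
    return accepted
```

A window is declared to contain the entity only if it is still alive after the last stage, so `detect_by_line` returns the same mask as `detect_classic` while visiting the positions of a line stage by stage.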

The method of the invention makes it possible to choose which multi-variable descriptors to use at each stage of a cascade of classifiers, so as to minimize the calculations while ensuring a high recognition rate and a low false positive rate.

Claims

1. A method of recognizing and automatically locating an entity in a digital image composed of pixels each having a luminance level, said pixels forming a matrix of rows and columns, the pixels being indexed (i, j) in said matrix, said method comprising a first step, called the learning step, and a second step, called the recognition step,

the learning step consisting in parameterizing a cascade composed of stages of classifiers from a series of images of the entity to be recognized and a series of images not containing the entity to be recognized,

the recognition step consisting in passing through the cascade of classifiers the parts of the digital image in which the search is to be made, the parts of the image having successfully passed all the stages of the cascade being declared to contain the desired entity,

characterized in that each of the different images analyzed by the method is represented by a set of multi-variable local descriptors, and in that the parameterization of the cascade of classifiers consists in selecting for each stage a subset of components among the set of multi-variable local descriptors according to their efficiency in statistically separating the images containing and not containing the entity to be recognized, each multi-variable local descriptor consisting on the one hand of a histogram of the orientations of the intensity gradients in the image as a function of given directions, comprising N components, and on the other hand of the sum of the magnitude of the intensity gradient in the image divided by the area of said image.

2. Recognition and location method according to claim 1, characterized in that the histogram of the orientations of the intensity gradients (or HOG) comprises nine components whose directions vary from 0 degrees to 180 degrees.

3. Recognition and location method according to one of claims 1 to 2, characterized in that the calculation of the descriptors comprises the following preliminary steps:

• The coordinates of the image being defined in an orthogonal coordinate system (i, j), calculation of the X and Y luminance derivatives at each pixel of the image;

• Calculation of the orientations and the magnitude of the luminance gradient at each pixel;

• Calculation of N orientation images corresponding to the N components of the histogram;

• Calculation of the N + 1 integral images corresponding to the N orientation images and to the magnitude image, the integral image of a parameter at a pixel of coordinates (i, j) being equal to the sum of the same parameter over the pixels of coordinates (x, y) whose x and y coordinates are less than or equal to i and j, respectively.

4. A method of recognition and location according to claim 3, characterized in that the calculation of the histogram comprises a step of normalizing the luminance gradient at each pixel, said step being performed in two successive sub-steps, the first sub-step consisting in calculating the standard deviation of the luminance in a sliding window, and the second sub-step consisting in multiplying the magnitude of the luminance gradient at each pixel by the ratio between the standard deviation calculated over the sliding window and a reference standard deviation.

5. A method of recognition and location according to claim 4, characterized in that, when the standard deviation of the luminance in a sliding window is less than or equal to a minimum value, the second sub-step is omitted.

6. Recognition and location method according to one of the preceding claims, characterized in that the components of the gradient orientation histograms are normalized so that their sum is equal to unity.

7. Recognition and location method according to one of the preceding claims, characterized in that the classification algorithm is of the AdaBoost type, the classifiers being of the weak type, that is to say, algorithms capable of discriminating between two classes of objects at least as well as chance.

8. A method of recognition and location according to one of the preceding claims, characterized in that the training of a cascade level is performed according to the following sub-steps:

• Calculation of all local descriptors on previously selected thumbnails with or without the entity to be recognized;

• Launch of "learning by decision combination" software on all components of all previous descriptors;

• Calculation of the chosen descriptors, from which the selected components are extracted;

• Second launch of the "learning by decision combination" software only on the components of the chosen descriptors.

9. A method of recognition and location according to one of claims 1 to 7, characterized in that the training of a cascade level is performed according to the following sub-steps:

• Launch of "learning by decision combination" software on all components of all descriptors.

10. A method of recognition and location according to one of the preceding claims, characterized in that, each stage of the cascade having a first rate of detection of the entity when it is present in a thumbnail and a second rate of detection of the entity when it is absent from a thumbnail, the number of stages is such that the overall rate of detection by the cascade of the entity present in a thumbnail exceeds 97 percent and the overall rate of detection of the entity absent from a thumbnail is less than one billionth.

11. Recognition and location method according to one of the preceding claims, characterized in that the calculation of the descriptors of a stage and the associated classification score is performed for a whole line of the image.