EP1697862A1

EP1697862A1 - Method for indexing and identifying multimedia documents

Info

Publication number: EP1697862A1
Application number: EP04805546A
Authority: EP
Inventors: Hassane Essafi; Larbi Guezouli; Salima Sayah; Ali Behloul; Clarisse Mandridake; Louafi Essafi
Original assignee: Advestigo
Current assignee: Surys SA
Priority date: 2003-11-27
Filing date: 2004-11-25
Publication date: 2006-09-06
Anticipated expiration: 2024-11-25
Also published as: EP1697862B1; WO2005055086A1; FR2863080B1; ES2366439T3; IL175956A0; PL1697862T3; ATE510260T1; CA2547557A1; US20070271224A1; US7552120B2; FR2863080A1; AU2004294586A1

Abstract

The method of indexing multimedia documents comprises the following steps: a) for each document, identifying and extracting terms ti constituted by vectors characterizing properties of the document; b) storing terms ti in a term base comprising P terms; c) determining a maximum number N of desired concepts that group together the most pertinent terms ti; d) calculating the matrix T of distances between the terms ti of the term base; e) decomposing the set P of terms ti of the term base into N portions Pj (1<=j<=N) such that P=P1∪ P2 . . . ∪ Pj . . . ∪ PN, each portion Pj comprising a set of terms tij and being represented by a concept cj, the terms ti being distributed in such a manner that the terms that are farther apart are to be found in distinct portions Pl, Pm, and the terms that are closer together are to be found in the same portion Pl; f) structuring the concept dictionary; and g) constructing a fingerprint base made up of the set of concepts ci representing the terms ti of the documents, each document being associated with a fingerprint that is specific thereto.

Description

Method for indexing and identifying multimedia documents

The present invention relates to methods for indexing and identifying multimedia documents. From a general point of view, the identification of a multimedia document comprises two phases:

- an indexing phase, in which each document of a previously recorded database is characterized by a finite number of parameters that can easily be stored and manipulated later;

- a search phase, in which, following a request formulated by the user, for example the identification of a query image, all the multimedia documents that are similar to or satisfy this request are sought.

Methods already exist for indexing images that extract the shape attributes of the objects making up the image, if any, as well as those of the texture or the background color of the image. However, the known methods apply in very specialized fields or involve processing a very large amount of information, which makes the processing complex and slow.

The present invention aims to remedy the aforementioned drawbacks and to provide a method of indexing and identifying multimedia documents that is of general application, that streamlines processing, and that leads to shorter processing times while increasing the quality and reliability of the results, thereby making effective content-based searches possible.

These objects are achieved according to the invention by a method of indexing multimedia documents, characterized in that it comprises the following steps:

(a) identifying and extracting, for each document, terms ti constituted by vectors characterizing properties of the multimedia document to be indexed, such as the shape, texture, color, or structure of an image, the energy, oscillation rate, or frequency information of an audio signal, or a group of characters of a text,

(b) storing the terms ti characterizing properties of the multimedia document in a term base comprising P terms,

(c) determining a maximum number N of desired concepts grouping the most relevant terms ti, N being an integer less than P, and each concept ci being intended to group together all the terms that are close to one another in terms of their characteristics,

(d) calculating the matrix T of distances between the terms ti of the term base,

(e) decomposing the set P of terms ti of the term base into N parts Pj (1 ≤ j ≤ N) such that P = P1 ∪ P2 ∪ ... ∪ Pj ∪ ... ∪ PN, each part Pj comprising a set of terms tij and being represented by a concept cj, the terms ti being distributed in such a way that the most distant terms are in distinct parts Pl, Pm and close terms are in the same part Pl,

(f) structuring the dictionary of concepts so as to constitute a binary tree in which the leaves contain the concepts ci of the dictionary and the nodes contain the information needed to traverse the tree during a phase of identifying a document by comparison with previously indexed documents, and

(g) constructing a fingerprint base constituted by the set of concepts ci representing the terms ti of the documents to be indexed, each document being associated with a fingerprint of its own.

More particularly, each concept ci of the fingerprint base is associated with a set of information comprising the number NbT of terms in the documents where the concept ci is present.

According to a particular aspect of the invention, for each document where a concept ci is present, a fingerprint of the concept ci in the document is recorded, this fingerprint comprising the frequency of occurrence of the concept ci, the identification of the concepts that neighbor the concept ci in the document, and a score that is the mean of the similarity measures between the concept ci and the terms ti of the document that are closest to the concept ci.

Advantageously, the method according to the invention comprises a step of optimizing the partition of the set P of terms of the term base so as to decompose this set P into M classes Ci (1 ≤ i ≤ M, with M ≤ P), so as to reduce the error of the distribution of the set P of terms of the term base into N parts (P1, P2, ..., PN), where each part Pi is represented by the term ti that is taken as the concept ci, the error ε committed being such that

$$\varepsilon = \sum_{i=1}^{N} \varepsilon_{t_i}, \qquad \varepsilon_{t_i} = \sum_{t_j \in P_i} d^2(t_i, t_j),$$

where ε_ti is the error committed when the terms tj of a part Pi are replaced by ti.

In this case, the method may comprise the following steps:

(i) decomposing the set P of terms into two parts P1 and P2;

(ii) determining the two terms ti and tj of the set P that are farthest apart, corresponding to the largest distance Dij of the distance matrix T;

(iii) for each term tk of the set P, examining whether the distance Dki between the term tk and the term ti is smaller than the distance Dkj between the term tk and the term tj; if so, the term tk is assigned to the part P1, and otherwise to the part P2;

(iv) iterating step (i) until the desired number N of parts Pi is obtained, applying steps (ii) and (iii) at each iteration to the terms of the parts P1 and P2.

The method according to the invention can be more particularly characterized in that it comprises an optimization starting from the N disjoint parts {P1, P2, ..., PN} of the set P and from the N terms {t1, t2, ..., tN} that represent them, so as to reduce the error of decomposing the set P into N parts, and in that it comprises the following steps:

(i) calculating the centers of gravity Ci of the parts Pi,

(ii) calculating the errors

$$\varepsilon_{C_i} = \sum_{t_j \in P_i} d^2(C_i, t_j) \quad\text{and}\quad \varepsilon_{t_i} = \sum_{t_j \in P_i} d^2(t_i, t_j)$$

committed when the terms tj of the part Pi are replaced by Ci and by ti respectively,

(iii) comparing ε_ti and ε_Ci and replacing ti by Ci if ε_Ci < ε_ti,

(iv) calculating the new matrix T of distances between the terms ti of the term base and repeating the process of decomposing the set P of terms of the term base into N parts, unless a stop condition is satisfied, namely

$$\frac{\varepsilon c_t - \varepsilon c_{t+1}}{\varepsilon c_t} < \text{threshold},$$

where εc_t represents the error ε_C committed at time t.
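The farthest-pair decomposition of steps (i) to (iv) above can be sketched as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the Euclidean distance, the rule of always splitting the part with the largest diameter, and all function names (`euclidean`, `farthest_pair`, `bisect`, `partition`) are hypothetical choices that the text does not specify.

```python
from itertools import combinations


def euclidean(a, b):
    """Assumed distance d between two term vectors (the text leaves d open)."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5


def farthest_pair(terms, dist=euclidean):
    """Return (i, j, D_ij) for the two terms that are farthest apart."""
    return max(
        ((i, j, dist(terms[i], terms[j]))
         for i, j in combinations(range(len(terms)), 2)),
        key=lambda triple: triple[2],
    )


def bisect(terms, dist=euclidean):
    """Steps (ii)-(iii): assign each term to the closer of the two farthest terms."""
    i, j, _ = farthest_pair(terms, dist)
    p1, p2 = [], []
    for t in terms:
        (p1 if dist(t, terms[i]) < dist(t, terms[j]) else p2).append(t)
    return p1, p2


def partition(terms, n_parts, dist=euclidean):
    """Step (iv): keep bisecting until n_parts parts are obtained.

    Assumption: the part with the largest diameter is split next.
    """
    parts = [list(terms)]
    while len(parts) < n_parts:
        candidates = [(farthest_pair(p, dist)[2], k)
                      for k, p in enumerate(parts) if len(p) > 1]
        if not candidates:
            break  # every part is a single term
        diameter, k = max(candidates, key=lambda c: c[0])
        if diameter == 0:
            break  # remaining terms are identical, nothing left to separate
        parts.extend(bisect(parts.pop(k), dist))
    return parts
```

For example, `partition([(0, 0), (0, 1), (10, 0), (10, 1)], 2)` separates the two terms near the origin from the two terms near x = 10. The centroid-based optimization of the subsequent steps is not covered by this sketch.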

In order to facilitate searching for and identifying documents, the dictionary of concepts is structured by iteratively producing a navigation map: the set of concepts is first split into two subsets, and then one subset is selected at each iteration until the desired number of groups is obtained or until a stopping criterion is satisfied. The stopping criterion may be that the subsets obtained are all homogeneous, with a low standard deviation.

More particularly, during the structuring of the dictionary of concepts, navigation indicators are determined from a matrix M = [c1, c2, ..., cN] ∈ ℜ^(p×N) of the set C of concepts, where ci ∈ ℜ^p represents a concept of p values, according to the following steps:

(i) calculating a representative w of the matrix M,

(ii) calculating the covariance matrix M̃ between the elements of the matrix M and the representative w of the matrix M,

(iii) calculating a projection axis u for projecting the elements of the matrix M,

(iv) calculating the value pi = d(u, ci) − d(u, w) and decomposing the set of concepts C into two subsets C1 and C2 as follows: ci ∈ C1 if pi ≤ 0, and ci ∈ C2 if pi > 0,

(v) storing in the node associated with C the information {u, w, |p1|, p2}, where p1 is the maximum of all pi ≤ 0 and p2 is the minimum of all pi > 0, the set of information {u, w, |p1|, p2} constituting the navigation indicators in the concept dictionary.

According to a particular embodiment, the structural components of an image of the document and the complement of these structural components, constituted by the textural components of the image, are analyzed, and:

(a) during the analysis of the structural components of the image:

(a1) the boundary zones of the image structures are distributed into different classes according to the orientation of the local variation of intensity so as to define structural support elements (ESS) of the image, and

(a2) terms consisting of vectors describing the local and global properties of the structural support elements are constructed by statistical analysis,

(b) during the analysis of the textural components of the image:

(b1) a purely random component of the image is detected and parametrically characterized, (b2) a periodic component of the image is detected and parametrically characterized, (b3) a directional component of the image is detected and parametrically characterized,

(c) the set of descriptive elements of the image, constituted on the one hand by the terms describing the local and global properties of the structural support elements and on the other hand by the parameters of the parametric characterizations of the random, periodic, and directional components defining the textural components of the image, is grouped into a limited number of concepts, and

(d) a fingerprint is defined for each document from the occurrences, positions, and frequencies of said concepts.

Advantageously, the local properties of the structural support elements taken into account for the construction of terms comprise at least the type of support, chosen from a linear strip or a curved arc, the length and width of the support, the direction of the support, and the shape and statistical properties of the pixels constituting the support.

The global properties of the structural support elements taken into account for the construction of terms comprise at least the number of supports of each type and their spatial arrangement.

Preferably, during the analysis of the structural components of the image, a preliminary test for detecting the presence of at least one structure in the image is carried out and, if no structure is present, the method goes directly to the step of analyzing the textural components of the image.

Advantageously, to distribute the boundary zones of the image structures into different classes, starting from the digitized image defined by the set of pixels y(i, j), where (i, j) ∈ I × J with I and J respectively denoting the number of rows and the number of columns of the image, the vertical gradient image g_v(i, j) and the horizontal gradient image g_h(i, j), with (i, j) ∈ I × J, are calculated, and the image is partitioned according to the local orientation of its gradient into a finite number of equidistant classes, the image containing the orientation of the gradient being defined by the formula

$$O(i, j) = \arctan\!\left(\frac{g_h(i, j)}{g_v(i, j)}\right) \qquad (1)$$

the classes constituting support regions that may contain significant support elements are identified and, from the support regions, the significant support elements are determined and listed according to predetermined criteria.

According to a particular aspect of the invention, the shapes of an image of a document are analyzed according to the following steps:

(a) multiresolution analysis followed by decimation of the image is carried out,

(b) the image is defined in polar logarithmic space,

(c) representing the image or portion of the image concerned by its Fourier transform H,

(d) the Fourier transform H is characterized as follows: (d1) H is projected in several directions to obtain a set of vectors whose dimension is equal to the dimension of the projection movement, (d2) the statistical properties of each projection vector are computed, and

(e) the shape of the image is represented by a term ti consisting of the values of the statistical properties of each projection vector.

According to one particular aspect of the invention, when indexing a multimedia document comprising video signals, terms ti consisting of keyframes representing groups of consecutive homogeneous images are chosen, and concepts ci are determined by grouping terms ti.

To determine the keyframes constituting terms ti, a score vector VS is first constructed, comprising a set of elements VS(i) that express the difference or the similarity between the content of the image of index i and that of the image of index i−1, and the score vector VS is analyzed in order to determine the keyframes, which correspond to the maxima of the values of the elements VS(i) of the score vector VS.

More particularly, an image of index j is considered to be a keyframe if the value VS(j) of the corresponding element of the score vector VS is a maximum, if the value VS(j) is located between two minima minG and minD, and if the minimum M1 such that

$$M_1 = \min\bigl(|VS(j) - \mathrm{minG}|,\ |VS(j) - \mathrm{minD}|\bigr)$$

is greater than a given threshold.

Considering again the indexing of a multimedia document comprising audio components, the document is sampled and broken into frames, which are then grouped into clips, each of which is characterized by a term ti constituted by a parameter vector. A frame may comprise, for example, between about 512 and 2048 samples of the sampled audio document.

Advantageously, the parameters taken into account for the definition of the terms ti comprise temporal information corresponding to at least one of the following parameters: the energy of the frames of the audio signal, the standard deviation of the energies of the frames in the clips, the sound variation ratio, the low energy ratio, the oscillation rate around a predetermined value, the high oscillation rate around a predetermined value, the difference between the number of oscillation rates above and below the average oscillation rate of the frames of the clips, the variance of the oscillation rate, and the silent frame ratio.

Alternatively or additionally, the parameters taken into account for the definition of the terms ti advantageously comprise frequency information corresponding to at least one of the following parameters: the center of gravity of the frequency spectrum of the Fourier transform of the audio signal, the bandwidth of the audio signal, the ratio between the energy in a frequency band and the total energy in the entire frequency band of the sampled audio signal, the mean value of the variation of the spectrum between two adjacent frames in a clip, and the cutoff frequency of a clip.

More particularly, the parameters taken into account for the definition of the terms ti may further comprise at least the energy modulation at 4 Hz.
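The keyframe criterion stated above (a maximum of VS lying between a left minimum minG and a right minimum minD, with M1 above a threshold) can be sketched as follows. This is only an illustration under assumptions: the function name `keyframes`, the way the neighboring minima are located (by walking downhill from the maximum on each side), and the exclusion of the first and last elements are hypothetical choices, not details given in the text.

```python
def keyframes(vs, threshold):
    """Return the indices j of the score vector vs that satisfy the criterion.

    vs[j] must be a local maximum, and
    M1 = min(|vs[j] - minG|, |vs[j] - minD|) must exceed `threshold`,
    where minG and minD are the nearest local minima on the left and right.
    """
    n = len(vs)
    selected = []
    for j in range(1, n - 1):  # assumption: endpoints are never keyframes
        # Local maximum test; a flat plateau is reported once, at its left edge.
        if not (vs[j] > vs[j - 1] and vs[j] >= vs[j + 1]):
            continue
        left = j
        while left > 0 and vs[left - 1] <= vs[left]:
            left -= 1  # walk downhill to the left minimum (minG)
        right = j
        while right < n - 1 and vs[right + 1] <= vs[right]:
            right += 1  # walk downhill to the right minimum (minD)
        m1 = min(abs(vs[j] - vs[left]), abs(vs[j] - vs[right]))
        if m1 > threshold:
            selected.append(j)
    return selected
```

With `vs = [0, 1, 5, 1, 0.5, 0.8, 0.4, 4, 0]` and a threshold of 1, the small bump at index 5 is rejected because it rises only slightly above its neighboring minima, while the peaks at indices 2 and 7 are kept.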
Other features and advantages of the invention will emerge from the following description of particular embodiments, given by way of example, with reference to the accompanying drawings, in which:

- Figure 1 is a block diagram showing the process of producing a dictionary of concepts from a database, according to the invention;
- Figure 2 shows the principle of constructing a concept base from terms;
- Figure 3 is a block diagram showing the process of structuring a dictionary of concepts, according to the invention;
- Figure 4 shows the structuring of a fingerprint base implemented in the context of the method according to the invention;
- Figure 5 is a flowchart showing the various steps of constructing a fingerprint base;
- Figure 6 is a flowchart showing the various steps of document identification;
- Figure 7 is a flowchart showing the selection of a first list of responses;
- Figure 8 is a flowchart showing the various steps of a phase of indexing documents according to the method of the invention;
- Figure 9 is a flowchart showing the various steps of term extraction in the case of image processing;
- Figure 10 is a diagram summarizing the process of decomposing a regular and homogeneous image;
- Figures 11 to 13 show three examples of images containing different types of elements;
- Figures 14a to 14f respectively show an example of an original image, an example of an image after processing taking into account the modulus of the gradient, and four examples of processed images in which the boundary zones of the image are broken down;
- Figure 15a shows a first example of an image containing a directional element;
- Figure 15a1 is a 3D view of the spectrum of the image of Figure 15a;
- Figure 15b shows a second example of an image containing a directional element;
- Figure 15b1 is a Fourier modulus image of the image of Figure 15b;
- Figure 15c shows a third example of an image containing two directional elements;
- Figure 15c1 is a Fourier modulus image of the image of Figure 15c;
- Figure 16 illustrates projection directions for pairs of integers (α, β) in the context of calculating the discrete Fourier transform of an image;
- Figure 17 illustrates an example of a projection mechanism for the pair of integers (α_k, β_k) = (2, −1);
- Figure 18a1 shows an example of an image containing periodic components;
- Figure 18a2 shows the modulus image of the discrete Fourier transform of the image of Figure 18a1;
- Figure 18b1 shows an example of a synthetic image containing a periodic component;
- Figure 18b2 is a 3D view of the discrete Fourier transform of the image of Figure 18b1, showing a pair of symmetrical peaks;
- Figure 19 is a flowchart showing the various steps of processing an image with the establishment of a vector characterizing the spatial distribution of the iconic properties of the image;
- Figure 20 shows an example of partitioning an image;
- Figure 21 shows a rotation by 90° of the partitioned image of Figure 20 and the creation of a characteristic vector of this image;
- Figure 22 shows the decomposition of a sound signal into frames and clips;
- Figure 23a shows the variation of the energy of a speech signal;
- Figure 23b shows the variation of the energy of a music signal;
- Figure 24a shows the zero crossing rate of a speech signal;
- Figure 24b shows the zero crossing rate of a music signal;
- Figure 25a shows the center of gravity of the frequency spectrum of the short Fourier transform of a speech signal;
- Figure 25b shows the center of gravity of the frequency spectrum of the short Fourier transform of a music signal;
- Figure 26a shows the bandwidth of a speech signal;
- Figure 26b shows the bandwidth of a music signal;
- Figure 27a shows, for three frequency sub-bands 1, 2, 3, the ratio of the energy in each frequency sub-band to the total energy of the entire frequency band, for a speech signal;
- Figure 27b shows, for three frequency sub-bands 1, 2, 3, the ratio of the energy in each frequency sub-band to the total energy of the entire frequency band, for a music signal;
- Figure 28a shows the spectral flux of a speech signal;
- Figure 28b shows the spectral flux of a music signal;
- Figure 29 is a graph illustrating the definition of the cutoff frequency of a clip; and
- Figure 30 illustrates, for an audio signal, the modulation of energy around 4 Hz.
The general principle of the method of indexing multimedia documents according to the invention, which leads to the construction of a fingerprint base in which each indexed document is associated with a fingerprint of its own, is described first with reference to Figures 1 to 5.

From a multimedia document base 1, a first step 2 consists of identifying and extracting, for each document, terms ti constituted by vectors characterizing properties of the document to be indexed.

By way of example, the manner in which terms ti can be identified and extracted for an audio document is described with reference to Figures 22 to 30.

An audio document 140 is first decomposed into frames 160, which are subsequently grouped into clips 150, each of which is characterized by a term consisting of a vector of parameters (Figure 22). An audio document 140 is therefore characterized by a set of terms ti, which are stored in a term base 3 (Figure 1).

The audio documents from which characteristic vectors are extracted can be sampled, for example, at 22,050 Hz in order to avoid aliasing. The document is then divided into a set of frames whose number of samples per frame is set according to the type of file to be analyzed.

For an audio document that is rich in high frequencies and contains many variations, such as films, variety programs, or sports programs, the number of samples in a frame must be low, of the order of 512 samples for example. On the other hand, for a homogeneous audio document containing only speech or only music, for example, this number must be large, of the order of 2048 samples for example.

A clip of an audio document may be characterized by various parameters that are used to form the terms and that characterize temporal or frequency information. All or some of the parameters mentioned below can be used to form the parameter vectors constituting the terms that identify the successive clips of the sampled audio document.
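The decomposition of a sampled audio document into frames and clips described above might be sketched as follows. The non-overlapping framing, the dropping of an incomplete final frame, and the parameter `frames_per_clip` are assumptions made for illustration; the text fixes only the order of magnitude of the frame length (about 512 to 2048 samples).

```python
def split_frames(samples, frame_len=512):
    """Cut the sample sequence into consecutive, non-overlapping frames.

    Assumption: an incomplete trailing frame is discarded.
    """
    return [samples[k:k + frame_len]
            for k in range(0, len(samples) - frame_len + 1, frame_len)]


def group_clips(frames, frames_per_clip):
    """Group consecutive frames into clips (the last clip may be shorter)."""
    return [frames[k:k + frames_per_clip]
            for k in range(0, len(frames), frames_per_clip)]


# Hypothetical usage: 2050 samples, frames of 512 samples, 3 frames per clip.
samples = list(range(2050))
frames = split_frames(samples, frame_len=512)
clips = group_clips(frames, frames_per_clip=3)
```

Each clip would then be summarized by one parameter vector, which is the term associated with that clip.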
The energy of the frames of the audio signal constitutes a first parameter representing temporal information. The energy of the audio signal varies a great deal for speech, whereas it is rather stable for music. It thus makes it possible to discriminate speech from music and also to detect silences. The energy can be coupled with another temporal parameter such as the oscillation rate (TO) around a value, which may correspond for example to the zero crossing rate (TPZ). A low TO together with a high energy indicates a voiced sound, while a high TO indicates an unvoiced zone. Figure 23a shows a signal 141 that illustrates the variation of the energy in the case of a speech signal. Figure 23b shows a signal 142 that illustrates the variation of the energy in the case of a music signal. Let N be the number of samples in a frame; the volume or energy E(n) is defined by:

$$E(n) = \frac{1}{N} \sum_{i=0}^{N-1} S_n^2(i) \qquad (2)$$

where S_n(i) represents the value of sample i of the frame of index n of an audio signal.
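As a small illustration of formula (2), the energy (volume) of a frame is the mean of its squared samples. The function name `frame_energy` is hypothetical, and the frame is assumed to be a non-empty sequence of numeric samples.

```python
def frame_energy(frame):
    """Energy E(n) of one frame: mean of the squared sample values."""
    return sum(s * s for s in frame) / len(frame)


def frame_energies(frames):
    """Energy contour of a clip: one value per frame."""
    return [frame_energy(f) for f in frames]
```

For the frame `[1, -1, 2, -2]` this gives (1 + 1 + 4 + 4) / 4 = 2.5.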

Other parameters representative of temporal information can be deduced from the energy, for example:

- the standard deviation of the energies of the frames in the clips (also called EEC or VSTD), which is defined as the variance of the frame volumes in a clip normalized by the maximum frame volume of the clip,
- the sound variation ratio (RVS), which is the difference between the maximum and the minimum of the frame volumes of a clip divided by the maximum of the volumes of these frames,
- the low energy ratio (LER), which is the percentage of frames whose volume is below a threshold (fixed for example at 95% of the average volume of a clip).

Other parameters make it possible to characterize the temporal aspect of a clip, in particular the rate of oscillation around a predetermined value, which, when this predetermined value is zero, defines a zero crossing rate (TPZ). The TPZ can also be defined as the number of times the wave passes through zero:

$$Z(n) = \frac{1}{2}\left[\sum_{i=1}^{N-1} \bigl|\operatorname{sign}(S_n(i)) - \operatorname{sign}(S_n(i-1))\bigr|\right]\frac{f_s}{N} \qquad (3)$$

where S_n(i) is the value of sample i of frame n, N is the number of samples in a frame, and f_s is the sampling frequency.

This characteristic is frequently used for speech/music classification, since sudden variations of the TPZ indicate voiced/unvoiced alternation and therefore the presence of speech. For speech, the TPZ is low for voiced zones and very high for unvoiced zones, whereas for music the variations of the TPZ are very small. Figure 24a shows a curve 143 illustrating an example of TPZ for a speech signal. Figure 24b shows a curve 144 illustrating an example of TPZ for a music signal.

Another parameter characterizing the temporal aspect of a clip may be the high rate of oscillation around a predetermined value which, when this predetermined value is zero, defines a high zero crossing rate (HTPZ). The HTPZ can be defined as the ratio of the number of frames whose TPZ is more than α times, for example 1.5 times, the average TPZ of the clip (1 s):

$$HTPZ = \frac{1}{2N} \sum_{n=0}^{N-1} \bigl[\operatorname{sgn}\bigl(TPZ(n) - \alpha \cdot avTPZ\bigr) + 1\bigr] \qquad (4)$$

such that:

$$avTPZ = \frac{1}{N} \sum_{n=0}^{N-1} TPZ(n) \qquad (5)$$

with n: index of the frame; N: number of frames in a clip.

For speech segments, the clips are from 0 to 200 s with an HTPZ around 0.15. For music segments, on the other hand, the clips are from 200 to 350 s and the HTPZ varies around 0.05 and is generally almost zero. For environmental sound, the segments corresponding to the clips are from 351 to 450 s. The HTPZ is low for white noise and high for a deafening sound (a drum, for example).

It is also possible to define the parameter DTPZ, which is the difference between the number of TPZ values above and below the average TPZ of the frames of a clip, and the parameter VTPZ, which is the variance of the TPZ.

Another parameter characterizing the temporal aspect of a clip is the silent frame ratio (RFS), which is the percentage of non-silent frames in a clip. A frame is non-silent if its volume exceeds a certain threshold (10) and if the value of its TPZ is less than a threshold Tpz. The ratio of non-silent frames in a clip thus makes it possible to detect silence.

Other statistical properties of the TPZ can be used as characteristic parameters, such as: i) the third-order moment about the mean, ii) the number of TPZ values exceeding a certain threshold.
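The temporal parameters described above can be sketched in a few lines. This is a hedged illustration, not a definitive implementation: the function names are hypothetical, the zero crossing rate is scaled by f_s/N on the assumption that the sampling frequency and frame length enter the definition in that way, and the high-TPZ ratio simply counts the frames strictly above α times the clip average.

```python
def sign(x):
    """Sign of a sample: -1, 0 or 1."""
    return (x > 0) - (x < 0)


def zero_crossing_rate(frame, fs):
    """TPZ of one frame: half the summed sign changes, scaled by fs / N (assumed)."""
    n = len(frame)
    changes = sum(abs(sign(frame[i]) - sign(frame[i - 1])) for i in range(1, n))
    return 0.5 * changes * fs / n


def high_zcr_ratio(zcrs, alpha=1.5):
    """HTPZ: fraction of frames whose TPZ exceeds alpha times the clip average."""
    average = sum(zcrs) / len(zcrs)
    return sum(1 for z in zcrs if z > alpha * average) / len(zcrs)


def low_energy_ratio(volumes, fraction=0.95):
    """LER: fraction of frames whose volume is below `fraction` of the clip average."""
    average = sum(volumes) / len(volumes)
    return sum(1 for v in volumes if v < fraction * average) / len(volumes)


def sound_variation_ratio(volumes):
    """RVS: (max - min) of the frame volumes divided by their max."""
    return (max(volumes) - min(volumes)) / max(volumes)
```

The inputs `zcrs` and `volumes` are assumed to hold one value per frame of a clip, for example as produced by `zero_crossing_rate` and by a per-frame energy computation.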

The parameters taken into account for the definition of the terms ti may also include frequency information, which relies on calculating the Fast Fourier Transform (FFT) of the audio signal. Thus, a parameter called the spectral centroid (CS) can be defined as the center of gravity of the frequency spectrum of the Short Fourier Transform (STFT) of the audio signal:

$$CS(n) = \frac{\sum_i i \, S_n(i)}{\sum_i S_n(i)} \qquad (6)$$

where S_n(i) is the spectral power of frame i of clip n.

The CS parameter is high for music because the pitches are spread over a wider range than for speech (usually 6 octaves for music and 3 for speech). It is related to the perceived brightness of the sound, and it is an important perceptual attribute for characterizing timbre. Figure 25a shows a curve 145 illustrating an example of CS for a speech signal. Figure 25b shows a curve 146 illustrating an example of CS for a music signal. Another parameter is the bandwidth LB, which can be calculated from the variance of the previous parameter CS(n).
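A possible reading of the spectral centroid and of the bandwidth derived from it is sketched below. It is an assumption-laden illustration: `power` is taken to be a list of non-negative spectral power values indexed by position, the centroid is the power-weighted mean index, and the bandwidth is taken as the square root of the power-weighted variance around that centroid, which is one common convention and not necessarily the exact formula of the original text.

```python
def spectral_centroid(power):
    """Center of gravity of a power spectrum (power-weighted mean index)."""
    total = sum(power)
    return sum(i * p for i, p in enumerate(power)) / total


def bandwidth(power):
    """Assumed definition: square root of the power-weighted variance
    of the indices around the spectral centroid."""
    centroid = spectral_centroid(power)
    total = sum(power)
    variance = sum((i - centroid) ** 2 * p for i, p in enumerate(power)) / total
    return variance ** 0.5
```

For the spectrum `[0, 1, 0, 1, 0]` the centroid is 2.0 and the bandwidth is 1.0. Both functions assume that the total power is non-zero.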

The bandwidth LB is important for both music and speech. Figure 26a shows a curve 147 illustrating an example of the bandwidth of a speech signal. Figure 26b shows a curve 148 illustrating an example of the bandwidth of a music signal. Another useful parameter is the ratio ERSB between the energy in a frequency sub-band i and the total energy in the entire frequency band of the sampled audio signal. Considering the perceptual properties of the human ear, the frequency band is divided into four sub-bands corresponding to the cochlear filters. When the sampling frequency is 22,050 Hz, the frequency bands are: 0-630 Hz, 630-1720 Hz, 1720-4400 Hz, and 4400-11,025 Hz. For each of these bands, the quantity ERSBi is calculated, which is the ratio of the energy in that band to the energy in the entire frequency band. Figure 27a shows three curves 151, 152, 153 illustrating, for three frequency sub-bands 1, 2, 3, the ratio of the energy in each frequency sub-band to the total energy of the entire frequency band, for an example of a speech signal. Figure 27b shows three curves 154, 155, 156 illustrating, for three frequency sub-bands 1, 2, 3, the ratio of the energy in each frequency sub-band to the total energy of the entire frequency band, for an example of a music signal. Another parameter is the spectral flux, which is defined as the mean value of the variation of the spectrum between two adjacent frames in a clip:

$$FS(n) = \frac{1}{N} \sum_{i=1}^{N-1} \bigl[\log(S_n(i) + \delta) - \log(S_n(i-1) + \delta)\bigr]^2 \qquad (8)$$

where δ: a constant of low value,

S_n(i): spectral power of frame i of clip n.

The spectral flux of speech is generally greater than that of music, and that of environmental sound is the greatest; it varies considerably in comparison with the other two signals. Figure 28a shows a curve 157 illustrating the spectral flux of an example of a speech signal. Figure 28b shows a curve 158 illustrating the spectral flux of an example of a music signal.

Another useful parameter is the cutoff frequency of a clip (FCC). Figure 29 shows a curve 149 illustrating the amplitude spectrum as a function of the frequency fe, and the cutoff frequency fc, which is the frequency below which 95% of the energy of the spectrum (the spectral power) is concentrated. To determine the cutoff frequency of the clip, the Fourier transform of the clip, DS(n), is calculated:

$$DS(n) = \sum_{i=0}^{N-1} S_n^2(i) \qquad (9)$$

The frequency fc is then determined by:

$$\sum_{i=0}^{f_c} S_n^2(i) \ge 0.95 \times DS \qquad (10)$$

and

$$\sum_{i=0}^{f_c - 1} S_n^2(i) < 0.95 \times DS \qquad (11)$$

The FCC is higher for an unvoiced sound (high-frequency sound) than for a voiced sound (presence of speech, where the power is concentrated in the low frequencies). This measurement makes it possible to characterize the voiced/unvoiced alternations of speech, because this value is low for clips containing only music.

Other parameters can also be taken into account for the definition of the terms ti of an audio document, such as the modulation of energy around 4 Hz, which is a parameter resulting from both a frequency analysis and a temporal analysis. The energy modulation at 4 Hz (4ME) is calculated from the volume contour, according to the following formula:

where S_n(i): spectral power of frame i of clip n,

W(j): triangular window centered at 4 Hz,

T: width of a clip.

Speech has a higher 4ME than music because, for speech, syllable changes occur at around 4 Hz. A syllable is indeed a combination of a zone of low energy (consonant) and a zone of high energy (vowel). Figure 30 shows a curve 161 illustrating an example of an audio signal and a curve 162 showing, for this signal, the modulation of the energy around 4 Hz.

The case of multimedia documents comprising audio components has been described above.

In the case of the indexing of multimedia documents comprising video signals, it is possible to choose terms ti constituted by keyframes representing groups of consecutive homogeneous images. The terms ti can in turn represent, for example, the dominant colors, the textural properties, or the structures of the dominant zones of the keyframes of the video document.

In general, in the case of images, which is developed in more detail later, the terms can represent the dominant colors, the textural properties, or the structures of the dominant zones of the image. Several methods can be implemented alternatively or cumulatively, over the entire image as well as over portions of the image, to determine the terms ti that characterize the image.

In the case of a document containing text, the terms ti may consist of spoken or written words, numbers, and other identifiers consisting of combinations of characters (for example, combinations of letters and digits).

Considering again the indexing of a multimedia document comprising video signals, terms ti constituted by keyframes representing groups of consecutive homogeneous images are chosen, and concepts ci are determined by grouping the terms ti. The detection of keyframes is based on grouping the images of a video document into groups that each contain only homogeneous images. From each of the groups, one or more images (called keyframes) representing the video document are extracted.
The grouping of the images of the video document is based on the production of a score vector, called VS, representing the content of the video. It characterizes the variation between consecutive images of the video: the element VS_i measures the difference between the content of the image of index i and that of the image of index i-1. VS_i is equal to zero when the contents of the images Im_i and Im_{i-1} are identical, and it is large when the difference between the two contents is large. To calculate the signal VS, the three bands of each RGB image Im_i of index i of the video are summed to form a single image called TR_i. The image TR_i is then decomposed into several frequency bands so as to keep only the low-frequency component TRB_i. Two mirror filters (a low-pass filter PB and a high-pass filter PH) are used, applied successively to the rows and to the columns of the image. Two types of filter are considered: the Haar wavelet and the filter whose algorithm is as follows:

Row scan

From TR_k the low image is produced. For each point a_{2×i,j} of the image TR, calculate the point b_{i,j} of the low-frequency image: b_{i,j} takes the median value of a_{2×i,j} and a_{2×i,j+1}.

Column scan

From the two low images the image TRB_k is produced. For each point b_{i,2×j} of the image TR, calculate the point bb_{i,j} of the low-frequency image: bb_{i,j} takes the median value of b_{i,2×j} and b_{i,2×j+1}.

The row and column scans are applied as many times as desired. The number of iterations n depends on the resolution of the images in the video. For images of size 512x512, n can be set to three.

The result image TRB_i is projected in several directions to obtain a set of vectors V_k, where k is the projection angle (element j of V0, the vector obtained by the horizontal projection of the image, is equal to the sum of all the points of row j of the image).
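The band summation, low-frequency reduction and projection steps above can be sketched as follows. This is a minimal illustration, not the patented filter: it assumes the Haar variant, implemented as unnormalized averaging of adjacent pairs, and it only covers the horizontal and vertical projections. The function names (`sum_bands`, `haar_lowpass`, `low_frequency`, `horizontal_projection`, `vertical_projection`) are hypothetical.

```python
from typing import List

Image = List[List[float]]


def sum_bands(rgb: List[Image]) -> Image:
    """Sum the three colour bands of one frame into a single image (TR)."""
    r, g, b = rgb
    return [[r[y][x] + g[y][x] + b[y][x] for x in range(len(r[0]))]
            for y in range(len(r))]


def haar_lowpass(img: Image) -> Image:
    """One row pass then one column pass; each pass averages adjacent pairs
    and therefore halves the corresponding dimension (assumed Haar-style)."""
    rows = [[(row[2 * x] + row[2 * x + 1]) / 2.0 for x in range(len(row) // 2)]
            for row in img]
    return [[(rows[2 * y][x] + rows[2 * y + 1][x]) / 2.0
             for x in range(len(rows[0]))]
            for y in range(len(rows) // 2)]


def low_frequency(img: Image, n: int) -> Image:
    """Apply the row and column passes n times to obtain TRB."""
    for _ in range(n):
        img = haar_lowpass(img)
    return img


def horizontal_projection(img: Image) -> List[float]:
    """V0: element j is the sum of all the points of row j."""
    return [sum(row) for row in img]


def vertical_projection(img: Image) -> List[float]:
    """Projection at 90 degrees: element j is the sum of column j."""
    return [sum(col) for col in zip(*img)]
```

Projections at other angles would need an interpolation or line-tracing rule that the text does not specify, so they are left out of this sketch.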

The direction vectors of the image TRB_i are compared with the direction vectors of TRB_{i-1} to obtain a score i which measures the similarity between these two images. This score is obtained by averaging the distances between vectors of the same direction: for each k, the distance between the vector V_k of image i and the vector V_k of image i-1 is calculated, and the mean of all these distances is then taken. The set of all the scores constitutes the score vector VS: element i of VS measures the similarity between the image TRB_i and the image TRB_{i-1}. The vector VS is smoothed to eliminate irregularities due to the noise generated when handling the video.

An example of grouping the images and extracting the keyframes is described below. The vector VS is analyzed to determine the keyframes, which correspond to the maxima of the VS values. An image of index j is considered a keyframe if the value VS(j) is a maximum, if VS(j) is located between two minima minG (left minimum) and minD (right minimum), and if the minimum M1 such that

$M1 = \min(|VS(j) - minG|, |VS(j) - minD|)$

is greater than a given threshold.

To detect the keyframes, minG is initialized with VS(0) and the vector VS is then traversed from left to right. At each step, the index j corresponding to the maximum value located between two minima (minG and minD) is determined and then, depending on the result of the equation defining M1, j is or is not considered to be the index of a keyframe. It is possible to take a group of several neighboring keyframes, for example the keyframes of indices j-1, j and j+1.

Three cases arise if the minimum of the two slopes, defined by the two minima (minG and minD) and the maximum value, is not greater than the threshold:

i) If |VS(j) - minG| is below the threshold and minG does not correspond to VS(0), the maximum VS(j) is ignored and minD becomes minG.

ii) If |VS(j) - minG| is greater than the threshold and |VS(j) - minD| is below the threshold, minG and the maximum VS(j) are kept and minD is ignored, unless the closest maximum to the right of minD is greater than a threshold. In that case, minD is also kept and j is declared to be the index of a keyframe. When minD is ignored, minD takes the value of the closest minimum located to the right of minD.

iii) If both slopes are below the threshold, minG is kept and minD and j are ignored.

After a keyframe has been selected, the process is iterated. At each iteration minD becomes minG.

Referring back to Figure 1, starting from a term base 3 comprising P terms, a step 4 processes the terms t_j and groups them into concepts c_i (Figure 2) to be stored in a concept dictionary 5. The aim here is to develop a set of signatures characterizing a class of documents. Signatures are descriptors that, for example in the case of an image, represent color, shape and texture. A document can then be characterized and represented by the concepts of the dictionary.
A fingerprint of a document can then be formed by the signature vectors of each concept of the dictionary 5. The signature vector is constituted by the documents in which the concept c_i is present, together with the positions and the weight of this concept in each document. The terms t_j extracted from a document base 1 are stored in a term base 3 and processed in a module 4 for extracting concepts c_i, which are themselves grouped together in a concept dictionary 5. Figure 2 illustrates the process of constructing a base of concepts c_i (1 ≤ i ≤ m) from terms t_j (1 ≤ j ≤ n) having similarity scores w_ij.

The concept dictionary production module receives as input the set P of the terms of the base 3, and the desired maximum number N of concepts is set by the user. Each concept c_i is designed to group together all the terms that are neighbors from the point of view of their characteristics. To produce the concept dictionary, the distance matrix T between the terms of the base 3 is first calculated; this matrix is used to create a partition whose cardinality is equal to the desired number N of concepts. The creation of the concept dictionary is carried out in two phases:

- decomposition of P into N parts, $P = P_1 \cup P_2 \cup \ldots \cup P_N$;

- an optimization process for the partition that decomposes P into M classes, $P = C_1 \cup C_2 \cup \ldots \cup C_M$, with M less than or equal to P.

The optimization process aims at reducing the error of the distribution of P into N parts {P_1, P_2, ..., P_N}, where each part P_i is represented by the term t_i which will be taken as a concept. The error committed is then equal to the following expression:

$\varepsilon = \sum_{i=1}^{N} \varepsilon_{t_i}, \quad \varepsilon_{t_i} = \sum_{t_j \in P_i} d^2(t_i, t_j)$

where $\varepsilon_{t_i}$ is the error committed when the terms t_j of P_i are replaced by t_i.

P can be broken down into N parts in such a way that the terms that are farthest apart are in separate parts and the terms that are close together are in the same part. Step 1, the decomposition of the set of terms P into two parts P_1 and P_2, is described first:

(a) The two most distant terms t_i and t_j of P are determined; they correspond to the greatest distance D_ij of the matrix T.

(b) For each term t_k of P, t_k is assigned to P_1 if the distance D_ki is smaller than the distance D_kj, and to P_2 otherwise.

Step 1 is iterated until the desired number of parts is obtained, and at each iteration steps (a) and (b) are applied to the terms of the set P_1 and of the set P_2.

The optimization phase is now described. The starting point of the optimization process is the set of N disjoint parts {P_1, P_2, ..., P_N} of P together with the N terms {t_1, t_2, ..., t_N} that represent them; the process is used to reduce the error of the decomposition of P into the parts {P_1, P_2, ..., P_N}. The centers of gravity C_i of the parts P_i are calculated first. The error $\varepsilon c_i = \sum_{t_j \in P_i} d^2(C_i, t_j)$ is then calculated and compared with $\varepsilon_{t_i}$, and t_i is replaced by C_i if $\varepsilon c_i$ is lower than $\varepsilon_{t_i}$. Then, after the new matrix T has been calculated, and if convergence has not been reached, the decomposition is performed again. The stop condition is defined by a threshold of the order of $10^{-3}$, $\varepsilon c_t$ being the error committed at time t, where t denotes the iteration.

A matrix T of distances between terms is presented below, where D_ij denotes the distance between the term t_i and the term t_j.
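The decomposition step (a)/(b) and the error comparison used by the optimization phase can be sketched as follows. This is only an illustration under stated assumptions: terms are plain tuples of floats, the distance is Euclidean, and, to reach an arbitrary number N of parts, the part containing the most distant pair is split first (the text does not say which part to split next). The names `dist2`, `diameter`, `split_in_two`, `decompose`, `centroid` and `part_error` are hypothetical.

```python
from itertools import combinations
from typing import List, Tuple

Term = Tuple[float, ...]


def dist2(a: Term, b: Term) -> float:
    """Squared Euclidean distance d^2 between two terms (assumed metric)."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def diameter(part: List[Term]) -> float:
    """Largest squared distance inside a part (0 for fewer than two terms)."""
    return max((dist2(a, b) for a, b in combinations(part, 2)), default=0.0)


def split_in_two(part: List[Term]) -> Tuple[List[Term], List[Term]]:
    # (a) find the two most distant terms t_i and t_j
    ti, tj = max(combinations(part, 2), key=lambda pair: dist2(pair[0], pair[1]))
    # (b) assign each term t_k to the side of the closer of the two
    p1: List[Term] = []
    p2: List[Term] = []
    for tk in part:
        (p1 if dist2(tk, ti) < dist2(tk, tj) else p2).append(tk)
    return p1, p2


def decompose(terms: List[Term], n: int) -> List[List[Term]]:
    """Repeat step 1 until n parts exist or no part can be split further."""
    parts = [list(terms)]
    while len(parts) < n:
        idx = max(range(len(parts)), key=lambda k: diameter(parts[k]))
        if diameter(parts[idx]) == 0.0:
            break  # every remaining part holds identical terms
        parts.extend(split_in_two(parts.pop(idx)))
    return parts


def centroid(part: List[Term]) -> Term:
    """Center of gravity C_i of a part."""
    return tuple(sum(col) / len(part) for col in zip(*part))


def part_error(representative: Term, part: List[Term]) -> float:
    """Error committed when every term of the part is replaced by its representative."""
    return sum(dist2(representative, t) for t in part)
```

In the optimization phase, a representative term t_i would be replaced by `centroid(part)` whenever `part_error(centroid(part), part)` is lower than `part_error(t_i, part)`.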

FIG. 3 illustrates, in the case of multimedia documents of various contents, an example of structuring of the concept dictionary 5. In order to facilitate navigation inside the dictionary 5 and to determine rapidly, during an identification phase, the concept closest to a given term, the dictionary 5 is analyzed and a navigation map 9 of the dictionary is established. The navigation map 9 is produced iteratively. At each iteration, the set of concepts is first split into two subsets, and one subset is then selected at each iteration until the desired number of groups is obtained or until the stop criterion is satisfied. This stop criterion can be, for example, that the subsets obtained are all homogeneous, with a low standard deviation. The final result is a binary tree in which the leaves contain the concepts of the dictionary and the nodes contain the information needed to traverse the tree during the identification phase of a document.

An example of a module 6 for distributing a set of concepts is described below. The set of concepts C is represented as a matrix $M = [c_1, c_2, \ldots, c_N] \in \Re^{p \times N}$, with $c_i \in \Re^p$, where c_i represents a concept of p values. Different methods are possible for obtaining an axial distribution. In this case, the center of gravity C is calculated first, together with the axis used to split the set into two subsets. The processing steps are as follows:

Step 1: calculate a representative of the matrix M, such as the centroid w of the matrix M:

$w = \frac{1}{N} \sum_{i=1}^{N} c_i$ (13)

Step 2: calculate the covariance matrix $\tilde{M}$ between the elements of the matrix M and the representative of the matrix M with, in the particular case above,

$\tilde{M} = M - w e$, where $e = [1, 1, \ldots, 1]$ (14)

Step 3: calculate a projection axis for the elements of the matrix M, for example the eigenvector u associated with the largest eigenvalue of the covariance matrix.

Step 4: calculate the value $p_i = u^T (c_i - w)$ and split the set of concepts C into two subsets C1 and C2 as follows: $c_i \in C1$ if $p_i \le 0$, and $c_i \in C2$ if $p_i > 0$.

The information stored in the node associated with C is {u, w, |p1|, p2}, where p1 is the maximum of all p_i ≤ 0 and p2 is the minimum of all p_i > 0.
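Steps 1 to 4 and the node record {u, w, |p1|, p2} can be sketched as follows. This is a minimal illustration under stated assumptions: concepts are lists of floats, the unnormalized scatter matrix of the centered concepts is used in place of the covariance matrix (it has the same eigenvectors), and the dominant eigenvector is approximated by power iteration from an arbitrary start vector. The names `centroid`, `principal_axis` and `split_node` are hypothetical.

```python
import math
from typing import Dict, List

Vector = List[float]


def centroid(concepts: List[Vector]) -> Vector:
    """Step 1: centroid w of the concept vectors."""
    n = len(concepts)
    return [sum(c[k] for c in concepts) / n for k in range(len(concepts[0]))]


def principal_axis(centered: List[Vector], iterations: int = 200) -> Vector:
    """Steps 2-3: dominant eigenvector of the scatter matrix, by power iteration."""
    p = len(centered[0])
    scatter = [[sum(c[a] * c[b] for c in centered) for b in range(p)]
               for a in range(p)]
    # Arbitrary asymmetric start vector (assumption); the iteration only fails
    # if this vector happens to be orthogonal to the dominant eigenvector.
    u = [1.0 / (k + 1) for k in range(p)]
    for _ in range(iterations):
        v = [sum(scatter[a][b] * u[b] for b in range(p)) for a in range(p)]
        norm = math.sqrt(sum(x * x for x in v))
        if norm == 0.0:
            break  # all concepts coincide with the centroid
        u = [x / norm for x in v]
    return u


def split_node(concepts: List[Vector]) -> Dict[str, object]:
    """Step 4: project on u, split at zero, and build the node record."""
    w = centroid(concepts)
    centered = [[c[k] - w[k] for k in range(len(w))] for c in concepts]
    u = principal_axis(centered)
    proj = [sum(uk * ck for uk, ck in zip(u, c)) for c in centered]
    c1 = [c for c, p in zip(concepts, proj) if p <= 0]
    c2 = [c for c, p in zip(concepts, proj) if p > 0]
    p1 = max((p for p in proj if p <= 0), default=0.0)
    p2 = min((p for p in proj if p > 0), default=0.0)
    return {"u": u, "w": w, "abs_p1": abs(p1), "p2": p2, "C1": c1, "C2": c2}
```

Applying `split_node` recursively to `C1` and `C2` would yield the binary tree of the navigation map, with the concepts themselves at the leaves.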

The set {u, w, |p1|, p2} constitutes the navigation indicators in the concept dictionary. Indeed, to determine, during the identification phase for example, the concept closest to a term t_i, the value $p_{ti} = u^T (t_i - w)$ is calculated and, depending on this value, either the node associated with C1 or the node associated with C2 is selected. The process is iterated until one of the leaves of the tree has been reached.

A singularity detector module 8 may be associated with the module 6 for distributing the concepts. This singularity detector makes it possible to select the set C_i to be broken down. One of the possible methods is to select the least compact set.

Figures 4 and 5 illustrate the indexing of a document or of a document base and the construction of a fingerprint base 10. The fingerprint base 10 consists of the set of concepts representing the terms of the documents to be protected. Each concept C_i of the fingerprint base 10 is associated with a fingerprint 11, 12, 13 constituted by a set of information such as the number of terms in the documents where the concept is present. For each of these documents, a fingerprint 11a, 11b, 11c is recorded, including the index of the document (which refers to the address of the document), the number of terms, the number of occurrences of the concept (frequency), the score, and the concepts that are its neighbors in the document. The score is a mean value of the similarity measures between the concept and the terms of the document that are closest to the concept. The index of a given document, which refers to the address of this document, is stored in a base 14 of the addresses of the protected documents.

The process of generating the fingerprints or signatures of the documents to be indexed is illustrated in FIG. 5. When a document is recorded, the relevant terms of the document are extracted (step 21) and the concept dictionary is taken into account (step 22).
Each of the terms t_j of the document is projected into the space of the concept dictionary to determine the concept c_i representing the term t_j (step 23). The fingerprint of the concept c_i is then updated (step 24). This update depends on whether or not the concept has already been encountered, that is to say whether it is present in documents that are already recorded. If the concept c_i is not yet present in the base, a new entry is created in the base (an entry in the base corresponds to an object whose elements are objects containing the signature of the concept in the documents where this concept is present). The entry created is initialized with the signature of the concept. The signature of a concept in a document is embodied mainly by the following information: document address, NbTerms, Frequency, Neighbor concepts and Score. If the concept c_i already exists in the base, its signature in the document, composed of (document address, NbTerms, Frequency, Neighbor concepts and Score), is added to the entry associated with the concept. When the fingerprint base has been constructed (step 25), it is recorded (step 26).

Figure 6 illustrates a process of identifying a document that is implemented on an online search platform. The purpose of identifying a document is to determine whether a document submitted as a question is a reuse of a document of the base. Identification is based on measuring the similarity between documents. The goal is to identify documents containing protected elements. The reuse can be total or partial. In the latter case, the copied element has undergone modifications such as: deleting sentences in a text, deleting a pattern in an image, deleting a shot or a sequence in a video document, changing the order of the terms, or substituting other terms for terms in a text.

After a document to be identified has been presented (step 31), the terms of this document are extracted (step 32).
In connection with the fingerprint base (step 25), the concepts calculated from the terms extracted from the question are matched with the concepts of the base (step 33) in order to establish a list of documents whose contents are similar to the contents of the question document. The list is established as follows.

Let p_dj denote the degree of similarity of the document d_j to the question document, with 1 ≤ j ≤ N, where N is the number of documents in the reference base. All the p_dj are initialized to zero. For each term t_i of the question provided in step 331 (FIG. 7), the concept C_i representing it is determined (step 332). For each document d_j in which the concept is present, its p_dj is updated as follows:

$p_{dj} = p_{dj} + f(\text{frequency}, \text{score})$

Several functions f can be used, for example f(frequency, score) = frequency × score, where frequency is the number of occurrences of the concept C_i in the document d_j and score is the mean of the similarity scores of the terms of the document d_j with the concept C_i. The p_dj are sorted and those higher than a given threshold are retained (step 333). The responses are then confirmed and validated (step 34).

Confirmation of the responses: the list of responses is filtered in order to keep only the most relevant ones. The filtering used is based on the correlation between the terms of the question and those of each response.

Validation: this step keeps only the responses for which there is great certainty that content has been reused. In this step the responses are filtered by taking into account the algebraic and topological properties of the concepts inside a document: the neighborhood relations of the question document must be respected in the response documents, that is to say that two concepts that are neighbors in the question document must be neighbors in the response document.

The list of response documents is then supplied (step 35).

The case of multimedia documents containing images is now considered more particularly.
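The fingerprint base update (steps 23 to 24) and the scoring loop of steps 331 to 333 described above can be sketched together as follows. This is an illustration under stated assumptions: the base is a plain dictionary keyed by concept identifier, each signature is reduced to the pair (frequency, score), the mapping from each question term to its concept is assumed to be already done, and f defaults to frequency × score as in the example given in the text. The names `FingerprintBase`, `add_signature` and `rank_documents` are hypothetical.

```python
from collections import defaultdict
from typing import Callable, Dict, Hashable, Iterable, List, Tuple

# concept id -> {document id -> (frequency, score)}
FingerprintBase = Dict[Hashable, Dict[str, Tuple[int, float]]]


def add_signature(base: FingerprintBase, concept: Hashable, doc: str,
                  frequency: int, score: float) -> None:
    """Create the concept entry if needed, then record its signature for doc."""
    base.setdefault(concept, {})[doc] = (frequency, score)


def rank_documents(
    base: FingerprintBase,
    question_concepts: Iterable[Hashable],
    threshold: float,
    f: Callable[[int, float], float] = lambda frequency, score: frequency * score,
) -> List[Tuple[str, float]]:
    """Accumulate p_dj over the question concepts and keep those above threshold."""
    p: Dict[str, float] = defaultdict(float)  # every p_dj starts at zero
    for concept in question_concepts:         # one concept per question term
        for doc, (frequency, score) in base.get(concept, {}).items():
            p[doc] += f(frequency, score)
    ranked = sorted(p.items(), key=lambda item: item[1], reverse=True)
    return [(doc, value) for doc, value in ranked if value > threshold]
```

The confirmation and validation filters of step 34 (term correlation and neighborhood consistency) would then be applied to the returned list; they are not covered by this sketch.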
In particular, for the construction of the fingerprint base, which serves as a tool for identifying a document, fast and effective image identification methods are described which take into account all the relevant information contained in the images, ranging from the characterization of the structures or objects that make them up, to that of the textured zones and of the background color. The objects of the image are identified by producing a table summarizing various statistics computed on information about the boundary zones of the objects and on information about the neighborhoods of these boundary zones. The textured zones can be characterized by a very fine description, both spatial and spectral, of the texture according to three fundamental characteristics, namely its periodicity, its overall orientation and the random appearance of its pattern. The texture is here treated as a realization of a two-dimensional random process. The characterization of color is an important part of the method. It can be used as a first sort of similar responses based on color, or as a final decision made to refine the search.

In the first part of the fingerprint construction phase, account is taken of information classified in the form of components belonging to two main categories:

- the so-called structural components, which describe how the eye perceives an object that may be isolated or a set of objects placed in a spatial arrangement (images 81 and 82 of FIGS. 11 and 12);

- the so-called textural components, which complement the structural components and reflect the regularity or the homogeneity of the texture patterns (images 82 and 83 of FIGS. 12 and 13).

Figure 11 thus shows an image 81 containing structural elements and having no texture patterns. Figure 12 shows an image 82 containing structural elements and a textured background. Figure 13 shows an image 83 without structural elements but entirely textured.
As noted above, during the fingerprint construction phase, each document of the document base is analyzed to extract the relevant information from it. This information is then listed and analyzed. The analysis follows a series of procedures that can be summarized in three steps:

- extraction, for each document, of predefined characteristics and storage of this information in a vector called a term;

- grouping into a concept of all the terms that are "neighbors" from the point of view of their characteristics, which makes the search more concise;

- construction of a fingerprint that characterizes the document by a small number of entities. Each document is thus associated with a fingerprint of its own.

Figure 8 illustrates the case of indexing an image document 52 contained in a previously recorded image base 51, in order to characterize this image 52 by a finite number of parameters that can easily be stored and manipulated later. Step 53 extracts terms from the document to be searched, which are stored in a buffer memory (step 54). In step 55, a projection into the space of the terms of the reference base is carried out. In step 56, a vector description giving the relevance values of the terms in the document to be searched is produced. Step 57 consists of distributing the terms into N groups 58 of concepts. Step 59 consists of projecting the concepts of each group 58 into space to obtain N partitions 61. Finally, an orthogonal projection 62 leads to N sets 63 of reduced vector descriptions.

During a subsequent search phase, following a request made by a user, for example the identification of a question image, all the multimedia documents that are similar to or that respond to this request are searched for. To do this, as mentioned above, the terms of the question document are calculated and compared with the concepts of the base in order to deduce which document or documents of the base are similar to the question document.
The phase of constructing the terms of an image is described in more detail below. This phase usefully implements the characterization of the structural supports of the image. Structural supports are the elements that make up the scene of the image. The most significant are those that delimit the objects of the scene, because they characterize the different shapes that are perceived when any image is observed. This step concerns the extraction of these structural supports. It consists of dismantling the boundary zones of the image objects, which are characterized by locations between two zones where strong variations in intensity are observed. This dismantling is carried out by a process which consists of distributing these boundary zones among different "classes" according to the local orientation of the gradient of the image (the orientation of the local variation in intensity). This results in a multitude of small elements called "Structural Support Elements" (ESS).

Each ESS that actually belongs to a contour of a scene is characterized by a similarity in the local orientation of its gradient. This is a first step that aims to list all the structural support elements of the image. The following approach then proceeds from these ESS, namely the construction of terms describing the local and global properties of the ESS. The information extracted from each support is considered to constitute local properties. Two types of support can be distinguished: straight line elements (EDR) and curved arc elements (EAC). The straight line elements EDR are characterized by the following local properties:

- the dimensions (length, width);
- the main direction (slope);
- the statistical properties of the pixels constituting the support (mean energy value, moments);
- neighborhood information (local Fourier transform).

The curved arc elements EAC are characterized in the same way as above, with the addition of the curvature of the arcs. Global properties include statistics such as the number of supports of each type and their spatial arrangement (geometric associations between supports: connectivity, left, right, middle, ...). In summary, for a given image, the relevant information extracted from the objects that make it up is grouped together in Table 1.

Table 1

The phase of constructing the terms of an image also implements the characterization of the relevant textural information of the image. The information coming from the texture of the image is divided according to three visual aspects of the image:

- the random aspect (such as an image of fine sand, or of grass), where no particular arrangement can be detected;
- the periodic aspect (such as a jacquard sweater), where a repetition of patterns (a pixel or a group of pixels) is observed; and
- the directional aspect, where the patterns generally tend to be oriented towards one or more preferred directions.

This information is obtained by approximating the image by parametric models or representations. Each aspect is taken into account through its spatial and spectral representations, which constitute the relevant information of this part of the image. The periodicity and the orientation are characterized by the spectral supports, whereas the random aspect is expressed by estimating the parameters of a two-dimensional autoregressive model. Once all the relevant information has been extracted, the terms of the textures can be structured.

Table 2

Finally, the phase of constructing the terms of an image can also implement the characterization of the color of the image. Color is often represented by color histograms, which are invariant to rotation and robust against occlusion and changes in camera viewpoint. Color quantization can be done in the RGB (Red, Green, Blue) space, the HSV (Hue, Saturation, Value) space, or the LUV space. However, indexing by color histograms has shown its limits, because it gives only global information about the image, and during indexing it is possible to find images that have the same color histogram but are completely different. Many authors propose color histograms that integrate spatial information. This consists, for example, in distinguishing coherent pixels from incoherent pixels: a pixel is coherent if it belongs to a fairly wide region grouping identical pixels, and it is classified as incoherent if it is part of a region of reduced size. A method of characterizing the spatial distribution of the constituents of the image (for example the color) which is less costly in computation time than the methods mentioned above, and which is robust to rotation and translation, is described below.

The various characteristics extracted from the structural support elements, the parameters of the periodic, directional and random components of the texture field, and the parameters of the spatial distribution of the constituents of the image constitute the terms that can be used to describe the content of a document. These terms are grouped into concepts in order to reduce the useful information of a document. The occurrences of these concepts, together with their positions and frequencies, constitute what is called the fingerprint of a document. These fingerprints then serve as a link between a question document and the documents of a base during a document search phase. An image does not necessarily contain all the elements and characteristics described above.
Therefore, identifying an image begins with detecting the presence of its constituent elements. FIG. 9 shows an example of a flow chart of a process for extracting the terms of an image, with a first step 71 of characterizing the image objects as structural supports, which may optionally be preceded by a test for detecting structural elements so that step 71 can be omitted when structural elements are absent. Step 72 consists of a test to determine whether there is a textured background. If so, the process moves on to a step 73 of characterizing the textured background in terms of spectral supports and autoregressive (AR) parameters, then to a step 74 of characterizing the background color. If there is no textured background, the process moves directly from step 72 to step 74. Finally, step 75 consists of storing the terms and constructing the fingerprints.

The characterization of the structural support elements of an image is now described in more detail. The basic principle of this characterization is the dismantling of the boundary zones of the image objects into multitudes of small basic elements called significant support elements (ESS), which convey the useful information of the boundary zones. These zones are composed of linear strips of variable size, or of bends of different curvatures. Statistics computed on these objects are then analyzed and used to construct the terms of these structural supports.

In order to describe more rigorously the main methods making up this approach, a digitized image is denoted by the set {y(i,j), (i,j) ∈ I×J}, where I and J are respectively the number of rows and the number of columns of the image. Starting from the previously calculated vertical gradient image {g_v(i,j), (i,j) ∈ I×J} and horizontal gradient image {g_h(i,j), (i,j) ∈ I×J}, this approach consists of partitioning the image according to the local orientation of its gradient into a finite number of equidistant classes. The image containing the gradient orientation is defined by the formula:

(15)

The partition is no more than an angular subdivision of the 2D plane (from 0° to 360°) by a well-defined discretization step. Using the local orientation of the gradient as the criterion for decomposing the boundary zones allows a better grouping of the pixels that form part of the same boundary zone. In order to solve the problem of boundary points that may be shared between two juxtaposed classes, a second partition is used, with the same number of classes as before but shifted by half a class. Starting from the classes of both partitions, a simple procedure consists of choosing those that total the largest number of pixels. Indeed, each pixel belongs to two classes, one from each of the two partitions. Given that each pixel is a potential element of a possible ESS, it votes for whichever of the two classes contains the most pixels. That class is a region where the probability of finding a larger ESS is as high as possible. After the votes, only the classes that total more than 50% of the votes are retained. These are the support regions likely to contain the ESS.

From these support regions, the ESS are determined and listed according to certain criteria, which may be:

- the length (a threshold l0 is determined for this purpose, and the ESS below and above this threshold are counted);
- the intensity, defined by the mean of the gradient modulus of the pixels making up each ESS (a threshold denoted I0 is then defined, and the ESS below and above this threshold are listed);
- the contrast, defined by the difference between the maximum and the minimum of the pixels.

At this stage of the process, all the so-called structural elements are known and listed according to the pre-identified types of structural support. They can be removed from the original image to make way for the characterization of the texture field. By way of example, consider the image 81 of Figure 11, reproduced as image 101 of Figure 14a; its boundary zones are illustrated in image 102 of Figure 14b.
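The double partition and the voting rule described above can be sketched as follows. This is an illustration under several assumptions that the text does not settle: the orientation is taken as atan2(g_v, g_h) reduced to [0°, 360°), ties between the two classes go to the unshifted partition, and "more than 50% of the votes" is interpreted as a class being retained when more than half of its own pixels voted for it. The names `orientation_deg`, `class_index`, `vote` and `retained_classes` are hypothetical.

```python
import math
from collections import Counter
from typing import Dict, List, Set, Tuple

ClassId = Tuple[str, int]  # ("plain" or "shifted", class number)


def orientation_deg(gv: float, gh: float) -> float:
    """Local gradient orientation in degrees, in [0, 360) (assumed convention)."""
    return math.degrees(math.atan2(gv, gh)) % 360.0


def class_index(angle: float, n_classes: int, shifted: bool) -> int:
    """Class of an angle in the plain partition or the one shifted by half a class."""
    width = 360.0 / n_classes
    offset = width / 2.0 if shifted else 0.0
    return int(((angle - offset) % 360.0) // width)


def vote(angles: List[float], n_classes: int) -> Tuple[Counter, Dict[ClassId, int]]:
    """Each pixel votes for whichever of its two classes holds more pixels."""
    plain = [class_index(a, n_classes, False) for a in angles]
    shifted = [class_index(a, n_classes, True) for a in angles]
    sizes: Dict[ClassId, int] = {}
    for c, n in Counter(plain).items():
        sizes[("plain", c)] = n
    for c, n in Counter(shifted).items():
        sizes[("shifted", c)] = n
    votes: Counter = Counter()
    for cp, cs in zip(plain, shifted):
        if sizes[("plain", cp)] >= sizes[("shifted", cs)]:
            votes[("plain", cp)] += 1
        else:
            votes[("shifted", cs)] += 1
    return votes, sizes


def retained_classes(angles: List[float], n_classes: int) -> Set[ClassId]:
    """Keep classes that received votes from more than half of their pixels."""
    votes, sizes = vote(angles, n_classes)
    return {cid for cid, size in sizes.items() if votes[cid] > 0.5 * size}
```

In a real image the list of angles would hold one orientation per boundary pixel, and the retained classes would be the support regions searched for ESS.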
The elements of these boundary zones are then dismantled and distributed, according to the orientation of their gradient, among different classes represented by the images 103 to 106 of FIGS. 14c to 14f. These various elements constitute the significant support elements, and their statistical analysis makes it possible to construct the terms of the structural component. In the case of Figures 14c to 14f, by way of example, the image 103 corresponds to a class 0 (0° - 45°), the image 104 corresponds to a class 1 (45° - 90°), the image 105 corresponds to a class 2 (90° - 135°) and the image 106 corresponds to a class 3 (135° - 180°).

In the absence of structural elements, it is assumed that the image is textured with more or less regular patterns, and the texture field is characterized. To do this, the image can be decomposed into three components, which are:

- a textural component containing random information (such as an image of fine sand, or of grass), where no particular arrangement can be detected;
- a periodic component (such as a jacquard sweater), where a repetition of dominant patterns is observed; and
- a directional component, where the patterns generally tend towards one or more preferred directions.

Since the objective is to characterize the texture of the image completely from a set of parameters, these three components are represented by parametric models. Thus, the texture of the regular and homogeneous image, denoted $\{\tilde{y}(i,j), (i,j) \in I \times J\}$, is decomposed into three components 16, 17, 18 as illustrated in FIG. 10, in accordance with the following relation:

$\{\tilde{y}(i,j)\} = \{w(i,j)\} + \{h(i,j)\} + \{e(i,j)\}$ (16)

where {w(i,j)} is the purely random component 16, {h(i,j)} is the harmonic component 17 and {e(i,j)} is the directional component 18.

The estimation of the parameters of these three components 16, 17, 18 completes this step of extracting information from a document. Estimation methods are described in the following paragraphs. An example of a method for detecting and characterizing the directional component of the image is described first.

First, a parametric model is applied to the directional component {e(i,j)}. It consists of a countable sum of directional elements, each of which is associated with a pair of integers (α, β) defining an orientation of angle θ such that $\theta = \tan^{-1}(\beta/\alpha)$. In other words, e(i,j) is defined by

$e(i,j) = \sum_{(\alpha,\beta)} e_{(\alpha,\beta)}(i,j)$

where each $e_{(\alpha,\beta)}(i,j)$ is defined by:

$e_{(\alpha,\beta)}(i,j) = \sum_{k=1}^{N_e} \left[ s_k^{\alpha,\beta}(i\alpha - j\beta) \cos\left(2\pi \frac{\nu_k}{\alpha^2+\beta^2}(i\beta + j\alpha)\right) + t_k^{\alpha,\beta}(i\alpha - j\beta) \sin\left(2\pi \frac{\nu_k}{\alpha^2+\beta^2}(i\beta + j\alpha)\right) \right]$

where:

- $N_e$ is the number of directional elements associated with (α, β);
- $\nu_k$ is the frequency of the k-th element;
- $\{s_k(i\alpha - j\beta)\}$ and $\{t_k(i\alpha - j\beta)\}$ are the amplitudes.

The directional component {e(i,j)} is thus completely defined by the knowledge of the parameters contained in a vector denoted E.

To estimate these parameters, use is made of the fact that the directional component of an image is represented in the spectral domain by a set of straight lines whose slopes are orthogonal to those defined by the pairs of integers $(\alpha_l, \beta_l)$ of the model, which are denoted $(\alpha_l, \beta_l)^{\perp}$. These lines can be decomposed into subsets of lines of the same slope, each associated with a directional element.

By way of illustration, Figures 15a and 15b show images 84, 86 containing one directional element and Figure 15c shows an image 88 containing two directional elements. Figure 15a1 shows a three-dimensional view of the spectrum of image 84 of Figure 15a. Figures 15b1 and 15c1 show Fourier modulus images 87, 89 respectively of images 86 and 88 of Figures 15b and 15c.

To calculate the elements of the vector E, an approach based on projecting the image along different directions can be adopted. The method consists first of all in verifying the presence of the directional component before estimating its parameters. The detection of the directional component of the image relies on knowledge of its spectral properties. If the spectrum of the image is treated as a 3D image (X, Y, Z), where (X, Y) are the pixel coordinates and Z the amplitude, the lines to be detected are represented by a set of peaks concentrated along straight lines whose slopes are defined by the pairs $(\alpha_l, \beta_l)$ sought (see Figure 15a1). To determine the presence of these lines, it suffices to count the predominant peaks. The number of these peaks provides information on the presence or absence of directional or harmonic supports. An example of a process for characterizing the directional component is now described. It involves calculating the direction pairs $(\alpha_l, \beta_l)$ and determining the number of directional elements. The Discrete Fourier Transform (DFT) of the image is calculated first, followed by an estimation of the lines of rational slope observed in the transformed image $\Psi(i,j)$. For this purpose, a set of projections is defined that discretizes the frequency domain into different projection angles $\theta_k$, with k finite. This set of projections can be obtained in different ways. One can, for example, search for all pairs of mutually prime integers $(\alpha_k, \beta_k)$ defining an angle $\theta_k$ such that $\theta_k = \tan^{-1}(\beta_k/\alpha_k)$. An order r such that $0 \le \alpha_k, \beta_k \le r$ makes it possible to control the number of projections. The symmetry properties can then be used to obtain all the pairs up to 2π. These pairs are illustrated in Figure 16 for $0 \le \alpha_k, \beta_k \le 3$.
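The enumeration of projection pairs can be illustrated with a minimal sketch. The function name `projection_pairs` is hypothetical, and the angle convention (angle measured as atan(β/α)) is an assumption made for the illustration; only the coprimality constraint and the bound r come from the text.

```python
from math import atan2, degrees, gcd

def projection_pairs(r):
    """Return coprime integer pairs (alpha, beta) with 0 <= alpha, beta <= r.

    Each pair defines one projection angle; (0, 0) is excluded because
    gcd(0, 0) == 0 and it defines no direction.
    """
    pairs = []
    for alpha in range(r + 1):
        for beta in range(r + 1):
            # gcd(a, 0) == a, so (1, 0) and (0, 1) are kept while
            # (2, 0), (0, 3), (2, 2), ... are rejected as repeated directions.
            if gcd(alpha, beta) == 1:
                pairs.append((alpha, beta))
    # Sort by angle so that the directions sweep the first quadrant in order.
    return sorted(pairs, key=lambda p: atan2(p[1], p[0]))

pairs = projection_pairs(3)
angles = [round(degrees(atan2(b, a)), 1) for a, b in pairs]
```

Only the first quadrant is generated here; the remaining directions would follow from the symmetry properties mentioned above.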

Projections of the modulus of the DFT of the image are carried out along the angles $\theta_k$. Each projection generates a one-dimensional vector $V_{(\alpha_k, \beta_k)}$, written $V_k$ to simplify the notation, which contains the directional information sought. Each projection $V_k$ is given by the formula:

$$V_k(n) = \sum_{\tau} \Psi(i + \tau\beta_k,\ j + \tau\alpha_k), \qquad 0 < i + \tau\beta_k < I - 1,\quad 0 < j + \tau\alpha_k < J - 1 \qquad (19)$$

with $n = -i\,\beta_k + j\,\alpha_k$, $0 \le |n| < N_k$ and $N_k = |\alpha_k|(T-1) + |\beta_k|(L-1) + 1$, where T × L is the size of the image. $\Psi(i,j)$ is the modulus of the Fourier transform of the image to be characterized. For each $V_k$, the high-energy elements and their spatial positions are selected. These high-energy elements are those whose value is a maximum relative to a threshold calculated according to the size of the image. At this stage of the calculation, the number of lines is known. The number of directional components $N_e$ is deduced using simple spectral properties of the directional component of a textured image. These properties are: 1. The lines observed in the spectral domain of a directional component are symmetrical with respect to the origin. The field of investigation can therefore be reduced to one half of the domain under consideration. 2. The maxima retained in the vector are candidates to represent lines belonging to directional elements. From knowledge of the respective positions of the lines on the modulus of the discrete Fourier transform (DFT), the exact number of directional elements is deduced. The position of the maximum line corresponds to the argument of the maximum of the vector $V_k$; the other lines of the same element are located every min{L, T}. The projection mechanism is shown in Figure 17 for $(\alpha_k, \beta_k) = (2, -1)$. After processing the vectors $V_k$ and producing the direction pairs $(\hat\alpha_k, \hat\beta_k)$, the number of lines associated with each pair is obtained. The total number of directional elements can thus be counted using the two above-mentioned properties, and the pairs of integers $(\hat\alpha_k, \hat\beta_k)$ associated with these components are identified; they are the directions orthogonal to those which have been retained. For all these pairs $(\hat\alpha_k, \hat\beta_k)$, the estimation of the frequencies of each detected element is immediate. Indeed, if one considers only the points of the original image along the line of equation $i\hat\alpha_k - j\hat\beta_k = c$, where c is the position of the maximum in $V_k$, these points constitute a harmonic one-dimensional (1-D) signal of constant amplitude whose frequency is $\hat\nu_k^{(\alpha,\beta)}$. It then suffices to estimate the frequency of this 1-D signal by a conventional method (locating the maximum value of the 1-D DFT of this signal). In summary, the method can be implemented with the following steps:

• The maximum of each projection is determined.

• The maxima are filtered so as to keep only those higher than a threshold.

• For each maximum $m_k$, corresponding to a pair $(\hat\alpha_k, \hat\beta_k)$, the number of lines associated with this pair is determined from the properties described above.

• The frequency associated with $(\hat\alpha_k, \hat\beta_k)$ is calculated; it corresponds to the intersection of the maximum line (corresponding to the maximum of the selected projection) with the horizontal axis.

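The calculation of the amplitudes $\{s_k^{(\alpha,\beta)}(c)\}$ and $\{t_k^{(\alpha,\beta)}(c)\}$, which are the other parameters contained in the vector E mentioned above, is now described. Knowing the direction $(\hat\alpha_k, \hat\beta_k)$ and the frequency $\hat\nu_k$, the amplitudes $\hat s_k^{(\alpha,\beta)}(c)$ and $\hat t_k^{(\alpha,\beta)}(c)$ can be determined, for c satisfying $i\hat\alpha_k - j\hat\beta_k = c$, using a demodulation method. Indeed, $\hat s_k^{(\alpha,\beta)}(c)$ is equal to the average of the pixels along the line of equation $i\hat\alpha_k - j\hat\beta_k = c$ of the new image obtained by multiplying $y(i,j)$ by $\cos\!\left(\frac{2\pi\hat\nu_k}{\hat\alpha_k^2+\hat\beta_k^2}(i\hat\beta_k + j\hat\alpha_k)\right)$. This is expressed by the equation

$$\hat s_k^{(\alpha,\beta)}(c) = \frac{1}{N_s} \sum_{i\hat\alpha_k - j\hat\beta_k = c} y(i,j)\,\cos\!\left(\frac{2\pi\hat\nu_k}{\hat\alpha_k^2+\hat\beta_k^2}(i\hat\beta_k + j\hat\alpha_k)\right) \qquad (20)$$

where $N_s$ is simply the number of elements of this new signal. In the same way, $\hat t_k^{(\alpha,\beta)}(c)$ is obtained by replacing the cosine with the sine:

$$\hat t_k^{(\alpha,\beta)}(c) = \frac{1}{N_s} \sum_{i\hat\alpha_k - j\hat\beta_k = c} y(i,j)\,\sin\!\left(\frac{2\pi\hat\nu_k}{\hat\alpha_k^2+\hat\beta_k^2}(i\hat\beta_k + j\hat\alpha_k)\right)$$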

The process described above can be summarized by the following steps:

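For every directional element $(\hat\alpha_k, \hat\beta_k)$ do: for every line (d), calculate 1. the average of the points (i, j) weighted by $\cos\!\left(\frac{2\pi\hat\nu_k}{\hat\alpha_k^2+\hat\beta_k^2}(i\hat\beta_k + j\hat\alpha_k)\right)$; this average corresponds to the estimate of the amplitude $\hat s_k^{(\alpha,\beta)}(d)$; 2. the average of the points (i, j) weighted by $\sin\!\left(\frac{2\pi\hat\nu_k}{\hat\alpha_k^2+\hat\beta_k^2}(i\hat\beta_k + j\hat\alpha_k)\right)$; this average corresponds to the estimate of the amplitude $\hat t_k^{(\alpha,\beta)}(d)$.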

Table 3 below summarizes the main steps of the projection process.

Step 1. Calculate the set of projection pairs $(\alpha_k, \beta_k) \in P_r$.

Step 2. Calculate the modulus of the DFT of the image y(i, j): $\Psi(\omega, \nu) = |\mathrm{DFT}(y(i,j))|$.

Step 3. For every $(\alpha_k, \beta_k) \in P_r$, calculate the vector $V_k$, the projection of $\Psi(\omega, \nu)$ along $(\alpha_k, \beta_k)$ according to formula (19).

Step 4. Straight-line detection: for every $(\alpha_k, \beta_k) \in P_r$, • determine $M_k = \max_j \{V_k(j)\}$, • calculate $n_k$, the number of pixels of significant value encountered along the projection, • save $n_k$ and $j_{max}$, the index of the maximum in $V_k$, • select the directions that satisfy the criterion $M_k / n_k > s_e$, where $s_e$ is a threshold to be defined, depending on the size of the image.

The directions selected are considered to be the directions of straight lines.

Step 5. Save the pairs $(\hat\alpha_k, \hat\beta_k)$, which are the pairs orthogonal to the pairs $(\alpha_k, \beta_k)$ retained in step 4.

Table 3
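Steps 3 and 4 of the projection process can be sketched as follows, under simplifying assumptions. The function names `project` and `strongest_direction` are hypothetical. The sketch bins each sample by the index relation n = -i·β + j·α and keeps the projection with the largest peak; the threshold test of step 4 and the count of significant pixels are deliberately left out.

```python
def project(psi, alpha, beta):
    """Project a 2-D array psi (list of rows) along the pair (alpha, beta).

    Sample psi[i][j] contributes to bin n = -i*beta + j*alpha, so every bin
    accumulates the samples lying on one line of constant n.
    Returns (n_min, bins), where bins[0] corresponds to n = n_min.
    """
    rows, cols = len(psi), len(psi[0])
    # n is linear in (i, j), so its extremes are reached at the corners.
    corners = [-i * beta + j * alpha
               for i in (0, rows - 1) for j in (0, cols - 1)]
    n_min, n_max = min(corners), max(corners)
    bins = [0.0] * (n_max - n_min + 1)
    for i in range(rows):
        for j in range(cols):
            bins[-i * beta + j * alpha - n_min] += psi[i][j]
    return n_min, bins

def strongest_direction(psi, pairs):
    """Return the pair whose projection has the largest peak, and the peak index n."""
    best = None
    for alpha, beta in pairs:
        n_min, bins = project(psi, alpha, beta)
        peak = max(bins)
        if best is None or peak > best[0]:
            best = (peak, (alpha, beta), n_min + bins.index(peak))
    return best[1], best[2]

# Toy "spectrum": a line of ones along the main diagonal of a 5x5 array.
psi = [[1.0 if i == j else 0.0 for j in range(5)] for i in range(5)]
pair, n_peak = strongest_direction(psi, [(1, 0), (0, 1), (1, 1), (1, 2), (2, 1)])
n_min_11, bins_11 = project(psi, 1, 1)
```

For a 5 × 5 array and the pair (1, 1), the projection has 9 bins, which agrees with the length $|\alpha_k|(T-1) + |\beta_k|(L-1) + 1$ given for formula (19).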

The detection and characterization of the periodic textural information of an image, contained in the harmonic component $\{h(i,j)\}$, is described below. This component can be represented by a finite sum of P 2-D sinusoids:

$$h(i,j) = \sum_{p=1}^{P} \left[ C_p \cos 2\pi(i\omega_p + j\nu_p) + D_p \sin 2\pi(i\omega_p + j\nu_p) \right] \qquad (22)$$

where • $C_p$ and $D_p$ are the amplitudes, • $(\omega_p, \nu_p)$ is the p-th spatial frequency.

Figure 18a1 shows an image 91 containing periodic components and Figure 18b1 shows a synthetic image containing one periodic component. Figure 18a2 shows an image 92, which is the modulus image of the DFT and exhibits a set of peaks. Figure 18b2 shows a 3D view 94 of the DFT, which reveals a pair of symmetric peaks 95, 96. In the spectral domain, the harmonic component thus appears as pairs of isolated peaks symmetrical with respect to the origin (see Figures 18a2 and 18b2). This component reflects the existence of periodicities in the image. The information to be determined consists of the elements of a vector containing, for each sinusoid, the amplitudes $C_p$, $D_p$ and the spatial frequency $(\omega_p, \nu_p)$.

For this purpose, the presence of this periodic component is first detected in the Fourier modulus image, and its parameters are then estimated. Detecting the periodic component consists in determining the presence of isolated peaks in the modulus image of the DFT. The procedure is the same as for determining the directional component: if the value $n_k$ obtained in step 4 of the method described in Table 1 is below a threshold, then isolated peaks are present, which characterize a harmonic component, rather than peaks forming a continuous straight line. Characterizing the periodic component amounts to locating the isolated peaks in the modulus image of the DFT. The spatial frequencies $(\hat\omega_p, \hat\nu_p)$ correspond to the positions of these peaks:

$$(\hat\omega_p, \hat\nu_p) = \arg\max_{(\omega,\nu)} \Psi(\omega, \nu) \qquad (24)$$

For the calculation of the amplitudes $(\hat C_p, \hat D_p)$, a demodulation method is used, as for estimating the amplitudes of the directional component. For each periodic element of frequency $(\hat\omega_p, \hat\nu_p)$, the corresponding amplitude is equal to the average of the pixels of the new image obtained by multiplying the image $\{y(i,j)\}$ by $\cos 2\pi(i\hat\omega_p + j\hat\nu_p)$ for $\hat C_p$, and by $\sin 2\pi(i\hat\omega_p + j\hat\nu_p)$ for $\hat D_p$. This is expressed by the following formulas:

$$\hat C_p = \frac{1}{L \times T} \sum_{n=0}^{L-1} \sum_{m=0}^{T-1} y(n,m)\,\cos 2\pi(n\hat\omega_p + m\hat\nu_p) \qquad (25)$$

$$\hat D_p = \frac{1}{L \times T} \sum_{n=0}^{L-1} \sum_{m=0}^{T-1} y(n,m)\,\sin 2\pi(n\hat\omega_p + m\hat\nu_p) \qquad (26)$$

In summary, a method of estimating the periodic component comprises the following steps:

Step 1. Locate the isolated peaks in the second half of the Fourier modulus image and count them.

Step 2. For every detected peak:

- Calculate its frequency using formula (24)

- Calculate its amplitude using formulas (25) and (26)
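The peak localization of formula (24) can be illustrated with a small, self-contained sketch. The function names `dft2_modulus` and `dominant_peak` are hypothetical, the DFT is computed by direct summation (only reasonable for tiny images), and a single global maximum stands in for the isolated-peak test described above.

```python
import cmath
import math

def dft2_modulus(img):
    """Return the modulus of the 2-D DFT of img (list of rows), by direct summation."""
    rows, cols = len(img), len(img[0])
    mod = [[0.0] * cols for _ in range(rows)]
    for u in range(rows):
        for v in range(cols):
            acc = 0j
            for i in range(rows):
                for j in range(cols):
                    phase = -2j * math.pi * (u * i / rows + v * j / cols)
                    acc += img[i][j] * cmath.exp(phase)
            mod[u][v] = abs(acc)
    return mod

def dominant_peak(mod):
    """Return the index (u, v) of the largest non-DC value of the modulus image."""
    best, best_uv = -1.0, None
    for u, row in enumerate(mod):
        for v, value in enumerate(row):
            if (u, v) != (0, 0) and value > best:
                best, best_uv = value, (u, v)
    return best_uv

# Synthetic periodic image: one sinusoid with 2 cycles along i and 1 cycle along j.
size = 8
img = [[math.cos(2 * math.pi * (2 * i + 1 * j) / size) for j in range(size)]
       for i in range(size)]
modulus = dft2_modulus(img)
peak = dominant_peak(modulus)
```

Because the peaks occur in pairs symmetrical with respect to the origin, the index returned is either (2, 1) or its twin (6, 7); restricting the search to one half of the modulus image, as in step 1, removes that ambiguity.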

The last information to be extracted is contained in the purely random component $\{w(i,j)\}$. This component can be represented by a 2-D autoregressive model with non-symmetric half-plane (NSHP) support, defined by the following difference equation:

$$w(i,j) = -\sum_{(k,l) \in S_{N,M}} a_{k,l}\, w(i-k,\ j-l) + u(i,j) \qquad (27)$$

where $a_{k,l}$ are the parameters to be determined for all (k, l) belonging to $S_{N,M} = \{(k,l) : k = 0,\ 1 \le l \le M\} \cup \{(k,l) : 1 \le k \le N,\ -M \le l \le M\}$. The pair (N, M) is called the order of the model. • $\{u(i,j)\}$ is a Gaussian white noise of finite variance $\sigma_u^2$. The parameters of the model are collected in a vector W comprising the order (N, M), the variance $\sigma_u^2$ and the coefficients $a_{k,l}$. There are many methods for estimating the elements of W, for example the 2-D Levinson algorithm or adaptive methods of the recursive least squares (RLS) type. A method is now described for characterizing the color of an image from which it is desired to extract terms $t_i$ representing iconic characteristics of this image, color being one particular example of such characteristics, which may include other characteristics such as algebraic or geometric moments, statistical properties, or the spectral properties of pseudo-Zernike moments. The method is based on the perceptual characterization of color. First, the color components of the image are transformed from the RGB (Red, Green, Blue) space to the HSV (Hue, Saturation, Value) space. Three components are thus obtained: Hue, Saturation, Value. From these three components, N colors or iconic components of the image are determined. Each iconic component Ci is represented by a vector of M values. These values represent the angular and annular distribution of the points representing each component, as well as the number of points of the component in question. The method developed is illustrated in Figure 19 with, by way of example, N = 16 and M = 17.
In a first main step 110, starting from an image 111 in (R, G, B) space, the image 111 is transformed into HSV space (step 112) to obtain an image in HSV space. The HSV model can be defined as follows.

Hue (H): varies over [0, 360], each angle representing a hue. Saturation (S): varies over [0, 1]; it measures the purity of colors and makes it possible to distinguish "bright", "pastel" or "faded" colors. Value (V): takes values in [0, 1]; it indicates whether a color is light or dark and how close it is to white or black. The HSV model is a non-linear transformation of the (R, G, B) space model. The human eye can distinguish 128 hues, 130 saturations and 23 shades. For white, V = 1 and S = 0; black has a value V = 0, while the hue H and the saturation S are undetermined. When V = 1 and S = 1, the color is pure. Each color is obtained by adding white or black to the pure color. To obtain lighter colors, S is reduced while H and V are kept; conversely, for darker colors, black is added by reducing V while H and S are kept. The conversion of the color image expressed in (R, G, B) coordinates into an image expressed in (H, S, V) space (Hue, Saturation, Value) is carried out as follows: for every point of coordinates (i, j) and value $(R_k, G_k, B_k)$, a point of coordinates (i, j) and value $(H_k, S_k, V_k)$ is produced, with:

$$V_k = \max(R_k, G_k, B_k)$$
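As an illustration of the conversion, the following sketch uses the standard-library `colorsys` module, whose HSV definition is assumed here to be an acceptable stand-in for the formulas of the method. The wrapper name `rgb_to_hsv_degrees` and the 8-bit input range are assumptions made for the example.

```python
import colorsys

def rgb_to_hsv_degrees(r, g, b):
    """Convert 8-bit (r, g, b) to (H in degrees, S in [0, 1], V in [0, 1])."""
    # colorsys works on floats in [0, 1] and returns the hue as a fraction of a turn.
    h, s, v = colorsys.rgb_to_hsv(r / 255.0, g / 255.0, b / 255.0)
    return h * 360.0, s, v

red = rgb_to_hsv_degrees(255, 0, 0)          # pure color: S = 1 and V = 1
white = rgb_to_hsv_degrees(255, 255, 255)    # S = 0 and V = 1
black = rgb_to_hsv_degrees(0, 0, 0)          # V = 0
dark_green = rgb_to_hsv_degrees(0, 128, 0)   # darkened pure green: V reduced, H and S kept
```

Where the hue and saturation are undetermined (black, white and grays), `colorsys` returns 0 by convention. In every case V equals the largest of the three normalized components, in agreement with $V_k = \max(R_k, G_k, B_k)$.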

The HSV space is then partitioned (step 113).

From the values of Hue, Saturation and Value, N colors are defined. In the case where N is equal to 16, these are: Black, White, Light Gray, Dark Gray, Medium Gray, Red, Pink, Orange, Brown, Olive, Yellow, Green, Sky Blue, Blue-green, Blue, Purple, Magenta. For each pixel, the color to which it belongs is evaluated. The number of points of each color is then calculated. In a second main step 120, the partitions obtained during the first main step 110 are characterized. In this step 120, the aim is to characterize each partition Ci obtained previously. A partition is defined by its iconic component and the coordinates of the pixels that make it up. The description of a partition is based on characterizing the spatial distribution of these pixels (point cloud). The method begins with the calculation of the center of gravity, the principal axis of the point cloud and the axis perpendicular to it. This new coordinate system is used as a reference for decomposing partition Ci into several sub-partitions, which are represented by the percentage of points constituting each sub-partition. The process of characterizing a partition Ci is as follows: - calculate the center of gravity and the orientation angle of the component Ci, which define the partitioning coordinate system, - calculate the angular distribution of the points of partition Ci over N directions, counter-clockwise, in N sub-partitions defined by the angles 0°, 360°/N, 2×360°/N, ..., i×360°/N, ..., (N-1)×360°/N, - partition the image space into concentric rings, calculating in each ring the number of points corresponding to each iconic component. The characteristic vector is obtained from the number of points of each color distribution Ci, the number of points in the angular and annular distributions, and the number of points of the component in the image. The characteristic vector is thus represented by 17 values in the example considered. In Figure 19, the second processing step 120 is illustrated for the iconic components C0 to C15, showing for components C0 (module 121) and C15 (module 131) the various steps performed, namely the angular partitioning 122, 132 leading to a number of points in the 8 orientations considered (steps 123, 133) and the annular partitioning 124, 134 leading to a number of points in the 8 radii considered (steps 125, 135), as well as taking into account the number of pixels of component C0 or C15 respectively
in the image (steps 126 and 136 respectively). Steps 123, 125, 126 lead to the production of 17 values for component C0 (step 127), while steps 133, 135, 136 lead to the production of 17 values for component C15 (step 137). Of course, the process is analogous for the other components C1 to C14. Figures 20 and 21 illustrate that the method described above is rotation invariant. Thus, in the example of Figure 20, the image is partitioned into two subsets, one containing the crosses ×, the other the circles ○. After calculating the center of gravity and the orientation angle θ, the orientation reference frame is obtained, which makes it possible to obtain the 4 angular sub-distributions (0°, 90°, 180°, 270°). An annular distribution is then made, and the number of points within a radius equal to 1 and then 2 is calculated. This gives the vector V0 characteristic of the image of Figure 20: 19; 6; 5; 4; 4; 8; 11. The image of Figure 21 is obtained by applying a rotation of 90° to the image of Figure 20. Applying the above method to the image of Figure 21 gives a vector V1 characterizing the latter, which shows that the rotation does not influence the characteristic vector. This leads to the conclusion that the method is invariant to rotation. As indicated above, the methods for obtaining, for an image, the terms representing the dominant colors, the textural properties or the structures of the dominant zones of the image can be applied both to the entire image and to portions of the image. Processes for segmenting a document, which make it possible to produce the portions of the image to be characterized, are briefly described below. According to a first possible technique, a static decomposition is carried out. The image is decomposed into blocks, with or without overlap. According to a second possible technique, a dynamic decomposition is carried out. In this case, the decomposition of the image into portions depends on the content of the image. According to a first example of a dynamic decomposition technique, the portions are produced from seeds, which are the points of singularity of the image (inflection points). The seeds are calculated first; they are then merged so that only a small number of them remain; finally, the points of the image are merged with the seeds having the same visual (statistical) properties to produce the portions or segments of the image to be characterized. According to another technique, using hierarchical segmentation, the points of the image are merged to form the first n classes. The points of each class are then decomposed into m classes, and so on until the desired number of classes is reached. During merging, the points are assigned to the nearest class. A class is represented by its center of gravity and/or a delimiter (bounding box, segment, curve, ...). The main steps of a method for characterizing the shapes of an image are now described.
The shape is characterized in several steps. To suppress the zoom effect or variations due to the movement of non-rigid elements of the image (movement of lips, of tree leaves, ...), a multiresolution analysis followed by a decimation of the image is performed. To reduce the effect of translation, the image or image portion is represented by its Fourier transform. To reduce the effect of zoom, the image is defined in log-polar space. The following steps can be implemented: a/ multiresolution: f = wavelet(I, n), where I is the starting image and n is the number of decompositions; b/ projection of the image into log-polar space: g(l, m) = f(i, j) with i = l*cos(m) and j = l*sin(m); c/ calculation of the Fourier transform of g: H = FFT(g); d/ characterization of H: d1/ projection of H in several directions (0°, 45°, 90°, ...), the result being a set of vectors whose dimension is equal to the dimension of the projection segment; d2/ calculation of the statistical properties of each projection vector (mean, variance, moments).

The term representing the shape consists of the values of the statistical properties of each projection vector.

Claims

1. A method of indexing multimedia documents, characterized in that it comprises the following steps:

(a) identifying and extracting, for each document, terms ti constituted by vectors characterizing properties of the multimedia document to be indexed, such as the shape, texture, color or structure of an image, the energy, oscillation rate or frequency information of an audio signal, or a group of characters of a text,

(b) storing the terms ti characterizing properties of the multimedia document in a term base comprising P terms, (c) determining a maximum number N of desired concepts grouping together the most pertinent terms ti, N being an integer less than P, and each concept ci being intended to group together all the terms that are neighboring from the point of view of their characteristics,

(d) calculating the matrix T of distances between the terms ti of the term base,

(e) decomposing the set P of terms ti of the term base into N parts Pj (1 ≤ j ≤ N) such that P = P1 ∪ P2 ∪ ... ∪ Pj ∪ ... ∪ PN, each part Pj comprising a set of terms tij and being represented by a concept cj, the terms ti being distributed in such a way that the most distant terms are in distinct parts Pl, Pm and the closest terms are in the same part Pl,

(f) structuring the concept dictionary so as to constitute a binary tree in which the leaves contain the concepts ci of the dictionary and the nodes of the tree contain the information necessary for traversing the tree during a phase of identifying a document by comparison with previously indexed documents, and

(g) constructing a fingerprint base made up of the set of concepts ci representing the terms ti of the documents to be indexed, each document being associated with a fingerprint that is specific to it.

2. An indexing method according to claim 1, characterized in that each concept ci of the fingerprint base is associated with a set of information comprising the number NbT of terms in the documents in which the concept ci is present.

3. An indexing method according to claim 1 or claim 2, characterized in that, for each document in which a concept ci is present, a fingerprint of the concept ci in the document is recorded, this fingerprint comprising the frequency of occurrence of the concept ci, the identification of the concepts that are close to the concept ci in the document, and a score that is an average value of the similarity measures between the concept ci and the terms ti of the document that are closest to the concept ci.

4. An indexing method according to any one of claims 1 to 3, characterized in that it comprises a step of optimizing the partition of the set P of terms of the term base in order to decompose this set P into M classes Ci (1 ≤ i ≤ M, with M ≤ P), so as to reduce the error of the distribution of the set P of terms of the term base into N parts (P1, P2, ..., PN), where each part Pi is represented by the term ti which is taken as concept ci, the error committed ε being such that $\varepsilon = \sum_{i=1}^{N} \varepsilon_{t_i}$, where $\varepsilon_{t_i} = \sum_{t_j \in P_i} d^2(t_i, t_j)$ is the error committed when replacing the terms tj of a part Pi by ti.

5. Indexing method according to claim 4, characterized in that it comprises the following steps:

(i) decomposing the set P of terms into two parts P1 and P2; (ii) determining the two most distant terms ti and tj of the set P, corresponding to the greatest distance Dij of the distance matrix T;

(iii) for each term tk of the set P, examining whether the distance Dki between the term tk and the term ti is smaller than the distance Dkj between the term tk and the term tj; if so, the term tk is assigned to the part P1, and otherwise the term tk is assigned to the part P2; (iv) iterating step (i) until the desired number N of parts Pi is obtained, and at each iteration applying steps (ii) and (iii) to the terms of the parts P1 and P2.
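By way of illustration only, the decomposition of steps (i) to (iv) can be sketched as below. The function names are hypothetical, the Euclidean distance is assumed for the distance matrix, the terms are assumed to be distinct, and the rule for choosing which part to split next (the part with the largest internal distance) is an assumption, since the steps above do not specify it.

```python
from itertools import combinations
from math import dist

def split_by_farthest_pair(terms):
    """Split a list of term vectors into two parts around its two most distant terms."""
    # (ii) the two terms separated by the largest distance
    ti, tj = max(combinations(terms, 2), key=lambda pair: dist(*pair))
    part1, part2 = [], []
    # (iii) a term joins part 1 if it is strictly closer to ti than to tj
    for tk in terms:
        (part1 if dist(tk, ti) < dist(tk, tj) else part2).append(tk)
    return part1, part2

def decompose(terms, n_parts):
    """(iv) Repeat the split until n_parts parts exist or no part can be split."""
    parts = [list(terms)]
    while len(parts) < n_parts:
        candidates = [p for p in parts if len(p) > 1]
        if not candidates:
            break
        # Assumption: split the part with the largest internal distance first.
        widest = max(candidates,
                     key=lambda p: max(dist(a, b) for a, b in combinations(p, 2)))
        parts.remove(widest)
        parts.extend(split_by_farthest_pair(widest))
    return parts

terms = [(0.0, 0.0), (0.1, 0.0), (6.0, 5.0), (6.1, 5.0), (10.0, 0.0)]
parts = decompose(terms, 3)
```

With these five sample terms, the first split separates the two terms near the origin from the rest, and the second split isolates (10.0, 0.0).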

6. An indexing method according to claim 4 or claim 5, characterized in that it comprises an optimization starting from the N disjoint parts {P1, P2, ..., PN} of the set P and from the N terms {t1, t2, ..., tN} which represent them, in order to reduce the error of the decomposition of the set P into N parts, and in that it comprises the following steps:

(i) calculating the centers of gravity Ci of the parts Pi,

(ii) calculating the errors $\varepsilon c_i = \sum_{t_j \in P_i} d^2(C_i, t_j)$ and $\varepsilon t_i = \sum_{t_j \in P_i} d^2(t_i, t_j)$ committed when replacing the terms tj of the part Pi respectively by Ci and by ti,

(iii) comparing εti and εci and replacing ti by Ci if εci ≤ εti, (iv) calculating the new matrix T of distances between the terms ti of the term base and repeating the process of decomposing the set P of terms of the term base into N parts, unless a stop condition is satisfied with $\frac{|\varepsilon c_t - \varepsilon c_{t+1}|}{\varepsilon c_t} <$ threshold, where $\varepsilon c_t$ represents the error committed at time t.

7. An indexing method according to any one of claims 1 to 6, characterized in that, in order to structure the concept dictionary, a navigation map is produced iteratively, starting by splitting the set of concepts into two subsets and then selecting a subset at each iteration, until the desired number of groups is obtained or until a stopping criterion is satisfied.

8. Indexing method according to claim 7, characterized in that the stopping criterion is constituted by the fact that the subsets obtained are all homogeneous with a low standard deviation.

9. An indexing method according to claim 7 or claim 8, characterized in that, during the structuring of the concept dictionary, navigation indicators are determined from a matrix $M = [c_1, c_2, \ldots, c_N] \in \Re^{p \times N}$ of the set C of concepts $c_i \in \Re^{p}$, where ci represents a concept of p values, according to the following steps: (i) calculating a representative w of the matrix M, (ii) calculating the covariance matrix between the elements of the matrix M and the representative w of the matrix M, (iii) calculating a projection axis u of the elements of the matrix M, (iv) calculating the value pi = d(u, ci) - d(u, w) and decomposing the set of concepts C into two subsets C1 and C2 as follows: ci belongs to C1 if pi ≤ 0, and ci belongs to C2 if pi > 0,

(v) storing in the node associated with C the information {u, w, |p1|, p2}, where p1 is the maximum of all pi ≤ 0 and p2 is the minimum of all pi > 0, the set of information {u, w, |p1|, p2} constituting the navigation indicators in the concept dictionary.

10. An indexing method according to any one of claims 1 to 9, characterized in that both the structural components and the complements of these structural components, constituted by the textural components of an image of the document, are analyzed, and in that:

(a) during the analysis of the structural components of the image:

(a1) the boundary zones of the image structures are distributed into different classes according to the orientation of the local variation in intensity, so as to define structural support elements (ESS) of the image, and (a2) terms constituted by vectors describing the local and global properties of the structural support elements are constructed by statistical analysis, (b) during the analysis of the textural components of the image: (b1) a purely random component of the image is detected and characterized parametrically, (b2) a periodic component of the image is detected and characterized parametrically, (b3) a directional component of the image is detected and characterized parametrically,

(d) for each document, a fingerprint is defined from the occurrences, positions and frequencies of said concepts.

11. An indexing method according to claim 10, characterized in that the local properties of the structural support elements taken into account for the construction of terms comprise at least the type of support, chosen from a linear strip or a curved arc, the length and width dimensions of the support, the main direction of the support, and the shape and statistical properties of the pixels constituting the support.

12. An indexing method according to claim 10 or claim 11, characterized in that the global properties of the structural support elements taken into account for the construction of terms comprise at least the number of supports of each type and their spatial arrangement.

13. An indexing method according to any one of claims 10 to 12, characterized in that, during the analysis of the structural components of the image, a preliminary test is carried out to detect the presence of at least one structure in the image and, in the absence of structure, the method passes directly to the step of analyzing the textural components of the image.

14. An indexing method according to any one of claims 10 to 13, characterized in that, in order to distribute the boundary zones of the image structures into different classes, starting from the digitized image defined by the set of pixels y(i, j) where (i, j) ∈ I × J, I and J denoting respectively the number of rows and the number of columns of the image, the vertical gradient image $g_v(i,j)$ with (i, j) ∈ I × J and the horizontal gradient image $g_h(i,j)$ with (i, j) ∈ I × J are calculated, and the image is partitioned according to the local orientation of its gradient into a finite number of equidistant classes, the image containing the orientation of the gradient being defined by the formula

$$O(i,j) = \arctan\left(\frac{g_h(i,j)}{g_v(i,j)}\right)$$

the classes constituting support regions likely to contain significant support elements are identified and, from the support regions, the significant support elements are determined and listed according to predetermined criteria.

15. An indexing method according to any one of claims 1 to 9, characterized in that, during the indexing of a multimedia document comprising video signals, terms ti constituted by key-images representing groups of consecutive homogeneous images are selected, and concepts ci are determined by grouping terms ti.

16. An indexing method according to claim 15, characterized in that, in order to determine the key-images constituting terms ti, a score vector VS is first constructed comprising a set of elements VS(i) representing the difference or the similarity between the content of an image of index i and that of an image of index i-1, and the score vector VS is analyzed in order to determine the key-images, which correspond to the maxima of the values of the elements VS(i) of the score vector VS.

17. An indexing method according to claim 16, characterized in that an image of index j is considered to be a key-image if the corresponding element VS(j) of the score vector VS is a maximum, the value VS(j) is located between two minima min G and min D, and the minimum M1 such that $M1 = \min(|VS(j) - \min G|,\ |VS(j) - \min D|)$ is greater than a given threshold.
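The key-image criterion can be illustrated with a short sketch. The function name `key_frames` is hypothetical, and min G and min D are taken here to be the nearest local minima to the left and to the right of the maximum, which is an assumption about how those two minima are located.

```python
def key_frames(vs, threshold):
    """Return the indices j where vs[j] is a local maximum that stands out
    from both neighbouring minima by more than `threshold`."""
    keys = []
    for j in range(1, len(vs) - 1):
        if not (vs[j] > vs[j - 1] and vs[j] >= vs[j + 1]):
            continue  # not a local maximum
        # Walk downhill to the nearest local minimum on each side.
        left = j
        while left > 0 and vs[left - 1] <= vs[left]:
            left -= 1
        right = j
        while right < len(vs) - 1 and vs[right + 1] <= vs[right]:
            right += 1
        # M1 = min(|VS(j) - min G|, |VS(j) - min D|)
        m1 = min(abs(vs[j] - vs[left]), abs(vs[j] - vs[right]))
        if m1 > threshold:
            keys.append(j)
    return keys

scores = [0.1, 0.2, 0.9, 0.2, 0.25, 0.3, 0.28, 0.1, 0.8, 0.1]
keys = key_frames(scores, threshold=0.5)
```

In this sample score vector, the small bump at index 5 is a local maximum but rises only about 0.1 above its left-hand minimum, so it is rejected, while the peaks at indices 2 and 8 are retained.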

18. An indexing method according to any one of claims 1 to 9, characterized in that during indexing of a multimedia document comprising audio components, the document is sampled and broken down into frames, which are then grouped into clips of which each is characterized by a term tj constituted by a parameter vector.

19. An indexing method according to claim 18, characterized in that a frame comprises between about 512 and 2048 samples of the sampled audio document.

20. An indexing method according to claim 18 or claim 19, characterized in that the parameters taken into account for the definition of the terms ti comprise temporal information corresponding to at least one of the following parameters: the energy of the frames of the audio signal, the standard deviation of the energies of the frames in the clips, the ratio of the sound variations, the low energy ratio, the oscillation rate around a predetermined value, the high oscillation rate around a predetermined value, the difference between the number of oscillation rates above and below the average oscillation rate of the frames of the clips, the variance of the oscillation rate, the ratio of silent frames.

21. An indexing method according to any one of claims 18 to 20, characterized in that the parameters taken into account for the definition of the terms ti comprise frequency information corresponding to at least one of the following parameters: the center of gravity of the frequency spectrum of the short Fourier transform of the audio signal, the bandwidth of the audio signal, the ratio between the energy in a frequency band and the total energy over the whole frequency band of the sampled audio signal, the average value of the variation of the spectrum of two adjacent frames in a clip, the cutoff frequency of a clip.

22. An indexing method according to any one of claims 18 to 21, characterized in that the parameters taken into account for the definition of the terms tj comprise at least the energy modulation at 4 Hz.

23. An indexing method according to any one of claims 1 to 14, characterized in that the shapes of an image of a document are analyzed according to the following steps:

(a) multiresolution followed by decimation of the image,

(b) the image is defined in log-polar space,

(d) characterizing the Fourier transform H as follows: (d1) projecting H in several directions to obtain a set of vectors whose dimension is equal to the dimension of the projection segment, (d2) calculating the statistical properties of each projection vector, and (e) representing the shape of the image by a term ti consisting of the values of the statistical properties of each projection vector.