WO2001004870A1 - Method of automatic recognition of musical compositions and sound signals - Google Patents
Method of automatic recognition of musical compositions and sound signals
- Publication number
- WO2001004870A1 — application PCT/GR2000/000024
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- signal
- vectors
- model
- unknown
- group
- Prior art date
Classifications
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10H—ELECTROPHONIC MUSICAL INSTRUMENTS; INSTRUMENTS IN WHICH THE TONES ARE GENERATED BY ELECTROMECHANICAL MEANS OR ELECTRONIC GENERATORS, OR IN WHICH THE TONES ARE SYNTHESISED FROM A DATA STORE
- G10H1/00—Details of electrophonic musical instruments
- G10H1/02—Means for controlling the tone frequencies, e.g. attack or decay; Means for producing special musical effects, e.g. vibratos or glissandos
- G10H1/06—Circuits for establishing the harmonic content of tones, or other arrangements for changing the tone colour
- G10H1/12—Circuits for establishing the harmonic content of tones, or other arrangements for changing the tone colour by filtering complex waveforms
- G10H1/125—Circuits for establishing the harmonic content of tones, or other arrangements for changing the tone colour by filtering complex waveforms using a digital filter
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10H—ELECTROPHONIC MUSICAL INSTRUMENTS; INSTRUMENTS IN WHICH THE TONES ARE GENERATED BY ELECTROMECHANICAL MEANS OR ELECTRONIC GENERATORS, OR IN WHICH THE TONES ARE SYNTHESISED FROM A DATA STORE
- G10H2240/00—Data organisation or data communication aspects, specifically adapted for electrophonic musical tools or instruments
- G10H2240/121—Musical libraries, i.e. musical databases indexed by musical parameters, wavetables, indexing schemes using musical parameters, musical rule bases or knowledge bases, e.g. for automatic composing methods
- G10H2240/131—Library retrieval, i.e. searching a database or selecting a specific musical piece, segment, pattern, rule or parameter set
- G10H2240/141—Library retrieval matching, i.e. any of the steps of matching an inputted segment or phrase with musical database contents, e.g. query by humming, singing or playing; the steps may include, e.g. musical analysis of the input, musical feature extraction, query formulation, or details of the retrieval process
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10H—ELECTROPHONIC MUSICAL INSTRUMENTS; INSTRUMENTS IN WHICH THE TONES ARE GENERATED BY ELECTROMECHANICAL MEANS OR ELECTRONIC GENERATORS, OR IN WHICH THE TONES ARE SYNTHESISED FROM A DATA STORE
- G10H2250/00—Aspects of algorithms or signal processing methods without intrinsic musical character, yet specifically adapted for or used in electrophonic musical processing
- G10H2250/131—Mathematical functions for musical analysis, processing, synthesis or composition
- G10H2250/215—Transforms, i.e. mathematical transforms into domains appropriate for musical signal processing, coding or compression
- G10H2250/235—Fourier transform; Discrete Fourier Transform [DFT]; Fast Fourier Transform [FFT]
Abstract
The invention refers to a method of automatic recognition of musical compositions and sound signals, which is used for the identification of musical compositions and sound signals played by radio or TV, or performed in public places. According to this invention, a desirably large number of musical compositions and sound signals which we want to identify is selected. To every one of these signals an original procedure is applied, leading to the extraction of a set of characteristics which will finally represent a model signal. Subsequently, for the implementation of the recognition, the unknown musical composition or sound signal is received and digitised. To its digitised version the same procedure of extracting a set of characteristics is applied. These are compared with the corresponding sets of the model signals and, by means of original criteria, it is decided whether there is a model signal that corresponds to the unknown signal under consideration and, if so, which model signal exactly corresponds to the unknown one.
Description
Method of Automatic Recognition of Musical Compositions and
Sound Signals
This invention refers to a method of automatic recognition of musical compositions and sound signals, and it is used to identify musical compositions and sound signals transmitted by radio or TV and/or performed in public places.
In the past, efforts to develop methods for the automatic recognition of musical compositions and sound signals have been made, leading to the creation of systems performing this task. However, these methods and the related systems exhibit a low percentage of successful recognition, both for musical compositions and for the sound signals of interest. The introduced method offers a much better percentage of fully automatic recognition, greater than or equal to ninety-eight percent (98%). According to this invention, a desirably large number of musical compositions and sound signals which we want to identify is selected. For easy reference we will refer to these compositions and signals with the term "model signals". To every one of these signals an original procedure is applied, leading to the extraction of a set of characteristics which will finally represent each model signal. Subsequently, for the implementation of the recognition, the unknown musical composition or sound signal is received, and the same procedure of extracting a corresponding set of characteristics is applied to it. These characteristics are compared with the corresponding sets of characteristics of the model signals and, by means of a number of original criteria, it is decided whether one (and which one exactly) of the model signals corresponds to the unknown signal under consideration. This procedure is depicted in figure 1.
It is stressed that, to the best of our knowledge, no similar method or related system has been reported in the international bibliography. In the world market there are very few similar systems, and they offer a percentage of successful recognition lower than sixty percent (60%).
The invention is described more thoroughly below:
First, the whole frequency band from 0 to 11025 Hz is divided into sub-bands that are almost exponentially distributed. An implementation of such a division is presented in Table 1. According to this implementation, the whole frequency band from 0 to 11025 Hz is divided into 60 sub-bands.
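Table 1 itself is not reproduced in this text, so, as an illustration only, the sketch below generates an almost exponential division of 0 to 11025 Hz into 60 sub-bands; the exact edges of Table 1 are an assumption here (geometric spacing above a low-frequency floor), not the patented division.

```python
import numpy as np

def subband_edges(f_max=11025.0, n_bands=60, f_low=50.0):
    # First edge at f_low, then geometric growth up to f_max;
    # prepend 0 Hz so that n_bands sections cover the whole band.
    ratio = (f_max / f_low) ** (1.0 / (n_bands - 1))
    edges = f_low * ratio ** np.arange(n_bands)
    return np.concatenate(([0.0], edges))  # n_bands + 1 boundaries

edges = subband_edges()
print(len(edges) - 1, "sub-bands; last three edges:", edges[-3:])
```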
Subsequently, each model signal is digitised with an arbitrary sampling frequency F_s, preferably greater than or equal to 11025 Hz, and a window of 8192, 16384 or 32768 samples length slides on the obtained digitised signal. In every such window, an adaptive Fast Fourier Transform is applied and the absolute value of the Discrete Fourier Transform is obtained. Next, the frequency-domain window is divided into sections according to the aforementioned choice of frequency sub-bands (see Table 1) and then, in every such section, all the peaks of the absolute value of the Fourier transform are spotted and the greatest one is selected. The value of this peak is called the "section representative". Then the L "representatives" with the greatest values are spotted, where the value of L may vary from 13 to 30, the most frequently used value being L = 20. The indicators of the sections corresponding to these representatives, sorted in increasing order, form a vector, which constitutes the "representative-vector" of the window. The above procedure is repeated while the window slides along the whole digitised model signal, thus creating all the representative vectors for the specific model signal. Notice that, while the window slides on the model signal, the generated representative vectors often remain unchanged in two successive windows, successive in the sense that their starting positions differ by one sample. For this reason, to every representative vector we assign a number indicating the number of subsequent windows in which the specific vector remained unchanged; for that number we will use the name "number of repetitions" of the representative vector. For the set of the generated representative vectors of each model signal we will use the name "the model signal set of representatives". The aforementioned procedure is depicted in figure 2.
For the identification of the unknown sound signal, which from now on will be called the "unknown signal", the following procedure is used: a part of the unknown signal, of length varying from eight (8) to sixteen (16) seconds, is received, digitised and registered, at least temporarily. At the beginning of that unknown signal part a window of length W = 8192 or W = 16384 or W = 32768 samples is obtained; notice that in any case this window will be of the same length as the sliding window which was used for the model signals. In this window a Fast Fourier Transform is applied and its absolute value is obtained. Afterwards, all the peaks of the absolute value of the Fourier transform are spotted and S copies of these peaks are created. For the creation of every copy of the peaks, the positions of the peaks are multiplied by a different coefficient f_i, i = 0, 1, ..., S, which is called the "window shift coefficient". Thus, S+1 different groups of peaks are created. For every one of these groups the following procedure is realised: the section to which each peak corresponds, according to the aforementioned frequency sub-bands division, is spotted (see Table 1). For every section to which at least one peak corresponds, the greatest peak is kept. The value of this peak is called the "representative of the section of the unknown signal corresponding to the shift coefficient f_i".
Next, the L representatives with the greatest values are spotted, where the value of L is the same as the one used for the model signals. The indicators of the sections corresponding to these representatives, sorted in increasing order, form a vector, which constitutes the "first representative vector of the unknown signal corresponding to the shift coefficient f_i".
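The distinctive ingredient for the unknown signal is the multiplication of the peak positions by a shift coefficient before the section lookup; a hedged sketch (the helper name and the simple neighbour-comparison peak detection are assumptions):

```python
import numpy as np

def shifted_section_representatives(spectrum, freqs, edges, f_shift):
    """Assign spectral peaks to sub-band sections after multiplying their
    positions by the window shift coefficient; keep the greatest peak
    per section."""
    # local maxima of the magnitude spectrum
    idx = np.where((spectrum[1:-1] > spectrum[:-2]) &
                   (spectrum[1:-1] > spectrum[2:]))[0] + 1
    reps = {}
    for i in idx:
        f = freqs[i] * f_shift                            # shifted peak position
        k = int(np.searchsorted(edges, f, side="right")) - 1
        if 0 <= k < len(edges) - 1:                       # inside some section
            reps[k + 1] = max(reps.get(k + 1, 0.0), float(spectrum[i]))
    return reps  # {section indicator: representative value}
```

Applying this once per shift coefficient f_i yields the S+1 groups of peaks; taking the top-L indicators of each group, sorted in increasing order, gives the corresponding representative vectors.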
Afterwards, the window slides by ℓ_1 samples, where the value of ℓ_1 may vary from 0.55 * F_s to 1.9 * F_s samples, the most frequently used value being ℓ_1 = 1.4 * F_s. For the new window position and for every shift coefficient f_i, i = 0, 1, ..., S, (S+1) vectors are computed in the way described above; each such vector will be called the "second representative vector of the unknown signal corresponding to the shift coefficient f_i". The above procedure is repeated for M-2 more windows, where each window starts at a sample having a distance of ℓ_i samples from the start of the previous one, i = 2, 3, ..., M-1, and where the value of M may fluctuate between 7 and 13 windows, the most usual value being M = 9. In this way S+1 groups of M representative vectors are obtained; for each such group we will employ the name "group of unknown signal representative vectors corresponding to the shift coefficient f_i".
It must be stressed that, for a specific application, the ℓ_i values, i = 1, 2, ..., M-1, are not necessarily equal, but they must be kept fixed throughout the whole procedure. The exact number (S+1) of the shift coefficients f_i varies from 1 to 15, while their values are given by the formula:

f_0 = 1, and, for i = 1, 2, ..., S:
f_i = 1 + ((i+1)/2) * STEP, if i is odd,
f_i = 1 - (i/2) * STEP, if i is even,

where STEP is a parameter expressing the shift step, which usually belongs to the interval [0.005, 0.01], the most frequently used value being 0.0075. The identification procedure described so far is depicted in figure 3.
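A direct transcription of this formula (a sketch; the printed values follow from the default STEP = 0.0075):

```python
def shift_coefficients(S, STEP=0.0075):
    """f_0 = 1; odd i stretch the peak positions, even i compress them."""
    f = [1.0]
    for i in range(1, S + 1):
        if i % 2 == 1:                      # i odd
            f.append(1.0 + ((i + 1) // 2) * STEP)
        else:                               # i even
            f.append(1.0 - (i // 2) * STEP)
    return f

print(shift_coefficients(4))   # [1.0, 1.0075, 0.9925, 1.015, 0.985]
```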
For the realisation of the unknown signal recognition, each group of unknown signal representatives is compared with elements of the set of representatives of each model signal separately. To fix ideas, each of the S+1 groups of M unknown signal representatives is compared with groups of M model signal representatives by means of the method consisting of the following steps:
E1) If the first representative vector of one group of the unknown signal is called V1 and the first representative vector of the model signal is called U1, then initially the number of the common elements between these two vectors is calculated. For example, if L = 20 and

V1 = [60 55 52 49 47 43 39 34 33 30 29 22 20 17 14 11 9 5 2 1]
U1 = [60 58 55 49 47 41 39 37 33 30 28 25 20 17 14 11 9 6 4 2]

then the number of the common elements is thirteen (13). Subsequently, it is checked whether the number of the common elements between the vectors V1 and U1 is greater than or equal to the number 0.51*L, which is called the "requisite similarity threshold". If, indeed, it is greater than or equal to 0.51*L, we proceed to step E2 below. If it is smaller than 0.51*L, then we consider that the set of the tests performed so far did not result in a successful recognition, so, after taking the next representative-vector of the model signal as U1, we start the comparison procedure again, beginning from the comparison of the vector V1 with the new U1.
E2) If the second representative vector of the unknown signal, corresponding to the same shift coefficient f_i as V1, is called V2, and the representative vector of the model signal corresponding to the sample (ℓ_1 * f_i) is called U2, then we calculate the number of the common elements between these two vectors. Afterwards, we check whether the number of the common elements between the vectors V2 and U2 is greater than or equal to the "requisite similarity threshold". If it is greater or equal, we proceed to step E3 below. If it is smaller, then we consider that the set of tests performed so far did not result in a successful recognition, so, after taking the next representative-vector of the model signal as U1, the comparison procedure starts again, beginning from the comparison of the vector V1 with the new U1.
(The intermediate steps E3 through E(M-2) are analogous.)
E(M-1)) If the (M-1)-th representative vector of the unknown signal corresponding to the same shift coefficient f_i as V1 is called V(M-1), and the representative vector of the model signal corresponding to the sample ((ℓ_1 + ... + ℓ_(M-2)) * f_i) is called U(M-1), then we calculate the number of the common elements between these two vectors. Next, we check whether the number of the common elements between the vectors V(M-1) and U(M-1) is greater than or equal to the "requisite similarity threshold". If it is greater or equal, we proceed to step EM below. If it is smaller, then we consider that the set of tests performed so far did not result in a successful recognition, so, after taking the next representative-vector of the model signal as U1, the comparison procedure starts again, beginning from the comparison of the vector V1 with the new U1.
EM) If the M-th representative vector of the unknown signal corresponding to the same shift coefficient f_i as V1 is called VM, and the representative vector of the model signal corresponding to the sample ((ℓ_1 + ... + ℓ_(M-1)) * f_i) is called UM, then we calculate the number of the common elements between these two vectors VM and UM and we check whether it is greater than or equal to the "requisite similarity threshold". If it is greater or equal, we proceed to step E(M+1) below. If it is smaller, then we consider that the set of tests performed so far did not result in a successful recognition, so, after taking the next representative-vector of the model signal as U1, the checking procedure starts again, beginning from the comparison of the vector V1 with the new U1.
E(M+1)) First we check how many of the pairs (V1,U1), (V2,U2), ..., (VM,UM) have, according to the previous comparisons, a number of common elements in the interval [0.51*L, 0.71*L]. If the number of these pairs is greater than 0.34*M, then we consider that the set of tests performed so far did not result in a successful recognition, so, after taking the next representative-vector of the model signal as U1, the comparison procedure starts again, beginning from the comparison of the vector V1 with the new U1. If the number of these pairs is smaller than or equal to 0.34*M, then the following check is realised: for the pairs of vectors (V1,U1), (V2,U2), ..., (VM,UM), which have already been compared, we calculate the mean value of the number of the common elements. If this mean value is greater than or equal to 0.71*L, then we consider that the comparison between the group of the M representatives of the model signal that we checked and the group of representatives of the unknown signal corresponding to the shift coefficient f_i is successful. If the mean value is smaller than 0.71*L, then we consider that the set of tests performed so far did not result in a successful recognition, so, after taking the next representative-vector of the model signal as U1, the comparison procedure starts again, beginning from the comparison of the vector V1 with the new U1. If all possible vectors of the model signal are unsuccessfully compared with one group of representatives of the unknown signal corresponding to the specific shift coefficient f_i, then we repeat the comparison procedure using the group of representatives of the unknown signal corresponding to the next shift coefficient f_(i+1). If the comparison of a specific set of model vectors with all (S+1) groups of representatives of the unknown signal is unsuccessful, then we proceed to the comparison of the unknown signal with another set of model vectors.
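Condensed into code, the decision of step E(M+1) might read as follows (a sketch; `common` is assumed to hold the per-pair common-element counts obtained in steps E1 through EM):

```python
def first_criterion_passes(common, L, M):
    """E(M+1): reject if too many pairs are only moderately similar,
    otherwise demand a high mean number of common elements."""
    moderate = sum(1 for c in common if 0.51 * L <= c <= 0.71 * L)
    if moderate > 0.34 * M:
        return False                      # too many borderline pairs
    return sum(common) / M >= 0.71 * L    # mean common elements must be high

# Example: M = 9 pairs, L = 20
print(first_criterion_passes([15, 16, 14, 15, 17, 15, 16, 14, 16], L=20, M=9))
```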
If the result of the above comparison is successful for a group of the unknown signal corresponding to a specific shift coefficient, let us say f_ε, we proceed to the application of the irrevocable comparison criterion, which is described below. As already mentioned, the successful application of the first criterion results in the determination of a group of M representatives of the model signal U1, U2, ..., UM which "fit" the group of representatives of the unknown signal V1, V2, ..., VM corresponding to the specific shift coefficient f_ε. Since the positions of these vectors in their corresponding signals are now known, it is possible to realise a sequence of comparisons between vectors of the unknown signal, corresponding to the specific shift coefficient f_ε, and the vectors of the model signal formed at the specific positions where the first criterion was satisfied.
In this way, in the digitised unknown signal of duration from eight (8) to sixteen (16) seconds, a window of length W is obtained, beginning at the unknown signal starting point. In this window a Fast Fourier Transform is applied again and its absolute value is obtained. Subsequently, the peaks of the Fourier transform are spotted and their positions are multiplied by the shift coefficient f_ε, which has previously been verified to satisfy the first criterion. Then, in each section, the peaks are sorted according to their value, and in each section to which at least one peak has been ascribed, the greatest peak is kept as the representative of the section of the unknown signal. Next, the L representatives with the greatest values are spotted, where the value of L is the same as the one used in the first criterion. The indicators of the sections corresponding to these representatives, sorted in increasing order, form a vector that constitutes the "first irrevocable representative-vector of the unknown signal".
Then the window slides by k_1 samples, where the value of k_1 is equal to ℓ_1 * (M-1)/(D-1), and the value of D fluctuates between 30 and 50. For the new window position a new vector is calculated, in the same way as described before, called the "second irrevocable representative-vector of the unknown signal". The above procedure is repeated for D-2 more windows, each one starting at a distance of k_i samples from the start of its previous window, where k_i = ℓ_i * (M-1)/(D-1), i = 2, 3, ..., D-1.
In this way, finally, a group consisting of D representative-vectors is created. We will refer to this group with the name "irrevocable group of representatives of the unknown signal".
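Under the reconstruction above, and assuming a fixed first-criterion step ℓ_1 (so that all k_i coincide), the irrevocable window step can be computed as in this sketch:

```python
def irrevocable_step(l1, M, D):
    """k = l1 * (M - 1) / (D - 1): the D irrevocable windows span roughly
    the same stretch of the unknown signal as the M first-criterion windows."""
    return int(l1 * (M - 1) / (D - 1))

# Example with the most frequently used values: l1 = 1.4 * F_s, M = 9, D = 33
Fs = 11025
print(irrevocable_step(int(1.4 * Fs), M=9, D=33))   # 3858 samples per slide
```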
In order to reach the final decision whether the unknown signal corresponds to the model signal at hand, the irrevocable group of representatives of the unknown signal is compared to elements of the set of representatives of the model signal, by means of a method similar to the first criterion, consisting of the steps briefly described below:
T1) If the first irrevocable representative-vector of the unknown signal is called V1, and U1 is the representative-vector of the model signal corresponding to the position, let us say Λ1, where the first criterion was satisfied, then initially we calculate the number of the common elements between these two vectors.
T2) If the second irrevocable representative-vector of the unknown signal is called V2, then this vector is compared with the vector U2, which is the representative vector of the model signal corresponding to the position Λ1 + k_1 * f_ε, where f_ε is the shift coefficient that has been determined by the first criterion. (The intermediate steps T3 through T(D-2) are analogous.)
T(D-1)) If the (D-1)-th irrevocable representative-vector of the unknown signal is called V(D-1), and the representative-vector of the model signal corresponding to the sample Λ1 + (k_1 + ... + k_(D-2)) * f_ε is called U(D-1), then we calculate the number of the common elements between these two vectors.
Finally, having calculated the number of the common elements for these D pairs of vectors, in order to decide on the identification we check whether the two conditions stated below are satisfied:
[Condition 1] At least 0.825 * D of the pairs of vectors have a number of common elements greater than 0.71 * L.
[Condition 2] The total number of the common elements of the vectors, namely the sum of the common elements of the pairs (V1,U1), (V2,U2), ..., (V(D-1),U(D-1)), is greater than 0.6875 * D * L.
If these two conditions are satisfied, then we have successfully recognised that the specific musical composition corresponds to the model signal at hand. The whole identification procedure is depicted in Figures 3, 4 and 5.
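The two final conditions also translate directly into code (a sketch; `common` here holds the per-pair common-element counts for the D irrevocable pairs):

```python
def irrevocable_criterion_passes(common, L, D):
    """Condition 1: at least 0.825*D pairs share more than 0.71*L elements.
    Condition 2: the total number of shared elements exceeds 0.6875*D*L."""
    strong_pairs = sum(1 for c in common if c > 0.71 * L)
    return strong_pairs >= 0.825 * D and sum(common) > 0.6875 * D * L

# Example: D = 33, L = 20 -> need >= 27.225 strong pairs and total > 453.75
common = [16] * 30 + [12] * 3            # 30 strong pairs, total 516
print(irrevocable_criterion_passes(common, L=20, D=33))   # True
```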
Table 1
Claims
The method for the automatic recognition of musical compositions and sound signals, which is used for the identification of musical compositions and sound signals played by radio or TV or performed in public places, is based on the existence of a procedure which is applied to the model signals and results in the extraction of a set of characteristics that will finally represent each model signal. Besides, it is based on a similar procedure, which is applied to the unknown musical composition or sound signal for the extraction of similar characteristics, and, finally, it is based on a procedure of comparison performed between the representative sets of characteristics of the model and the unknown signal. This method is characterised by the model sets of characteristics corresponding to a division of the frequency domain into bands. It is also characterised by two original criteria for the identification decision, according to which two musical compositions or sound signals are identified only when: a) A group of M representative vectors of the model signal U1, U2, ..., UM, where two successive vectors are calculated at samples having distance ℓ_i, i = 1, 2, ..., M-1, "matches" a group of representatives of the unknown signal V1, V2, ..., VM which corresponds to a specific shift coefficient f_ε. Notice that the values of ℓ_i, i = 1, 2, ..., M-1, are not necessarily equal but, in any case, are kept fixed throughout the application. The matching between U1, U2, ..., UM and V1, V2, ..., VM is realised by means of the following criterion:
All comparisons between the vectors of the pairs (V1,U1), (V2,U2), ..., (VM,UM) are made and the number of pairs with common elements in the interval [0.51 * L, 0.71 * L] is computed. If it is greater than 0.34 * M, then we consider that the set of comparisons performed so far did not result in a successful recognition. If this number is smaller than or equal to 0.34 * M, then it is checked whether the mean value of the number of common elements of the vectors of the above pairs (V1,U1), (V2,U2), ..., (VM,UM) is greater than or equal to 0.71 * L. If it is, then we consider that the comparison between the group of the M representatives of the model signal at hand, corresponding to the shift coefficient f_ε, and the group of representatives of the unknown signal is successful. b) A second group of D irrevocable representative-vectors of the model signal
U1, U2, ..., UD, each calculated at a distance k_j from its previous one, where k_j = ℓ_j * (M-1)/(D-1), j = 1, 2, ..., D-1, which are not necessarily equal but are, in any case, kept fixed throughout the application, "matches" a group of representatives of the unknown signal V1, V2, ..., VD which corresponds to the specific shift coefficient f_ε, according to the following criterion:
- At least 0.825 * D of the pairs of vectors (V1,U1), (V2,U2), ..., (V(D-1),U(D-1)) have a number of common elements greater than 0.71 * L.
- The total number of the common elements of the vectors (namely the one that results from the summation of the common elements of the pairs (V1,U1), (V2,U2), ..., (V(D-1),U(D-1))) is greater than 0.6875 * D * L.
If both these criteria (a) and (b) are satisfied, then we have successfully recognised the specific musical composition.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
EP00940675A EP1147511A1 (en) | 1999-07-08 | 2000-07-07 | Method of automatic recognition of musical compositions and sound signals |
Applications Claiming Priority (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
GR99100235 | 1999-07-08 | ||
GR990100235 | 1999-07-08 |
Publications (1)
Publication Number | Publication Date |
---|---|
WO2001004870A1 (en) | 2001-01-18 |
Family
ID=10943871
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
PCT/GR2000/000024 WO2001004870A1 (en) | 1999-07-08 | 2000-07-07 | Method of automatic recognition of musical compositions and sound signals |
Country Status (3)
Country | Link |
---|---|
EP (1) | EP1147511A1 (en) |
GR (1) | GR1003625B (en) |
WO (1) | WO2001004870A1 (en) |
1999
- 1999-07-08 GR GR990100235A patent/GR1003625B/en not_active IP Right Cessation

2000
- 2000-07-07 EP EP00940675A patent/EP1147511A1/en not_active Withdrawn
- 2000-07-07 WO PCT/GR2000/000024 patent/WO2001004870A1/en not_active Application Discontinuation
Patent Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US5210820A (en) * | 1990-05-02 | 1993-05-11 | Broadcast Data Systems Limited Partnership | Signal recognition system and method |
US5874686A (en) * | 1995-10-31 | 1999-02-23 | Ghias; Asif U. | Apparatus and method for searching a melody |
US5778335A (en) * | 1996-02-26 | 1998-07-07 | The Regents Of The University Of California | Method and apparatus for efficient multiband celp wideband speech and music coding and decoding |
Cited By (47)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US8190435B2 (en) | 2000-07-31 | 2012-05-29 | Shazam Investments Limited | System and methods for recognizing sound and music signals in high noise and distortion |
US6990453B2 (en) | 2000-07-31 | 2006-01-24 | Landmark Digital Services Llc | System and methods for recognizing sound and music signals in high noise and distortion |
US8725829B2 (en) | 2000-07-31 | 2014-05-13 | Shazam Investments Limited | Method and system for identifying sound signals |
US7346512B2 (en) | 2000-07-31 | 2008-03-18 | Landmark Digital Services, Llc | Methods for recognizing unknown media samples using characteristics of known media samples |
US7865368B2 (en) | 2000-07-31 | 2011-01-04 | Landmark Digital Services, Llc | System and methods for recognizing sound and music signals in high noise and distortion |
US8700407B2 (en) | 2000-07-31 | 2014-04-15 | Shazam Investments Limited | Systems and methods for recognizing sound and music signals in high noise and distortion |
WO2002011123A3 (en) * | 2000-07-31 | 2002-05-30 | Shazam Entertainment Ltd | Method for search in an audio database |
US10497378B2 (en) | 2000-07-31 | 2019-12-03 | Apple Inc. | Systems and methods for recognizing sound and music signals in high noise and distortion |
US8386258B2 (en) | 2000-07-31 | 2013-02-26 | Shazam Investments Limited | Systems and methods for recognizing sound and music signals in high noise and distortion |
US9899030B2 (en) | 2000-07-31 | 2018-02-20 | Shazam Investments Limited | Systems and methods for recognizing sound and music signals in high noise and distortion |
JP2004505328A (en) * | 2000-07-31 | 2004-02-19 | シャザム エンターテインメント リミテッド | System and method for recognizing sound / musical signal under high noise / distortion environment |
WO2002011123A2 (en) * | 2000-07-31 | 2002-02-07 | Shazam Entertainment Limited | Method for search in an audio database |
US9401154B2 (en) | 2000-07-31 | 2016-07-26 | Shazam Investments Limited | Systems and methods for recognizing sound and music signals in high noise and distortion |
WO2002073593A1 (en) * | 2001-03-14 | 2002-09-19 | International Business Machines Corporation | A method and system for the automatic detection of similar or identical segments in audio recordings |
DE10117870B4 (en) * | 2001-04-10 | 2005-06-09 | Fraunhofer-Gesellschaft zur Förderung der angewandten Forschung e.V. | Method and apparatus for transferring a music signal into a score-based description and method and apparatus for referencing a music signal in a database |
US7064262B2 (en) | 2001-04-10 | 2006-06-20 | Fraunhofer-Gesellschaft zur Förderung der angewandten Forschung e.V. | Method for converting a music signal into a note-based description and for referencing a music signal in a data bank |
DE10117870A1 (en) * | 2001-04-10 | 2002-10-31 | Fraunhofer Ges Forschung | Method and device for converting a music signal into a note-based description and method and device for referencing a music signal in a database |
US7478045B2 (en) | 2001-07-16 | 2009-01-13 | M2Any Gmbh | Method and device for characterizing a signal and method and device for producing an indexed signal |
WO2003009273A1 (en) * | 2001-07-16 | 2003-01-30 | Fraunhofer-Gesellschaft zur Förderung der angewandten Forschung e. V. | Method and device for characterising a signal and for producing an indexed signal |
US7881931B2 (en) | 2001-07-20 | 2011-02-01 | Gracenote, Inc. | Automatic identification of sound recordings |
US7214870B2 (en) | 2001-11-23 | 2007-05-08 | Fraunhofer-Gesellschaft Zur Foerderung Der Angewandten Forschung E.V. | Method and device for generating an identifier for an audio signal, method and device for building an instrument database and method and device for determining the type of an instrument |
US6995309B2 (en) | 2001-12-06 | 2006-02-07 | Hewlett-Packard Development Company, L.P. | System and method for music identification |
WO2003054852A3 (en) * | 2001-12-06 | 2003-12-04 | Hewlett Packard Co | System and method for music inditification |
WO2003054852A2 (en) * | 2001-12-06 | 2003-07-03 | Hewlett-Packard Company | System and method for music inditification |
EP1504445A4 (en) * | 2002-04-25 | 2005-08-17 | Shazam Entertainment Ltd | Robust and invariant audio pattern matching |
EP1504445A1 (en) * | 2002-04-25 | 2005-02-09 | Shazam Entertainment Limited | Robust and invariant audio pattern matching |
US7627477B2 (en) | 2002-04-25 | 2009-12-01 | Landmark Digital Services, Llc | Robust and invariant audio pattern matching |
DE10232916B4 (en) * | 2002-07-19 | 2008-08-07 | Fraunhofer-Gesellschaft zur Förderung der angewandten Forschung e.V. | Apparatus and method for characterizing an information signal |
EP1387514A2 (en) * | 2002-07-31 | 2004-02-04 | British Broadcasting Corporation | Signal comparison method and apparatus |
EP1387514A3 (en) * | 2002-07-31 | 2008-12-10 | British Broadcasting Corporation | Signal comparison method and apparatus |
US8811885B2 (en) | 2004-02-19 | 2014-08-19 | Shazam Investments Limited | Method and apparatus for identification of broadcast source |
US8290423B2 (en) | 2004-02-19 | 2012-10-16 | Shazam Investments Limited | Method and apparatus for identification of broadcast source |
US7986913B2 (en) | 2004-02-19 | 2011-07-26 | Landmark Digital Services, Llc | Method and apparatus for identificaton of broadcast source |
US9225444B2 (en) | 2004-02-19 | 2015-12-29 | Shazam Investments Limited | Method and apparatus for identification of broadcast source |
US9071371B2 (en) | 2004-02-19 | 2015-06-30 | Shazam Investments Limited | Method and apparatus for identification of broadcast source |
DE102004023436A1 (en) * | 2004-05-10 | 2005-12-08 | Fraunhofer-Gesellschaft zur Förderung der angewandten Forschung e.V. | Apparatus and method for analyzing an information signal |
US8065260B2 (en) | 2004-05-10 | 2011-11-22 | Juergen Herre | Device and method for analyzing an information signal |
DE102004023436B4 (en) * | 2004-05-10 | 2006-06-14 | M2Any Gmbh | Apparatus and method for analyzing an information signal |
US8017855B2 (en) | 2004-06-14 | 2011-09-13 | Fraunhofer-Gesellschaft Zur Foerderung Der Angewandten Forschung E.V. | Apparatus and method for converting an information signal to a spectral representation with variable resolution |
DE102004028694B3 (en) * | 2004-06-14 | 2005-12-22 | Fraunhofer-Gesellschaft zur Förderung der angewandten Forschung e.V. | Apparatus and method for converting an information signal into a variable resolution spectral representation |
US7653534B2 (en) | 2004-06-14 | 2010-01-26 | Fraunhofer-Gesellschaft Zur Foerderung Der Angewandten Forschung E.V. | Apparatus and method for determining a type of chord underlying a test signal |
US7739062B2 (en) | 2004-06-24 | 2010-06-15 | Landmark Digital Services Llc | Method of characterizing the overlap of two media segments |
US9092518B2 (en) | 2005-02-08 | 2015-07-28 | Shazam Investments Limited | Automatic identification of repeated material in audio signals |
US8090579B2 (en) | 2005-02-08 | 2012-01-03 | Landmark Digital Services | Automatic identification of repeated material in audio signals |
US8453170B2 (en) | 2007-02-27 | 2013-05-28 | Landmark Digital Services Llc | System and method for monitoring and recognizing broadcast data |
JP2016512610A (en) * | 2013-02-04 | 2016-04-28 | テンセント・テクノロジー・(シェンジェン)・カンパニー・リミテッド | Method and device for audio recognition |
US10354307B2 (en) | 2014-05-29 | 2019-07-16 | Tencent Technology (Shenzhen) Company Limited | Method, device, and system for obtaining information based on audio input |
Also Published As
Publication number | Publication date |
---|---|
EP1147511A1 (en) | 2001-10-24 |
GR990100235A (en) | 2001-03-30 |
GR1003625B (en) | 2001-08-31 |
Legal Events
Date | Code | Title | Description
---|---|---|---
| AK | Designated states | Kind code of ref document: A1; Designated state(s): US
| AL | Designated countries for regional patents | Kind code of ref document: A1; Designated state(s): AT BE CH CY DE DK ES FI FR GB GR IE IT LU MC NL PT SE
| WWE | Wipo information: entry into national phase | Ref document number: 2000940675; Country of ref document: EP
| 121 | Ep: the epo has been informed by wipo that ep was designated in this application |
| WWP | Wipo information: published in national office | Ref document number: 2000940675; Country of ref document: EP
| WWW | Wipo information: withdrawn in national office | Ref document number: 2000940675; Country of ref document: EP