FR2739482A1

FR2739482A1 - Speech signal analysis method e.g. for low rate vocoder

Info

Publication number: FR2739482A1
Application number: FR9511604A
Authority: FR
Inventors: Pierre Andre Laurent
Original assignee: Thomson CSF SA
Current assignee: Thales SA
Priority date: 1995-10-03
Filing date: 1995-10-03
Publication date: 1997-04-04
Anticipated expiration: 2015-10-03
Also published as: FR2739482B1

Abstract

The method involves sampling a speech signal and splitting it into sub-bands. The signal in each sub-band is transposed to baseband (2), and the squared modulus of the autocorrelation ($\rho(M)$) of the transposed signal is calculated in each sub-band. The fundamental frequency of the speech signal is estimated by selecting, among the autocorrelations computed over all the sub-bands, the one with the maximum amplitude. In each sub-band, the maximum autocorrelation corresponding to the estimated fundamental frequency, to its multiples and submultiples, or to a neighboring value is then selected (4). These values are corrected before transmission.

Description

The present invention relates to a method and a device for evaluating the voicing of the speech signal by sub-bands in vocoders.

It applies in particular to vocoders operating at bit rates between 2400 bits/s and 4800 bits/s. For bit rates of 2400 bits/s and below, down to about 600 bits/s, satisfactory designs already exist. They use processes known as "linear prediction", in which the speech signal is synthesized at the output of a prediction filter whose input receives either a periodic signal, to reconstruct the voiced sounds representing vowels, or a random signal, to reproduce consonants.

For bit rates above 4800 bits/s, more recent methods known by the abbreviation CELP, for "Code Excited Linear Prediction", derive from the preceding ones, except that the signal applied to the input of the synthesis filter has a more complex form. These methods reproduce the speech signal relatively faithfully.

In the aforementioned bit rate range, between 2400 bits/s and 4800 bits/s, several competing methods of similar quality share a common feature: they split the audio band of the speech signal, typically 300 to 3300 Hz, into several contiguous sub-bands and try to represent the signal present in each sub-band as well as possible. This approach is justified by the fact that the human ear can itself be regarded as a bank of coupled filters acting as a complex spectrum analyzer, able to discriminate the different spectral components of a speech signal that may be accompanied by background noise. A first type of vocoder applying this principle appeared in the 1970s. It divided the audio band into 12 sub-bands of one third of an octave each, or into 24 sub-bands on a mel-type scale, which is linear at low frequencies and nonlinear elsewhere. The signal level measured in each sub-band gives the gain to apply at the output of the corresponding band-pass synthesis filter. In this embodiment, the signals applied to the inputs of the synthesis filters are either random or periodic. In a second embodiment, more generally known by the abbreviation HELP, for "Harmonic Enhanced Linear Prediction", the excitation signal of the prediction filter is enriched. Once the period of the fundamental of the speech signal, also called the "pitch", has been calculated, the proportion of deterministic signal representing the voiced sounds relative to the noise is determined from the levels of the spectral lines obtained by a so-called "pitch-synchronous" analysis, in which the lines are analyzed in a window that depends on the value of the fundamental.

Finally, in a last embodiment, which concerns the vocoders generally known as "IMBE vocoders", where IMBE stands for "Improved Multiband Excitation", the most probable fundamental, or pitch, is sought in each sub-band. After a fundamental value common to all the sub-bands has been selected, operations similar to those of HELP vocoders, but more complex, are carried out. As in HELP vocoders, the speech signal is synthesized by an inverse fast Fourier transform that reconstructs the different harmonics of the fundamental with their respective amplitudes for the voiced sounds.

However, whatever the complexity of the above embodiments, they are severely disturbed when the speech signal is picked up in a noisy environment. Under these conditions it becomes illusory to try to reproduce the waveform of the speech signal, because the signal-to-noise ratio, expressed in dB, is proportional to the bit rate. For example, if the S/N ratio is 16 dB at a bit rate of 4800 bits/s, it drops to about 8 dB at 2400 bits/s, which is far too low for the speech signal to remain fully intelligible.
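As a worked illustration of the figures quoted above, and assuming (as a simplification that is not spelled out in the text) a strictly linear law through the origin, $\mathrm{SNR_{dB}}(D) = c\,D$, fitted on the 4800 bits/s value:

```latex
\[
c = \frac{16\ \text{dB}}{4800\ \text{bits/s}}, \qquad
\mathrm{SNR_{dB}}(2400) = c \times 2400\ \text{bits/s} = 8\ \text{dB}.
\]
```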

The object of the invention is to overcome the aforementioned drawbacks.

To this end, the subject of the invention is a method for evaluating the voicing of the speech signal by sub-bands in vocoders, characterized in that it consists, after sampling the speech signal and splitting it into sub-bands, in transposing the resulting signal in each sub-band to baseband; in calculating the squared modulus of the autocorrelation of the transposed signal in each sub-band; in estimating the fundamental frequency of the speech signal by selecting, among the autocorrelations computed on the signals present in all the sub-bands, the one with the maximum amplitude; in selecting, in each sub-band, the maximum autocorrelation corresponding either to the estimated fundamental frequency, its multiples and submultiples, or to a value in their neighborhood; and in converting the autocorrelation values obtained into voicing rates before they are transmitted.

Other characteristics and advantages of the invention will appear in the description which follows, given with reference to the appended drawings, which represent:
Figure 1, a device for implementing the method according to the invention.

Figure 2, a response curve of a filter used in the invention to preprocess the speech signal.

Figure 3, a graph illustrating the thresholding method used to select the voicing rates.

The method according to the invention builds on the known principles of linear predictive coding, in which transmitting the coefficients of a predictor filter defines, very economically, a first approximation of the spectral envelope of the signal to be represented. The speech signal is synthesized by applying to the input of a predictor filter an excitation signal defined in a number of contiguous sub-bands distributed, linearly or not, over the audio band, typically 300 to 3400 Hz. The method mainly consists in measuring the voicing rate in each sub-band and assigning it a value between 0 and 1: a voicing rate of zero corresponds to a purely random excitation of the predictor filter, and a voicing rate of 1 to a perfectly periodic signal. According to the invention, the voicing rate is defined as the ratio of the power of the purely periodic component of the speech signal to its total power. After the speech signal has been sampled and split into a given number K of sub-bands, the method consists, in each sub-band, in transposing the signal to baseband by generating a complex signal called the "analytic signal" and in calculating the normalized autocorrelation of this analytic signal; in combining the autocorrelations obtained in the different sub-bands to obtain a first estimate of the fundamental period; and then, again in each sub-band, in estimating the maximum voicing rate around the period found and correcting it so as to limit false decisions.

The device for implementing this method, shown in figure 1, comprises a device 1 for preprocessing the speech signal, coupled to a set of K devices, referenced $2_1$ to $2_K$, which calculate the autocorrelation of the signal in K sub-bands. The outputs of devices $2_1$ to $2_K$ are coupled, on the one hand, to a device 3 for combining the autocorrelations and, on the other hand, to K devices, referenced $4_1$ to $4_K$, which search for the maximum voicing rate and correct it. A device 5 for evaluating the fundamental period is coupled between device 3 and the K devices $4_1$ to $4_K$.

The preprocessing device 1 consists of a filter having the transfer function:

in which $a_i$ denotes the filter coefficients and $\gamma$ is a constant close to 0.7. The coefficients $a_i$ are obtained in a known manner by autocorrelation of the speech signal. The advantage of this filtering is that, by providing a residual spectrum that is not too flat, it makes the fundamental easier to evaluate and eliminates the risk of false evaluation on formants that may appear between about 400 and 1200 Hz.
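The text only says that the coefficients are obtained "in a known manner by autocorrelation". One standard way of doing this is the autocorrelation method solved by the Levinson-Durbin recursion; the sketch below assumes that method, and the function names and the sign convention (predictor $s_n \approx \sum_i a_i s_{n-i}$) are choices made here for illustration, not taken from the patent.

```python
def autocorrelation(frame, max_lag):
    """Return r[k] = sum_n frame[n] * frame[n + k] for k = 0..max_lag."""
    n = len(frame)
    return [sum(frame[i] * frame[i + k] for i in range(n - k))
            for k in range(max_lag + 1)]


def levinson_durbin(r, order):
    """Solve sum_j a_j * r[|i - j|] = r[i] for i = 1..order.

    Returns (a, err): a[0..order-1] holds a_1..a_order for the predictor
    s[n] ~ sum_i a_i * s[n - i]; err is the residual prediction energy.
    Assumes r[0] > 0 (a frame that is not all zeros).
    """
    a = [0.0] * (order + 1)              # a[0] is unused padding
    err = r[0]
    for i in range(1, order + 1):
        acc = r[i] - sum(a[j] * r[i - j] for j in range(1, i))
        k = acc / err                    # reflection coefficient of stage i
        prev = a[:]
        a[i] = k
        for j in range(1, i):
            a[j] = prev[j] - k * prev[i - j]
        err *= (1.0 - k * k)
    return a[1:], err
```

A typical call would be `levinson_durbin(autocorrelation(frame, 10), 10)` for a tenth-order predictor; the order 10 is only an example value.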

Before the signal filtered by the preprocessing device 1 is autocorrelated, devices $2_1$ to $2_K$ transpose each of the K sub-bands to baseband, so that the search for autocorrelation maxima in each sub-band is not disturbed by several secondary maxima around the true maximum. Transposition to baseband also minimizes the computing power required, provided the sampling frequency is chosen slightly higher than the width of the sub-band so that no information is lost. A sub-band of width B centered on a frequency $F_c$ is transposed to baseband by, for example, multiplying the n-th sample $S_n$ of the signal supplied by the preprocessing device 1 by $\exp(-jn\theta)$, where $\theta$ is the phase rotation of a sinusoid of frequency $F_c$ during one sampling period, and then passing the resulting complex signal through a low-pass filter with cut-off frequency B/2. To reduce the volume of data to be processed, the signal may optionally be subsampled.
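The transposition described above (complex mixing, low-pass filtering, optional subsampling) can be sketched as follows. This is a minimal illustration, not the patent's implementation: the function name `to_baseband`, its parameters and the choice to emit only fully overlapped filter outputs are assumptions made here.

```python
import cmath
import math


def to_baseband(samples, fc, fs, lowpass, q=1):
    """Shift the band centered on fc to 0 Hz, low-pass filter, keep 1 sample in q.

    samples : real input samples S_n
    fc, fs  : sub-band center frequency and sampling frequency, in Hz
    lowpass : FIR low-pass coefficients h_i (cut-off near B/2)
    q       : subsampling factor
    """
    theta = 2.0 * math.pi * fc / fs      # phase rotation per sampling period
    mixed = [s * cmath.exp(-1j * n * theta) for n, s in enumerate(samples)]
    out = []
    # Only positions where the filter fully overlaps the data, one in q.
    for n in range(len(lowpass) - 1, len(mixed), q):
        out.append(sum(h * mixed[n - i] for i, h in enumerate(lowpass)))
    return out
```

For instance, a pure cosine at `fc` mixes down to a constant 0.5 plus a component at twice `fc`, which the low-pass filter is expected to remove.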

In practice, the angle $\theta$ is adjusted so that the phase rotation is exactly a multiple of $2\pi$ over the duration of a speech frame, set at 22.5 ms for example. The same cosine and sine tables can then be used for every new frame to calculate the real and imaginary parts of the product $S_n \exp(-jn\theta)$, and can therefore be stored once and for all in the preprocessing device 1. The filtering can then be performed in a known manner with a finite impulse response filter, calculating only one complex sample out of Q, that is, subsampling by a factor Q = 4 for example, according to the relation:

in which $h_i$ denotes the coefficients of a low-pass filter with transfer function

The other samples are neither calculated nor used in the rest of the processing. The filter H(z) does not need a perfectly flat frequency response in its passband, and a filter whose impulse response is the product of a Hamming window and a cardinal sine (sinc) is perfectly suitable. The impulse response of one such suitable filter, with 31 coefficients and a cut-off frequency of 500 Hz, is shown by way of example in figure 2.
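The two design choices just described, a frame-locked rotation angle and a Hamming-windowed sinc low-pass filter, can be sketched as below. The sampling frequency is not stated in the text; the usage values assume 8000 Hz (so that a 22.5 ms frame is 180 samples), and the function names and the unit-DC-gain normalization are illustrative assumptions.

```python
import math


def frame_locked_theta(fc, fs, frame_len):
    """Phase step closest to 2*pi*fc/fs whose total over one frame is a multiple of 2*pi."""
    cycles = round(fc * frame_len / fs)  # whole carrier cycles per frame
    return 2.0 * math.pi * cycles / frame_len


def hamming_sinc_lowpass(num_taps, cutoff, fs):
    """FIR low-pass: sinc response multiplied by a Hamming window, scaled to unit DC gain."""
    mid = (num_taps - 1) / 2.0
    fcn = cutoff / fs                    # cut-off in cycles per sample
    taps = []
    for i in range(num_taps):
        x = i - mid
        if x == 0:
            sinc = 2.0 * fcn             # limit of sin(2*pi*fcn*x) / (pi*x) at x = 0
        else:
            sinc = math.sin(2.0 * math.pi * fcn * x) / (math.pi * x)
        window = 0.54 - 0.46 * math.cos(2.0 * math.pi * i / (num_taps - 1))
        taps.append(sinc * window)
    total = sum(taps)
    return [t / total for t in taps]
```

With these assumptions, `hamming_sinc_lowpass(31, 500.0, 8000.0)` gives a 31-coefficient filter with a 500 Hz cut-off, and `frame_locked_theta(fc, 8000.0, 180)` gives an angle whose sine and cosine tables repeat identically from one frame to the next.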

The autocorrelation proper is computed in each sub-band by devices $2_1$ to $2_K$ as the squared modulus of the autocorrelation of the signal $Z_n$, according to the relation:

in which N denotes the length, in samples, of the calculation window and M a maximum delay, also expressed in samples. For example, assuming fundamental periods of at most 160 samples in the original signal and subsampling by a factor of 4, M is slightly more than 40 (for a maximum pitch of 160), its exact value depending on the margin required by the subsequent processing.

The preceding relation makes it possible to verify that, for a quasi-periodic signal with period M and complex gain g such that:
$Z_n = g \cdot Z_{n-M}$
the normalized autocorrelation $\rho(M)$ defined above is exactly 1, which is the desired result.
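The property just stated can be checked numerically. The sketch below assumes the usual normalized form $\rho(M) = \left|\sum_n Z_n Z^*_{n-M}\right|^2 / \left(\sum_n |Z_n|^2 \sum_n |Z_{n-M}|^2\right)$, which is one normalization that equals 1 when $Z_n = g \cdot Z_{n-M}$; the exact relation used in the patent may differ, and the function names are hypothetical.

```python
import cmath
import random


def rho(z, lag, n_samples):
    """Squared modulus of the normalized autocorrelation of z at the given lag.

    Uses the last n_samples values of z; needs len(z) >= n_samples + lag.
    """
    start = len(z) - n_samples
    if start < lag:
        raise ValueError("not enough samples for this lag")
    idx = range(start, len(z))
    num = sum(z[n] * z[n - lag].conjugate() for n in idx)
    e_now = sum(abs(z[n]) ** 2 for n in idx)
    e_lag = sum(abs(z[n - lag]) ** 2 for n in idx)
    if e_now == 0.0 or e_lag == 0.0:
        return 0.0
    return abs(num) ** 2 / (e_now * e_lag)


def quasi_periodic(period, gain, length, seed=0):
    """Complex test signal obeying z[n] = gain * z[n - period] after a random first period."""
    rng = random.Random(seed)
    z = [complex(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(period)]
    while len(z) < length:
        z.append(gain * z[-period])
    return z


z = quasi_periodic(7, 0.9 * cmath.exp(0.3j), 70)
print(rho(z, 7, 40))   # very close to 1 at the true period
print(rho(z, 3, 40))   # never above 1 (Cauchy-Schwarz inequality)
```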

However, the preceding autocorrelation gives satisfactory results only if the sub-bands are wide enough to contain at least two harmonics of the fundamental, and provided that its calculation is adapted to reflect the voicing rate of the speech signal in each sub-band. Conversely, when a sub-band is too narrow and can contain only one harmonic of rank k of the fundamental $F_0$, or else one very powerful harmonic accompanied by other harmonics of much lower amplitude produced by well-marked formants, the preceding autocorrelation calculation cannot distinguish the period of the fundamental: whatever the value of M, $\rho(M)$ is always equal or close to 1, which makes it impossible to evaluate the fundamental frequency and hampers the subsequent processing. This difficulty is overcome by introducing into the calculation a parameter called the voicing rate, which reflects the proportion of deterministic signal in the complete signal. Denoting this proportion by v, the mean power of the signal by $a^2$ and the mean power of the noise by $2\sigma^2$, the voicing rate v is given by the relation:
\[ v = \frac{a^2}{a^2 + 2\sigma^2} \]
A calculation set out in the Appendix shows that, under these conditions, an estimate of the amplitude of the autocorrelation signal $\rho(M)$ can be expressed by its mean value using the relation:

in which N is the number of signal samples over which the autocorrelation is computed.

The advantage of relation (6) is that it makes it possible to define a voicing rate V that is determined solely by the mean amplitude $\rho_{moy}$ of the autocorrelation signal and by the number N of samples considered, according to the relation:

Relation (7) nevertheless leads to an overestimate of the voicing rate in unvoiced sub-bands. Under relation (7), the voicing rate retained can only correspond to a maximum value of $\rho(M)$, whose mean can only be greater than 1/N. According to the invention, this difficulty is overcome by assuming that, for unvoiced sounds, the autocorrelation signal $\rho$ has a decreasing exponential probability density of mean m = 1/N, defined by the relation:

In this case, the probability density of the maximum of K values of $\rho$, assumed to be independent, is defined by the relation:

and its mean value is given by:
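A minimal reconstruction, assuming only what the text states (K independent values of $\rho$, each exponentially distributed with mean m): standard order-statistics results then give the density and the mean of the maximum. The names $p_{\max}$ and $H_K$ are introduced here for illustration and are not taken from the original relations.

```latex
\begin{align*}
p(\rho) &= \frac{1}{m}\, e^{-\rho/m}, \qquad \rho \ge 0,\\
P\bigl(\max \le \rho\bigr) &= \bigl(1 - e^{-\rho/m}\bigr)^{K},\\
p_{\max}(\rho) &= \frac{K}{m}\, e^{-\rho/m}\,\bigl(1 - e^{-\rho/m}\bigr)^{K-1},\\
E[\max] &= m \sum_{k=1}^{K} \frac{1}{k} = m\,H_K .
\end{align*}
```

For K = 40 this gives $H_{40} \approx 4.28$, so the mean of the maximum is roughly 4.3 times the mean m of a single value.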

For example, when the autocorrelation is calculated for about forty values, corresponding to fundamental frequency or pitch values from 0 to 160 in steps of 4, the mean m = 1/N must be replaced by $\rho_{0m}$, which is about 4.3 m, that is 4.3/N. This gives the corrected empirical voicing rate:
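The factor of about 4.3 can be checked numerically. The sketch below assumes, as the text does, 40 independent exponentially distributed values of mean m, for which the expected maximum is m times the 40th harmonic number; the function names are hypothetical and the Monte Carlo run is only a sanity check.

```python
import random


def harmonic_number(k):
    """H_k = 1 + 1/2 + ... + 1/k, the expected maximum of k unit-mean exponentials."""
    return sum(1.0 / i for i in range(1, k + 1))


def simulated_mean_of_max(k, mean, trials=5000, seed=0):
    """Monte Carlo estimate of E[max of k independent exponentials with the given mean]."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        total += max(rng.expovariate(1.0 / mean) for _ in range(k))
    return total / trials


print(harmonic_number(40))              # about 4.28
print(simulated_mean_of_max(40, 1.0))   # should land close to the value above
```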

Substituting the value of V thus obtained into relation (6), K series of normalized autocorrelation values, denoted $\rho_1(M), \ldots, \rho_K(M)$ respectively, are obtained, the number of terms in each series being set, in each sub-band, by the number of fundamental frequencies or pitches considered, which range from 0 to M. Devices $2_1$ to $2_K$ apply these values to corresponding inputs of device 3, which combines the autocorrelations in order to estimate the fundamental period. The method consists in replacing the whole set of correlations by a single fictitious correlation which is their maximum. For all values of M, $\rho(M)$ is then defined by the relation:
p (M) = MAX (pl (M), p2 (M), - -PK (M)) (12)
This way of proceeding allows the sub-band whose periodicity of the signal is the most marked to be distinguished with respect to the others so that the estimation of the period of the fundamental can be carried out on the most periodic available signal, which n It is not necessarily, for speech disturbed by a significant background noise, that located in the low frequencies. Naturally the preceding process must be executed for all the values of M. The corresponding values of p (M) thus obtained are applied to corresponding inputs of the device 5 for evaluating the fundamental period in order to determine a delay more or less common to all of the sub-bands for which the fictitious autocorrelation defined previously exhibits coherent maxima from subband to l other.This takes place in device 5 by looking over the set of acceptable values of M (typically from 20 to 160 or from 5 to 40 if the pit ch is evaluated by a step of 4), the local maxima, namely those for which the following relationships are verified,
p (M) 2 p (M - 1) and p (M)> p (M + l) (13) or p (M)> p (M - 1) and p (M) 2 p (M + l) (14) then to carry out an interpolation, for example parabolic, to find both a more precise position of the maximum, and of its value. The more precise search for the value of the maximum is carried out by operating shifts of the form:
1 P (MI) -p (M + 1)) (15)
2 p (M - 1) + p (M + 1) - 2p (M) and the calculation of the maximum value takes place by applying the following interpolation formula: #par = # (M) = # (# ( M-1) + #) (# (M-1) + # (M + 1) -2 # (M)) (16)
The values of "ppar" which are thus calculated for each of the values of M are then compared with one another in order to retain as candidate period Mcand, only that which corresponds to a maximum of "ppar".
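The candidate-period search described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the patent's implementation: the function names (`combine_subbands`, `parabolic_peak`, `candidate_period`) are hypothetical, each sub-band is assumed to be a list whose index is the lag M, and the interpolated value is the standard parabola vertex through three neighbouring points.

```python
def combine_subbands(p_sub):
    """Fictitious autocorrelation: pointwise maximum over the K sub-bands."""
    return [max(vals) for vals in zip(*p_sub)]


def parabolic_peak(p, m):
    """Refine a local maximum at index m with a parabola through
    p[m-1], p[m], p[m+1]. Returns (shift, value), where shift is the
    offset of the parabola vertex from m."""
    curvature = p[m - 1] + p[m + 1] - 2.0 * p[m]
    if curvature == 0.0:  # three equal values: nothing to refine
        return 0.0, p[m]
    shift = 0.5 * (p[m - 1] - p[m + 1]) / curvature
    value = p[m] - 0.5 * shift * shift * curvature
    return shift, value


def candidate_period(p_sub, m_min, m_max):
    """Return (m_cand, refined_position, interpolated_value), or None
    if the combined autocorrelation has no local maximum in range."""
    p = combine_subbands(p_sub)
    best = None
    first = max(m_min, 1)              # both neighbours must exist
    last = min(m_max, len(p) - 2)
    for m in range(first, last + 1):
        is_peak = (p[m] >= p[m - 1] and p[m] > p[m + 1]) or \
                  (p[m] > p[m - 1] and p[m] >= p[m + 1])
        if not is_peak:
            continue
        shift, value = parabolic_peak(p, m)
        if best is None or value > best[2]:
            best = (m, m + shift, value)
    return best
```

For instance, `candidate_period(p_sub, 20, 160)` would scan the typical lag range quoted in the text when the autocorrelations are available at every lag.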

The value $M_{cand}$ thus obtained is then applied by device 5 to the respective inputs of devices 41 ... 4K, together with the autocorrelation signals supplied by devices 21 ... 2K, in order to search for the voicing rate maxima. The corresponding processing is carried out for each sub-band from $p_k(M)$ (k = 1 to K), initializing the calculation with M = $M_{cand}$ and, if necessary, correcting M by plus or minus one unit so as to be in the vicinity of a maximum of p. If no maximum is found, the calculation keeps the largest of p(M-1), p(M) and p(M+1). If, on the other hand, a maximum is found, the calculation must be continued to look for a possible further maximum. This processing is also repeated for values of M equal to $M_{cand}/2$ and $2\,M_{cand}$, provided these values lie within the range allowed for the pitch. The best result is retained each time. At the end of the processing, K values of p are thus obtained, one per sub-band.
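A deliberately simplified Python sketch of this per-sub-band step is given below. It is an assumption-laden illustration: the names `subband_peak_near` and `subband_voicing` are hypothetical, the "plus or minus one unit" correction is reduced to taking the largest of the three values around each starting lag, the search for a further maximum is omitted, and the half-period is taken by integer division.

```python
def subband_peak_near(p_k, m0):
    """Largest autocorrelation value of one sub-band within one lag of m0."""
    lo = max(m0 - 1, 0)
    hi = min(m0 + 1, len(p_k) - 1)
    return max(p_k[lo:hi + 1])


def subband_voicing(p_sub, m_cand, m_min, m_max):
    """For each sub-band, keep the best value found near m_cand,
    m_cand/2 and 2*m_cand (only for starting lags inside [m_min, m_max]).

    Assumes m_min <= m_cand <= m_max < len(p_k) for every sub-band."""
    starts = [m for m in (m_cand, m_cand // 2, 2 * m_cand)
              if m_min <= m <= m_max]
    return [max(subband_peak_near(p_k, m) for m in starts) for p_k in p_sub]
```

The result is a list of K values, one per sub-band, which would then feed the voicing rate correction.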

Because the value estimated by the calculation can vary quite widely around its mean as soon as the voicing rate is less than 1, the preceding processing can be supplemented by computing the voicing rates twice as often as necessary, for example twice per frame, and retaining as the value only the maximum of three successive values, so as to favor voiced sounds.
Finally, the voicing value retained for a given sub-band undergoes a last correction by double thresholding: if it is below a determined minimum value, which sets an unvoiced threshold, it is reset to zero; if it is above another threshold value, called the voicing threshold, it is set to 1; between the two thresholds it varies linearly, as shown in FIG. 3. This thresholding accounts for the fact that, even for perfectly voiced sounds, the observed value can drop to about 0.8 instead of the theoretical 1, and that for purely unvoiced sounds the observed value can frequently reach 0.3 to 0.4. Since the measured voicing rate also decreases as frequency increases, because of complex phenomena known as "micromelody" or "jitter" that appear more particularly for low-pitched voices, the thresholds defined above can always be lowered as a function of the center frequency of the sub-band considered.
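The two corrections above (maximum of three successive values, then double thresholding with a linear ramp) can be sketched as follows. The threshold values 0.4 and 0.8 are illustrative assumptions suggested by the observed ranges quoted in the text, not values given by the patent; the function names are hypothetical, and the three successive values are assumed to be the current one and the two before it.

```python
def max_of_three(values):
    """Running maximum over the current value and the two previous ones."""
    return [max(values[max(i - 2, 0):i + 1]) for i in range(len(values))]


def double_threshold(v, unvoiced_thr=0.4, voiced_thr=0.8):
    """Map a raw voicing value to [0, 1]: 0 below the unvoiced threshold,
    1 above the voicing threshold, linear in between."""
    if v <= unvoiced_thr:
        return 0.0
    if v >= voiced_thr:
        return 1.0
    return (v - unvoiced_thr) / (voiced_thr - unvoiced_thr)


def corrected_voicing(values, unvoiced_thr=0.4, voiced_thr=0.8):
    """Apply both corrections to a sequence of raw voicing values."""
    return [double_threshold(v, unvoiced_thr, voiced_thr)
            for v in max_of_three(values)]
```

Lowering the thresholds with the sub-band center frequency, as the text suggests, would amount to passing per-band `unvoiced_thr` and `voiced_thr` values.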

It will be understood that all of the processing operations described above can be implemented using suitably programmed, commercially available signal processing microprocessors.

ANNEX
Denoting by:
- M the pitch value,
- N the number of samples used to compute the autocorrelation,
- S the current block of N deterministic (periodic) complex samples, to which a noise block X is added,
- S(-M) = λS the preceding block of deterministic samples, shifted by M samples and proportional to S, to which a noise block Y is added.

The value of the computed autocorrelation becomes, in vector notation:

Assuming that the signal has an average power $a^2$ and the noise an average power $\sigma^2$, and with λ the proportionality factor between S(-M) and S, it is fairly easy to show that the mathematical expectation of the numerator is equal to

$$N^2 a^4 |\lambda|^2 + N\sigma^2\bigl(\sigma^2 + a^2(1 + |\lambda|^2)\bigr)$$

and that of the denominator to

$$N^2\,(a^2 + \sigma^2)\,(|\lambda|^2 a^2 + \sigma^2)$$

The mathematical expectation of p is then approximated by the ratio of these two quantities, which gives:

For zero noise, this quantity is equal to 1 and, for a zero signal, to 1/N.

When the signal is stationary, the proportionality factor λ between S(-M) and S can be considered to be of unit modulus, and the expression above simplifies to:
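The Annex formulas did not survive extraction, so the following is only a consistency check worked out from the surrounding text, not the patent's own expressions. It assumes the notation $a^2$ for the signal power, $\sigma^2$ for the noise power and $\lambda$ for the proportionality factor between S(-M) and S, and further assumes that the noise blocks X and Y are zero-mean, mutually independent and independent of S.

```latex
\begin{align*}
E[p] &\approx
  \frac{N^2 a^4 |\lambda|^2 + N\sigma^2\bigl(\sigma^2 + a^2(1+|\lambda|^2)\bigr)}
       {N^2 (a^2+\sigma^2)(|\lambda|^2 a^2+\sigma^2)} \\[4pt]
\sigma^2 = 0 &\;\Rightarrow\; E[p] \approx 1,
  \qquad a^2 = 0 \;\Rightarrow\; E[p] \approx \frac{1}{N} \\[4pt]
|\lambda| = 1 &\;\Rightarrow\; E[p] \approx
  \frac{N a^4 + \sigma^2(\sigma^2 + 2a^2)}{N\,(a^2+\sigma^2)^2}
  = \frac{1}{N} + \Bigl(1-\frac{1}{N}\Bigr)\frac{a^4}{(a^2+\sigma^2)^2}
\end{align*}
```

The two limiting cases reproduce the values 1 and 1/N stated in the text, which supports this reading of the garbled expectations. In the stationary case the result depends only on N and on the signal-to-noise ratio $a^2/\sigma^2$.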

Claims

1- Method for evaluating the voicing of the speech signal by sub-bands in vocoders, characterized in that it consists, after the speech signal has been sampled and then divided into sub-bands, in transposing the resulting signal in each sub-band to baseband (21 ... 2K), in calculating the square of the modulus of the autocorrelation p(M) of the transposed signal in each sub-band (21 ... 2K), in estimating (3, 5) the value of the fundamental frequency of the speech signal by selecting, from among the autocorrelations computed on the signals present in all of the sub-bands, the one whose amplitude is maximum, in selecting (41 ... 4K) in each sub-band the autocorrelation maximum that corresponds to a value of either the estimated fundamental frequency, its multiples and sub-multiples, or a value in their neighborhood, and in correcting (41 ... 4K) the autocorrelation values obtained into voicing rates before their transmission.

2- Method according to claim 1, characterized in that the calculation of the square of the modulus of the autocorrelation (21 ... 2K) is performed for a determined number M of fundamental frequency values in each of the sub-bands.

3- Method according to claim 2, characterized in that the voicing rate value is determined by correcting the value of the square of the modulus of the autocorrelation p(M) obtained for a fundamental frequency value M by the relation:

where N denotes the number of samples considered and p0 is a value to be adjusted according to the application.

4- Method according to claim 3, characterized in that the search for the autocorrelation maxima is carried out by interpolation between determined values of the fundamental frequency.

5- Method according to claim 4, characterized in that the sampled speech signal undergoes, before being divided into sub-bands, a preprocessing consisting of filtering by a filter with transfer function:

where the coefficients a are obtained by autocorrelation of the sampled speech signal and y is a constant close to 0.7.

6- Device for implementing the method according to any one of claims 1 to 5, characterized in that it comprises a device (1) for preprocessing the speech signal, coupled to a set of K devices (21 ... 2K) for calculating the autocorrelation of the corrected speech signal in K sub-bands, and a device (3) for combining the results of the autocorrelations, coupled, via a device (5) for evaluating the period of the fundamental of the speech signal, to K devices (41 ... 4K) for finding the maximum voicing rate from the autocorrelation values supplied by the K autocorrelation calculation devices.