WO2004049303A1 - Analysis of the vocal signal quality according to quality criteria - Google Patents

Analysis of the vocal signal quality according to quality criteria

Info

Publication number
WO2004049303A1
WO2004049303A1 (PCT/IB2003/006355)
Authority
WO
WIPO (PCT)
Prior art keywords
signal
vocal
module
input
modules
Prior art date
Application number
PCT/IB2003/006355
Other languages
French (fr)
Inventor
Anne Blampoix
Original Assignee
Vocebella Sa
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Vocebella Sa filed Critical Vocebella Sa
Priority to AU2003288475A priority Critical patent/AU2003288475A1/en
Publication of WO2004049303A1 publication Critical patent/WO2004049303A1/en

Classifications

    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • GPHYSICS
    • G09EDUCATION; CRYPTOGRAPHY; DISPLAY; ADVERTISING; SEALS
    • G09BEDUCATIONAL OR DEMONSTRATION APPLIANCES; APPLIANCES FOR TEACHING, OR COMMUNICATING WITH, THE BLIND, DEAF OR MUTE; MODELS; PLANETARIA; GLOBES; MAPS; DIAGRAMS
    • G09B19/00Teaching not covered by other main groups of this subclass
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/06Transformation of speech into a non-audible representation, e.g. speech visualisation or speech processing for tactile aids

Definitions

  • the present invention relates to a method of analyzing at least one sound signal, in particular for extracting characteristics therefrom.
  • the aim of the present invention is more particularly to analyze one or more voices taken by themselves or in conversation.
  • a voice is not static and it changes according to many somewhat random parameters, such as the time of day, the weather, one's moods, emotions, health, lifestyle, etc.
  • the need to control one's voice, whatever the circumstances, has become increasingly relevant, especially in certain fields in which the vocal instrument assumes a great importance, such as those of television actors, conference speakers, singers, etc.
  • US-2002/0010587 teaches us a system, a method and an article attempting to detect edginess in the voice.
  • Document WO 01/16938 proposes a system, a method and an article that appear to be capable of detecting certain emotions in a voice.
  • a first main objective of the present invention is to measure a quality level of a voice according to one or more voice quality criteria.
  • a second main objective of the present invention is to measure a quality level of a conversation between various voices according to one or more conversation quality criteria.
  • a third objective is to diagnose the condition of a voice according to the measured quality levels of a voice.
  • a fourth objective is to choose exercises tailored to the diagnostics provided.
  • the invention provides a method of analyzing at least one vocal signal, characterized in that it is implemented by elementary signal processing operations managed by respective modules, each module being capable of converting at least one module input signal into a module output signal representative of a given characteristic of the module input signal, and in that it involves the use of a means for processing the signal from a given module or from a given combination of given modules that receive, as input, at least one vocal signal and deliver, as output, a signal representative of at least one quality level of the vocal signal according to a given quality criterion.
  • figure 1 shows a list of elementary vocal signal processing modules according to the invention
  • figure 2 shows a list of criteria based on the quality of a vocal signal according to the invention
  • figure 3 shows a diagram of a modular configuration of a speech content criterion according to the invention
  • figure 4 shows a diagram of a modular configuration that can provide a diagnostic of the vocal state of a vocal signal according to a speech content criterion according to the invention
  • figure 5 shows a diagram of a modular configuration of a common long silence content criterion according to the invention
  • figure 6 shows a diagram of a modular configuration that can provide a diagnostic of the state of a conversation between vocal signals according to a long silence content criterion according to the invention
  • figure 7 shows a diagram of a modular configuration of a criterion based on the number of long silences of a given vocal signal according to the invention
  • a sound signal is a continuous acoustic pressure wave propagating in time and in space, generated by a sound source.
  • a vocal signal is a sound signal emitted directly or indirectly by a human being or by an animal.
  • the sound signals to be studied relate particularly to those emitted by a human being.
  • the vocal source to be analyzed may be: vibrations of vocal cords of one or more individuals, therefore emitting a voice directly; or - the play-back of a voice recording; or a vocal signal obtained after an artificial vocal creation, that is to say based on non-living devices or instruments capable of creating human voices.
  • the recording may be produced on any recording medium, such as an audio tape, a CD ROM, a hard disk, a floppy disk, etc.
  • the recording format may be analog or digital, such as for example the WAV digital format.
  • the analog signal is denoted by S(t) and is a real signal being output continuously within the time period between 0 and T, measuring the acoustic pressure emitted by one or more vocal sources at each instant t.
  • This analog vocal signal may, for example, be received by an acoustic microphone which then converts the acoustic information into electrical information in order to be able consequently to carry out signal processing using electrical and/or electronic means such as electronic processors and memories.
  • the processing of the signal may then be carried out in an analog manner or digitally.
  • analog signal sampling, the samples being advantageously taken over time in a regular fashion, each time interval separating two consecutive signal sampling events being defined by a sampling period T_e, a sampling frequency F_e being equal to 1/T_e;
  • the chosen sampling frequency within the context of the invention is preferably
  • the signal analysis of the present invention is usually carried out locally; analyses will therefore be preferentially carried out on signal portions that will be isolated within weighting windows.
  • the signal is multiplied by a compact support function, which is, more precisely, zero outside the time interval of study, also called a weighting function w(k), k representing a set of positive integers between 0 and M-1 and M being an integer giving the number of points contained within the weighting window of temporal extent M·T_e.
  • l representing a set of positive integers between 0 and L-1 and L being an integer giving the number of instants of analysis.
  • the number of samples separating two consecutive instants of analysis is less than or equal to M so as to have at least one analysis per weighting window.
  • the digital signal is then processed and analyzed directly, or is recorded in an electrical or electronic memory to be analyzed later.
  • the analysis of a vocal signal does not refer only to temporal analysis of the vocal signal, but also to frequency analysis.
  • a short-term frequency analysis of the signal is advantageously carried out by applying, to the temporal frames, a fast Fourier transform (FFT).
  • the frequency resolution, or frequency spacing, of the signal is given by the expression F_e/N.
  • the frame is advantageously supplemented with zeros until obtaining the N points needed for the calculation of s_l(n).
  • s_l(n) represents the mean intensity over frame l of the frequency nF_e/N, and constitutes the spectrum of the signal.
  • a frequency spacing F_e/N equal to 2.6 Hz is then obtained, which is a value small enough to make it possible to distinguish, within the spectrum, similar vocal frequencies of a human voice, which may vary from about 70 to about 1100 Hz.
  • a representation in terms of gray levels of the signal, with the instants of analysis t_l plotted on the x-axis, the frequencies nF_e/N plotted on the y-axis and the amplitudes in dB represented in gray levels, will be called here a "spectrogram".
  • the spectral signal and the temporal signal obtained directly from the original vocal emission then constitute the raw material on the basis of which signal analyses will be carried out in order to extract the desired characteristics therefrom.
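To make the framing, weighting and short-term FFT analysis described above concrete, the following sketch (in Python/NumPy, which the patent does not prescribe) computes a dB spectrogram from a sampled signal. The window type, frame length, hop size and FFT size are illustrative choices, not values taken from the text.

```python
import numpy as np

def spectrogram(s, fe, m=1024, hop=256, n=4096):
    """Cut the sampled signal s into frames of M points, weight each frame by a
    window w(k), zero-pad to N points and apply an FFT, giving one spectrum per
    instant of analysis t_l.  Returns (analysis times, dB spectra)."""
    w = np.hanning(m)                              # weighting function w(k), compact support
    times, frames = [], []
    for start in range(0, len(s) - m + 1, hop):
        frame = s[start:start + m] * w             # signal multiplied by the weighting window
        spec = np.fft.rfft(frame, n)               # frame supplemented with zeros up to N points
        frames.append(20 * np.log10(np.abs(spec) + 1e-12))   # amplitudes in dB
        times.append(start / fe)                   # instant of analysis t_l
    return np.array(times), np.array(frames)       # frequency spacing of the analysis is fe / n
```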
  • the signal analysis methodology that will be used here is based on elementary signal processing steps controlled by respective modules.
  • Each module, stored in memory, usually represents an algorithm for converting at least one input signal into an output signal representative of a given characteristic of the input signal.
  • An electrical or electronic device such as a processor, is advantageously used in the signal analysis method to recover the vocal signals, carry out analytical calculations on the signals on the basis of the modules stored in memory and recover the signals representative of information coming from the vocal analysis calculations in order to store this information in memory and/or send it to a communication means capable of communicating this information to a person in a format comprehensible by that person, such as a graphics display format using a screen as medium.
  • In figure 1, each module is identified by a number (e.g. M1) which will be adopted in the rest of the document.
  • the description of the modules is of the input/output type, namely inputs on the left of the module and output on the right of the module.
  • the analysis of the signal necessarily starts by using this module M1.
  • This use of the module allows the vocal signal to be processed in order to have, as output from the module, a digital acoustic pressure signal, propagating over discretized time, characterized by a sampling frequency.
  • the discretized signal has its values within the real interval [-1; 1].
  • - background-noise and speech level estimation module M2 the use of which comprises the steps consisting in: receiving, as module input, a vocal signal; delivering, as module output, a signal representative of at least one maximum background noise level threshold and a minimum speech level threshold of the vocal signal received as module input.
  • the minimum speech level threshold is generally found from the background noise level threshold increased by a certain value, which may be zero in certain cases. The sole signal estimate remaining to be made is then an estimate of the background noise.
  • the estimation of the background noise is a necessary step in order to be able to distinguish, in a vocal signal, "that which is heard from that which is not heard".
  • That which is heard means here that which emerges sufficiently from the background noise.
  • the background noise is estimated from a recording without any voice.
  • This recording is advantageously used shortly before the start of the emission of the vocal signal that it is desired to analyze and under substantially the same conditions so that the background noise does not change significantly, and therefore so that the background noise data recorded is substantially identical to the background noise data of the vocal signal.
  • the recorded noise signal, denoted S_b(t), with a time parameter t that is between 0 and T, is advantageously digitized using the method described above, delivering a digital temporal signal S_b(k) and a digital frequency signal s_b(n). It should be noted that the background noise measurement time T must be long enough for the statistics employed to be meaningful.
  • the background noise denoted by bgn(n) is advantageously estimated as a maximum envelope of the spectrum, frequency by frequency.
  • the background noise bgn(n) is then especially a function of:
  • nF_e/N. The expression for the background noise at the frequency nF_e/N is then advantageously given as a function of a multiplying coefficient a, which is to be chosen. It may in particular have to be linked to a certain threshold value of a Gaussian distribution.
  • a multiplying coefficient of 2 is advantageously linked to a threshold of a Gaussian distribution in which 2.5% of the samples exceed this threshold.
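A minimal sketch of this first background-noise estimation method, assuming the voice-free recording has already been converted into one linear magnitude spectrum per analysis frame: the maximum envelope is taken frequency by frequency and scaled by the multiplying coefficient a (2 in the example above).

```python
import numpy as np

def estimate_background_noise(noise_spectra, a=2.0):
    """noise_spectra: (frames x bins) linear magnitude spectra of a voice-free
    recording.  Returns bgn(n): the maximum envelope of the noise spectrum,
    frequency by frequency, scaled by the multiplying coefficient a."""
    return a * np.asarray(noise_spectra).max(axis=0)
```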
  • the background noise is estimated directly on the recording of the vocal signal, and not on a separate recording without a voice, as in the case of the first method of determining the background noise.
  • a first range of the recording of the vocal signal contains a recording of the silence, as it had been done during the first method of determining the background noise, for a typical duration of a few seconds, followed directly by a recording of the signal containing the vocal information in a second range of the recording.
  • a first step of determining the background noise consists in separating the silence range from the non-silence range.
  • a second step of determining the background noise is then identical to the first method of determining the background noise.
  • - silence zone and speech zone segmentation module M3 the use of which comprises the steps consisting in:
  • the output signal is advantageously a binary signal, for example with a 0 signal level assigned to the silence zones and a 1 signal level assigned to the speech zones.
  • this module is therefore used to recognize the silence zones from the speech zones in the vocal signal.
  • the zones of the temporal signal that have an amplitude and/or an intensity above a defined threshold value or several defined threshold values are regarded as constituents of the vocal information.
  • the other zones of the temporal signal are regarded as silence zones in the vocal signal.
  • the module thus acts as a vocal signal filter, especially with reference to the background noise signal (thus representing a "silence" reference in the vocal signal), in order to distinguish the spoken sound from the noise sound and thus to separate the speech zones from the silence zones.
  • Analyses carried out after segmentation of the signal into speech and silence zones may thus be created and carried out, such as signal overlap, silence duration and speech duration analyses, or other analyses for identifying, for example, speech zones that would correspond in fact to noise zones, such as for example smacking of the lips, and noise zones that would correspond to speech zones.
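One possible reading of module M3 is sketched below: a frame is assigned to the speech level when enough of its spectral bins emerge above the background-noise thresholds delivered by module M2, and to the silence level otherwise. The bin-count threshold is an illustrative parameter, not one specified in the text.

```python
import numpy as np

def segment_speech_silence(spectra, bgn, min_bins=5):
    """spectra: (frames x bins) linear magnitude spectra of the vocal signal.
    bgn: per-frequency background-noise thresholds from module M2.
    A frame is marked 1 (speech) when at least min_bins bins emerge above the
    background noise, and 0 (silence) otherwise."""
    emerging = (np.asarray(spectra) > np.asarray(bgn)).sum(axis=1)
    return (emerging >= min_bins).astype(int)
```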
  • modules such as the following five modules may be used after module M3: - level occupancy module M4A, the use of which comprises the steps consisting in: receiving, as module input, a temporal signal divided amplitudewise into at least two levels; and
  • the module output signal is then representative of the speech content in the vocal signal.
  • common level occupancy module M4B, the use of which comprises the steps consisting in: receiving, as at least two module inputs, at least two respective temporal signals, each divided amplitudewise into at least two levels; and
  • the module output signal is then representative of the amount of silence occupied in common by the vocal signals.
  • module M4C for the number of long level intervals, the use of which comprises the steps consisting in: receiving, as at least two module inputs, at least two respective temporal signals, each divided amplitudewise into at least two levels; and delivering, as module output, an output signal representative of the number of long time intervals within a given signal level of at least one temporal signal, an interval becoming long, on the basis of a stored threshold interval value, after a time interval of at least one other temporal signal in a level other than the given level.
  • the module output signal is then representative of the number of long silence intervals of the first signal that follow the speech intervals of the second signal.
  • module M4D for the number of level overlaps, the use of which comprises the steps consisting in: receiving, as at least two module inputs, at least two respective temporal signals, each divided amplitudewise into at least two levels; and delivering, as module output, an output signal representative of the number of time intervals for which at least two signals have respectively the same given signal level, at least one of these signals not having this given level after the interval and at least one other of these signals not having this given level after the interval.
  • the module output signal is then representative of the number of speech overlaps of the first and second signals.
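The occupancy and overlap modules M4A to M4D all operate on binary level signals (0 = silence, 1 = speech) such as those produced by module M3. The sketch below gives one plausible implementation of each; the exact counting rules (for instance what makes a silence "long") are simplified interpretations of the definitions above, and the threshold is passed as a parameter.

```python
import numpy as np

def speech_content(levels):
    """M4A sketch: fraction of the signal occupied by the speech level (1)."""
    return float(np.mean(np.asarray(levels) == 1))

def common_silence_content(levels_a, levels_b):
    """M4B sketch: fraction of the time both signals are in the silence level (0)."""
    a, b = np.asarray(levels_a), np.asarray(levels_b)
    return float(np.mean((a == 0) & (b == 0)))

def runs(levels, value):
    """Helper: (start, length) of each maximal run of `value` in a level signal."""
    out, start = [], None
    for i, v in enumerate(levels):
        if v == value and start is None:
            start = i
        elif v != value and start is not None:
            out.append((start, i - start))
            start = None
    if start is not None:
        out.append((start, len(levels) - start))
    return out

def long_silences_after_speech(levels_a, levels_b, min_len):
    """M4C sketch: silences of signal A, at least min_len frames long, that
    begin just after signal B was speaking."""
    b = np.asarray(levels_b)
    return sum(1 for start, length in runs(np.asarray(levels_a), 0)
               if length >= min_len and start > 0 and b[start - 1] == 1)

def speech_overlaps(levels_a, levels_b):
    """M4D sketch: number of intervals during which both signals are speaking."""
    both = ((np.asarray(levels_a) == 1) & (np.asarray(levels_b) == 1)).astype(int)
    return len(runs(both, 1))
```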
  • - module M5 for segmentation of the steady zones, the use of which comprises the steps consisting in: receiving, as a first module input, a vocal signal and, as a second module input, a signal representative of the division of a vocal signal into respective silence and speech time zones; and delivering, as module output, an output signal representative of a division of the input vocal signal into steady and nonsteady zones - a zone of the temporal vocal signal is steady if the portion of the signal that it contains is sufficiently distinct from the portions of the signal that are adjacent to the zone, and especially if there is a sufficient break between characteristics of the signal contained in the zone as zone output and/or input and characteristics of the portions of the signal that are adjacent to the zone, and such a break is sufficient if it is greater than a stored threshold break value, the output signal consisting of the vocal input signal with a given signal level replacing the silence zones and the nonsteady zones.
  • This module therefore identifies the steady zones of the signal by statistical estimation of the model break type.
  • the stored model may be an identification of a sound or of a voice pitch or the like.
  • This module is used in particular to identify phonemes in a vocal signal.
  • those parts of the vocal signal corresponding to the speech zones may then be analyzed so as to determine the vocal quality of this signal.
  • - sound pitch module M7 the use of which comprises the steps consisting in: receiving, as a first module input, a vocal signal and, as a second module input, a signal representative of a division of a vocal signal into respective silence and speech time zones; and delivering, as module output, an output signal representative of the respective local fundamental frequencies of each speech zone of the vocal signal.
  • the sound pitch corresponds to the fundamental frequency perceived at each instant.
  • This module detects the pitch in the various time frames of each speech zone.
  • the processing associated with this module advantageously takes place in two steps:
  • the first processing step comprises firstly detection of the partials, each partial being a sinusoidal temporal component of the vocal signal represented by spectral lines. It should be noted that the spectral lines are broad and may also possess secondary lobes as a result of convolutions of the temporal signal by the weighting function chosen for the analysis. Detection of the partials takes account of:
  • the center of a partial is defined here by a strict local maximum of the spectrum which: - emerges sufficiently above the background noise;
  • the start of the partial generally corresponds to the lowest local minimum to the left of the center of the partial within a size limit imposed by the width of the weighting window.
  • the start is advantageously denoted as being the boundary point of the weighting window.
  • the algorithm used in this module uses especially curve mask techniques to isolate the partials.
  • the input data for the algorithm are: x(n) = 20·log10 of the modulus of the spectrum of the vocal signal, n being an integer between 0 and N-1;
  • y(n), a reference base: a spectrum constituting a floor value for detecting the peaks of the partials, and taking at least partly the spectrum of the background noise, n being an integer between 0 and N-1;
  • z(n), a mask, n being an integer between 0 and N-1, initially set to minus infinity or to a negative value that is high in absolute value, which takes into account the amplitude of the mask induced by each peak detected, z(n) advantageously being expressed in decibels.
  • the vocal signal x(n) as module input is smoothed.
  • the envelope having the maximum amplitude is determined, which will define the main lobe, or peak, of the analysis weighting window, which will constitute a reference in the rest of the analysis.
  • - D which is a half-width, that is to say the distance separating the start (or the end) of the peak from its center; it is preferably set as the half-width of the main lobe of the FFT of the weighting window;
  • - A, which is the attenuation of the mask around each peak; it is preferably set as the attenuation of a secondary lobe with respect to the main lobe of the FFT of the weighting window, increased by 5 dB;
  • - P, which is the multiplicative slope in dB/octave of the mask of each peak; it is preferably set as being the attenuation slope of the secondary lobes and thus depends in general on the weighting window; and - H, which is the minimum height of a peak relative to the highest peak; the minimum height is preferably set at 60 dB - a peak whose height deviates by more than 60 dB from the height of the main peak is therefore not retained.
  • the steps carried out by the algorithm may be, for example, the following in succession:
  • if n is a strict local maximum of x which emerges sufficiently from the base (that is to say x(n) > y(n) + E) and from the mask (that is to say x(n) > z(n)), then:
  • n is adopted as being the middle of a peak
  • the start of the peak is then sought, starting from the middle of the peak, without going beyond the half-width, for an integer j varying from n - 1 to n - D: if j is a local minimum of the signal x, j is the start of the peak;
  • the mask is reset only away from the peak found, the new mask being the maximum between the old mask and the expected attenuation on the secondary lobes of the peak (partial) detected.
  • This attenuation, equal to A at j - D, possesses a slope of P (in dB per octave) and is symmetrical with respect to the middle of the peak;
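The mask-based detection of partials can be sketched as follows, with x the spectrum in dB, y the floor and z the mask as defined above, and with the half-width D, the attenuation A, the slope P and the emergence E passed as parameters. The numeric defaults are placeholders rather than the patent's preferred settings.

```python
import numpy as np

def detect_partials(x, y, e=5.0, d=4, a=40.0, p=12.0):
    """x: spectrum in dB; y: floor spectrum (e.g. background noise).
    e: emergence above the floor (dB); d: half-width of a peak (bins);
    a, p: attenuation (dB) and slope (dB/octave) of the mask around each peak.
    Returns the (start, centre) bins of the detected partials."""
    n_bins = len(x)
    z = np.full(n_bins, -np.inf)                 # mask, initially "minus infinity"
    partials = []
    for n in range(1, n_bins - 1):
        is_max = x[n] > x[n - 1] and x[n] > x[n + 1]
        if is_max and x[n] > y[n] + e and x[n] > z[n]:
            centre = n                           # middle of a peak
            start = centre
            for j in range(centre - 1, max(centre - d, 0) - 1, -1):
                start = j                        # look for the start of the peak
                if j > 0 and x[j] <= x[j - 1]:   # local minimum: start found
                    break
            partials.append((start, centre))
            # reset the mask away from the peak: attenuation a at distance d,
            # then a slope of p dB per octave, symmetric about the centre
            for k in range(n_bins):
                dist = abs(k - centre)
                if dist > d:
                    z[k] = max(z[k], x[centre] - a - p * np.log2(dist / d))
    return partials
```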
  • the first processing phase is then based on a family of partials of a spectrum of the vocal signal, on the basis of which the module M7 carries out the following steps:
  • Extracted firstly from this family is a sufficiently energetic and populated subfamily, representative of the main harmonics of the human voice.
  • the partials that emerge from the background noise by at least a value E1, typically equal to 5 dB, are selected. If this selection contains fewer than a defined minimum number of partials, typically equal to 3, or if the selection contains no partial emerging from the background noise by a value of at least E2, typically equal to 20 dB, then it is considered that the spectrum analyzed contains no pitch.
  • an energy threshold reference is set equal to 0 for the lowest partial and an energy ceiling reference is set equal to 1 for the highest partial, the height of the partial being found at the center of the partial, the respective energies of the other partials then lying between these two references.
  • For a partial to be considered as the partial corresponding to a fundamental frequency, taken at the center of the partial and denoted by f0, it must satisfy certain conditions.
  • - the energy of the partial exceeds a threshold value, typically equal to 0.7 if the energies of the partials are considered to be between 0 and 1;
  • a subharmonic is of rank 1 if a partial containing f0/2 exists, the center of which is located less than a certain frequency difference from f0/2, typically equal to 3 Hz, and the energy of which differs from the energy of the partial of the hypothetical fundamental frequency by less than a certain energy difference, typically equal to 20 dB;
  • a superharmonic is of rank 1 if a partial containing f0*2 exists, the center of which is located less than a certain frequency difference from f0*2, typically equal to 3 Hz, and the energy of which differs from the energy of the partial of the hypothetical fundamental frequency by less than a certain energy difference, typically equal to 20 dB.
  • if such a partial exists, the first one, that is to say the one representing the lowest frequency, is retained and the fundamental is declared to be present.
  • for each partial, its rank in the harmonics is then calculated (0 if it is not a harmonic, k if k*f0 is contained in the partial).
  • the pitch is re-estimated by interpolation of the positions of the centers of the "harmonic" partials in the ranks of these harmonics.
  • the harmonics of rank below a certain value, typically 10, are selected. In the case in which the number of harmonics selected is too small, the pitch is not re-estimated. Otherwise, a re-estimation of the pitch is carried out.
  • the partial therefore contains the harmonic of rank k if ka + b ∈ [F1 ; F2].
  • the ranks of the partials are then re-estimated.
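A compact sketch of the pitch decision and re-estimation just described, under simplifying assumptions: only the rank-1 subharmonic test is shown, and the re-estimation is a linear fit of partial centre frequencies against harmonic rank, matching the ka + b notation above. The numeric thresholds mirror the "typical" values quoted in the text.

```python
import numpy as np

def estimate_pitch(partials, energies, e1=5.0, e2=20.0, df=3.0, de=20.0, e_min=0.7):
    """partials: centre frequencies (Hz) of detected partials, ascending order;
    energies: their heights in dB above the background noise."""
    p = np.asarray(partials, float)
    h = np.asarray(energies, float)
    keep = h >= e1                                   # emerge by at least E1
    if keep.sum() < 3 or h.max() < e2:               # too few or too weak partials
        return None                                  # -> no pitch in this spectrum
    p, h = p[keep], h[keep]
    h_norm = (h - h.min()) / max(h.max() - h.min(), 1e-9)   # energies mapped to [0, 1]
    for i in range(len(p)):                          # lowest-frequency candidates first
        f0 = p[i]
        if h_norm[i] < e_min:
            continue
        # reject the candidate if a plausible subharmonic near f0/2 exists
        if np.any((np.abs(p - f0 / 2) < df) & (np.abs(h - h[i]) < de)):
            continue
        # re-estimate the pitch by a linear fit "k*a + b" over the harmonic partials
        ranks, freqs = [], []
        for k in range(1, 11):
            j = int(np.argmin(np.abs(p - k * f0)))
            if abs(p[j] - k * f0) < df:
                ranks.append(k)
                freqs.append(p[j])
        if len(ranks) >= 2:
            slope, _ = np.polyfit(ranks, freqs, 1)
            return float(slope)
        return float(f0)
    return None
```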
  • a second phase of the processing carried out by module M7 consists in removing overall nonconforming points and local nonconforming points.
  • Points are considered as being nonconforming relative to a predetermined norm, which may be an overall norm (that is to say over all the analysis windows) or a local norm (that is to say only over a single analysis window).
  • the mean m and the standard deviation σ of the pitches of the vocal signal, in 440 Hz semitones, that are obtained on a temporal family of spectra are calculated.
  • These statistics are advantageously calculated after eliminating the X highest values and the Y lowest values, X and Y typically and respectively being equal to 0 and 1.
  • an acceptance threshold is calculated: the accepted x values are then those for which |x - m| ≤ aσ; a being a predetermined coefficient and advantageously chosen depending on the type of sound that is expected of the signal, or depending on a more ad hoc distribution model than the Gaussian model. a is typically equal to 4.
  • One solution consists in constituting hard thresholds, corresponding to sound pitches that cannot be attained by a human being, or cannot be obtained owing to the profile of the speaker/singer, or that cannot be obtained owing to the demand placed on the speaker/singer.
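The removal of the overall nonconforming points can be sketched as a trimmed mean and standard-deviation test on the pitches expressed in 440 Hz semitones, as below; the trimming counts and the coefficient a follow the typical values given above.

```python
import numpy as np

def remove_global_outliers(pitches_hz, a=4.0, drop_high=0, drop_low=1):
    """Convert the pitches to 440 Hz semitones, compute the mean m and standard
    deviation sigma after dropping the highest/lowest values, and keep only the
    points with |x - m| <= a * sigma."""
    pitches_hz = np.asarray(pitches_hz, float)
    st = 12.0 * np.log2(pitches_hz / 440.0)
    trimmed = np.sort(st)
    if drop_low:
        trimmed = trimmed[drop_low:]
    if drop_high:
        trimmed = trimmed[:-drop_high]
    m, sigma = trimmed.mean(), trimmed.std()
    return pitches_hz[np.abs(st - m) <= a * sigma]
```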
  • Removal of the local nonconforming points makes it possible, for its part, to remove false pitch detections of the f0/2 or 2·f0 type.
  • a method proposed here consists in examining, in sliding time slots, the pitches detected.
  • the nonconforming points are identified by comparing the scanning window with a left-hand window (located immediately to the left of the scanning window) and a right-hand window (located immediately to the right of the scanning window). In order for there to be a local nonconforming point, it is then necessary for:
  • This module identifies the energy distribution of each speech zone according to the various harmonics detected.
  • the mean energy of a speech zone of the vocal signal is the energy of the signal at the useful frequencies located within the harmonic partials, the useful frequencies of a speech zone being those lying within a parameterizable frequency band.
  • the mean is advantageously calculated over substantially all the spectra having a pitch.
  • the signal is thresholded to zero below the background noise.
  • the energy is an L2 norm on a linear spectrum (abs(FFT)). More precisely:
  • - sound volume module M18, the use of which comprises the steps consisting in: receiving, as module input, a vocal signal;
  • delivering, as module output, an output signal representative of a temporal distribution of the sound volume of the vocal signal.
  • This module calculates the local sound volume of the audio signal as input.
  • vocal characteristics are then analyzed, according to quality criteria of the incoming vocal signal, by modules according to at least one of the two following methods: - by calculating, from these characteristics, quantities representative of the quality levels of the vocal signal according to given quality criteria; or
  • the algorithm for the calculations of this type of analysis being contained in one or more modules.
  • the use of a given module or of a given combination of given modules after receiving, as module input, a vocal signal and/or a signal after processing of the vocal signal delivers a module output signal representative of a classification of at least a portion of the vocal signal in a given category of a given vocal criterion, according to the following steps: - reception of at least one portion of the signal or signals representative of at least one quantity;
  • an example of a vocal quality criterion is a sound pitch criterion,
  • the stored categories then representing various sound pitches associated with configured frequency intervals representative of a set of pitches of an audible signal.
  • given these voice pitch categories, it is then possible, by comparing the pitch of a vocal signal with them, to find the voice pitches contained in this signal.
  • examples of sound pitch categories are: deep ([150 ; 250] Hz for example), medium ([275 ; 351] Hz for example) and medium-high ([351 ; 450] Hz for example), or else: bass, baritone, tenor, contralto, soprano, etc.
  • the input vocal signal is a sung voice from which it is intended to determine the notes emitted based on more complex criteria.
  • a note of the sung signal is in particular identified from a pitch, its ends (the start and end of the note) being located by finding the breaks in the pitch curve. These breaks coincide with the local maxima of the modulus of the derivative (that is to say the high-slope points of inflexion). These breaks are combined with the natural boundaries between notes, such as long ranges in which no pitch is detected. It should be pointed out that notes sufficiently close together (in terms of time and pitch) are merged into a single note.
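For the sung-voice case, the note segmentation described above can be sketched as follows: boundaries are placed at local maxima of the modulus of the pitch derivative and at frames where no pitch was detected. The slope threshold is an illustrative parameter, and the final merging of notes that are close in time and pitch is left out for brevity.

```python
import numpy as np

def segment_notes(times, pitches, slope_thresh=2.0):
    """times: analysis instants (s); pitches: pitch in semitones, np.nan where
    no pitch was detected.  Returns (start, end) index pairs of candidate notes,
    with boundaries at high-slope points of the pitch curve and at pitch gaps."""
    times = np.asarray(times, float)
    pitches = np.asarray(pitches, float)
    d = np.abs(np.gradient(np.nan_to_num(pitches), times))   # |derivative| of the pitch
    boundaries = [0]
    for i in range(1, len(d) - 1):
        strong_break = d[i] > slope_thresh and d[i] >= d[i - 1] and d[i] >= d[i + 1]
        if strong_break or np.isnan(pitches[i]):
            boundaries.append(i)
    boundaries.append(len(pitches) - 1)
    return list(zip(boundaries[:-1], boundaries[1:]))
```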
  • module M8 for classification in a given sound, the use of which comprises the steps consisting in: receiving, as a first module input, a vocal signal and, as a second module input, a signal representative of at least one local fundamental frequency corresponding to at least one respective speech zone of a vocal signal;
  • This module especially detects types of vowels present in the vocal signal, especially by means of the local pitch received on an input.
  • a stored voice pitch model already discussed earlier, also called a voice register, the voice pitch categories of which are defined by vocal frequency intervals;
  • the comparison step is carried out according to the following two main steps:
  • module M20 for classification according to a given model (not shown in a figure), the use of which comprises the steps consisting in: receiving, as a first module input, a vocal signal and, as a second module input, a signal representative of at least one sound category of a vocal signal; comparing the input vocal signal and the sound category or categories of the vocal signal with at least one stored signal quantity representing a leveled threshold, defining at least two regions, each region being associated with a given level depending on the given model type; deducing the level or levels to which the respective sound category or categories of the input vocal signal belongs or belong; and delivering, as module output, an output signal representative of the level or levels deduced from the vocal signal according to the given model.
  • module M16 for calculating a voice pitch difference relative to a voice pitch model, the use of which comprises the steps consisting in: receiving, as module input, a signal representative of at least one fundamental frequency of a vocal signal;
  • This module calculates the separation between the input pitch of the module and a fixed pitch model.
  • module M17 for calculating the difference in voice intonation relative to a voice intonation model, the use of which comprises the steps consisting in: receiving, as module input, a signal representative of a temporal change of at least one fundamental frequency of a vocal signal; comparing the temporal change of the input fundamental frequency with a stored intonation model;
  • This module calculates the separation between the input intonation of the module and a fixed intonation model.
  • module M6 for classifying a quantity of a vocal signal, the use of which comprises the steps consisting in:
  • This module thresholds each quantity that is presented to it as input.
  • - module M10 for classifying a quantity of a vocal signal according to an input parameter comprises the steps consisting in: receiving, as a first module input, a signal representative of at least one quantity of a vocal signal and, as a second module input, a signal representative of at least one category of a parameter of a vocal signal; comparing the input quantity with at least one stored quantity, defining at least two regions, each region being associated with a given category of a given vocal criterion, the value of each stored quantity being a function of the input parameter or parameters; deducing the category to which each quantity of the input vocal signal belongs;
  • This module automatically thresholds an input quantity according to an input parameter.
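Modules M6 and M10 are essentially thresholding operations; a minimal sketch is given below, followed by a hypothetical usage example that maps a pitch onto register categories similar to those quoted above (the boundary values are illustrative).

```python
def classify_quantity(value, thresholds, categories):
    """M6 sketch: threshold an input quantity into stored categories.
    thresholds: ascending boundaries; categories: one label per region."""
    return categories[sum(value > t for t in thresholds)]

def classify_with_parameter(value, parameter, thresholds_by_parameter, categories):
    """M10 sketch: same thresholding, but the stored thresholds depend on an
    input parameter (for instance the vowel being pronounced)."""
    return classify_quantity(value, thresholds_by_parameter[parameter], categories)

# Hypothetical usage: map a pitch (Hz) onto register-like categories.
register = classify_quantity(300.0, [250.0, 351.0, 450.0],
                             ["deep", "medium", "medium-high", "high"])
```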
  • a quality criterion of a given vocal signal will be defined by a set of given modules connected together in a given combination and receiving, as input, at least one vocal signal and delivering, as output, a signal representative of a vocal signal quality level according to a quality criterion given by the combination of the modules.
  • Two broad categories of criteria may be defined: - the vocal quality criteria of the input vocal signal, which give a quality level of the emitted voice; with reference to figure 2, this category comprises the following criteria: vocal tonicity C6, vocal presence C7, vocal nasality C9, voice correctness C12 and voice intonation C13; - the quality criteria of a conversation, a conversation involving an interaction of a number of separate, preferably synchronized, vocal signals which give a quality level of the conversation; with reference to figure 2, this category comprises the following criteria: speech content of one of the vocal signals of the conversation C1, content of long silences common to vocal signals of the conversation C2, number of long silences in one of the signals of the conversation C3, number of signal level overlaps between vocal signals of the conversation C4 and speech rate of one of the vocal signals of the conversation C5.
  • - speech content criterion C1: this comprises, as shown in figure 3, the modules M2, M3 and M4A (the module M4A giving a temporal occupancy of a temporal signal within a fixed speech level), these being configured so that a vocal signal, after having been processed by a module M1, is received at the input of the module M2 and at the first input of the module M3, the output signals of the modules M2 and M3 are then sent to the second inputs of the modules M3 and M4A respectively, the output signal of the module M4A then being representative of the speech content in the vocal signal.
  • This criterion therefore makes it possible to obtain the speech time of the speaker relative to a signal duration.
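Assuming the illustrative module functions sketched earlier in this text (spectrogram for M1, estimate_background_noise for M2, segment_speech_silence for M3 and speech_content for M4A), criterion C1 reduces to wiring them together. Treating the first frames of the recording as voice-free stands in here for the separate noise recording used by module M2.

```python
def speech_content_criterion(vocal_signal, fe):
    """Criterion C1 as the chain M1 -> M2 -> M3 -> M4A, reusing the illustrative
    module sketches given earlier in this text."""
    times, spectra_db = spectrogram(vocal_signal, fe)            # M1: framing + FFT
    spectra_lin = 10 ** (spectra_db / 20)                        # back to linear magnitudes
    noise_floor = estimate_background_noise(spectra_lin[:10])    # M2: first frames assumed voice-free
    levels = segment_speech_silence(spectra_lin, noise_floor)    # M3: 0 = silence, 1 = speech
    return speech_content(levels)                                # M4A: speech content of the signal
```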
  • - criterion C2 for the content of long silences common to the vocal signals of a conversation: this corresponds, as shown in figure 5, to a number n of modules M2, n modules M3 and a module M4B with n inputs (the module M4B giving a simultaneous temporal occupancy of n temporal signals within a fixed silence level), these being configured so that a number n of vocal signals (n being equal to 2 in the example illustrated, the signals being labeled P1 and P2), after having each been processed by a module M1, are received, in the case of each of them, at a respective input of a module M2 and at a first input of a module M3 so that each module M2 or M3 receives only a single vocal signal, the output signal of each module M2 is then transmitted to the second input of the module M3 that has received the same vocal signal at its first input as that received by this module M2, each of the output signals of the modules M3 is then transmitted respectively to a single input of the module M4B so that each input of the module M4B receives only a single signal, the output signal of the module M4B then being representative of the content of long silences common to the vocal signals received.
  • - criterion C3 for the number of long silences of a given vocal signal comprises, as shown in figure 7, two modules M2, two modules M3 and one module M4C having two inputs (the module M4C giving a number of long time intervals within a fixed silence level of a temporal signal), these being configured so that two vocal signals, after having each been processed by a module M1, are each received at a respective input of a module M2 and at a first input of a module M3 so that each module M2 or M3 receives only a single vocal signal, the output signal of each module M2 is then transmitted to the second input of the module M3 having received the same vocal signal at its first input as that received by this module M2, each of the output signals of the modules M3 is then sent respectively to a single input of the module M4C so that each input of the module M4C receives only a single signal, the output signal of the module M4C then being representative of the number of long silences of one of the two vocal signals received.
  • the criterion output number therefore represents the number of time intervals corresponding to a silence of a first interlocutor after an intervention by the second interlocutor.
  • - criterion C4 for the number of speech interruptions of a first signal comprises, as shown in figure 9, two modules M2, two modules M3 and one module M4D having two inputs (the module M4D giving a number of time intervals for which two signals have respectively the same fixed speech level), these being configured so that two vocal signals, after having each been processed by a module M1, are each received at a respective input of a module M2 and at a first input of a module M3 so that each module M2 or M3 receives only a single vocal signal, the output signal of each module M2 is then transmitted to the second input of the module M3 having received the same vocal signal at its first input as that received by this module M2, each of the output signals of the modules M3 is then respectively transmitted to a single input of the module M4D so that each input of the module M4D receives only a single signal, the output signal of the module M4D then being representative of the number of speech interruptions of one of the two vocal signals received.
  • the criterion output number therefore represents the number of times one of the interlocutors interrupts the speech of the other.
  • - speech rate criterion C5: this comprises, as shown in figure 11, the modules M2, M3 and M5 which are configured so that a vocal signal, after having been processed by a module M1, is received at the input of the module M2 and at the first respective inputs of the modules M3 and M5, the output signals of the modules M2 and M3 are then respectively transmitted to the second inputs of the modules M3 and M5, the output signal of the module M5 then being representative of the speech rate level in the vocal signal.
  • This criterion therefore makes it possible to measure the speech rate of a speaker. This rate is expressed in a unit proportional to the number of phonemes pronounced by the speaker.
  • - vocal tonicity criterion C6: this comprises, as shown in figure 13, the modules M2, M3 and M9 that are configured so that a vocal signal, after having been processed by a module M1, is received at the input of the module M2 and at the respective first inputs of the modules M3 and M9, the output signals of the modules M2 and M3 are then respectively transmitted to the second inputs of the modules M3 and M9, the output signal of the module M9 then being representative of the level of vocal tonicity in the vocal signal.
  • This criterion measures the tonicity of the voice of a speaker, this being inversely proportional to the vocal fatigue.
  • the vocal tonicity here is directly associated with the energy in the voice - it may also be representative of a breathiness level in the voice.
  • a breath is recognized if the voice is not pure, that is to say if it also expends energy to generate background noise in addition to creating the desired sounds. It is especially by comparing the energy of the vocal sound (that is to say the energy of the harmonic frequencies) with that of the nonvocal sound (that is to say the energy of the nonharmonic frequencies) that a vocal tonicity level is found.
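One way to read this criterion: the tonicity level is derived from the ratio of the energy carried by the harmonic frequencies to the energy of the remaining, non-harmonic frequencies, computed as an L2 norm on the linear spectrum as indicated earlier. The sketch below follows that reading; the width allotted to each harmonic band is an illustrative parameter.

```python
import numpy as np

def vocal_tonicity(spectrum, bin_hz, f0, half_width_hz=10.0):
    """Ratio (in dB) of the energy at the harmonic frequencies (multiples of the
    pitch f0, assumed > 0) to the energy of the remaining, non-harmonic
    frequencies, using an L2 norm on the linear spectrum."""
    spectrum = np.asarray(spectrum, float)
    freqs = np.arange(len(spectrum)) * bin_hz
    harmonic = np.zeros(len(spectrum), dtype=bool)
    for k in range(1, int(freqs[-1] // f0) + 1):
        harmonic |= np.abs(freqs - k * f0) <= half_width_hz
    e_harm = np.sum(spectrum[harmonic] ** 2)
    e_rest = np.sum(spectrum[~harmonic] ** 2)
    return 10 * np.log10((e_harm + 1e-12) / (e_rest + 1e-12))
```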
  • - vocal presence criterion C7: this comprises, as shown in figure 15, the modules M2, M3, M7, M8 and M11, the module M11 being a module M20 capable of classifying a vocal signal by level according to a given vocal presence model, these being configured so that a vocal signal, after having been processed by a module M1, is received at the input of the module M2 and at the first respective inputs of the modules M3, M7, M8 and M11, the output signals of the modules M2, M3, M7 and M8 are then transmitted respectively to the second inputs of the modules M3, M7, M8 and M11, the output signal of the module M11 then being representative of the vocal presence level in the vocal signal.
  • This criterion measures the vocal presence of a speaker, that is to say a capability of a voice to hold the attention of its audience.
  • the vocal presence is especially determined by determining low frequencies in the signal.
  • this comprises the modules M2, M3, M7, M8 and M20, the module M20 being capable of classifying a signal by level according to a given voice model, these being configured so that a vocal signal, after having been processed by a module M1, is received at the input of a module M2 and at the first respective inputs of the modules M3, M7, M8 and M20, the output signals of the modules M2, M3, M7 and M8 are then transmitted respectively to the second inputs of the modules M3, M7, M8 and M20, the output signal of the module M20 then being representative of a level of the voice model in the vocal signal.
  • the given voice model is advantageously a vocal nasality.
  • vocal nasality criterion C9: this comprises, as shown in figure 17, a module M13 which is a module M20 capable of classifying a signal by level of vocal nasality.
  • This criterion measures the vocal nasality level of a speaker.
  • - voice correctness criterion C12: this comprises, as shown in figure 19, the modules M2, M3, M7 and M16, these being configured so that a vocal signal, after having been processed by a module M1, is received at the input of the module M2 and at the first respective inputs of the modules M3, M7 and M16, the output signals of the modules M2 and M3 are then respectively transmitted to the second inputs of the modules M3 and M7, the output signal of the module M7 is then transmitted to the input of the module M16, the output signal of the module M16 then being representative of a voice pitch difference in the vocal signal relative to a stored voice pitch model.
  • This criterion measures the correctness of the voice relative to a fixed model.
  • - voice intonation criterion C13: this comprises, as shown in figure 21, the modules M2, M3, M7 and M17, these being configured so that a vocal signal, after having been processed by a module M1, is received at the input of the module M2 and at the first respective inputs of the modules M3, M7 and M17, the output signals of the modules M2 and M3 are then respectively transmitted to the second inputs of the modules M3 and M7, the output signal of the module M7 is then transmitted to the input of the module M17, the output signal of the module M17 then being representative of a difference in intonation in the vocal signal relative to a stored intonation model.
  • This criterion measures the separation between the intonation of the voice of the speaker and that of a fixed model.
  • The criteria described above each comprise at least one initial signal processing step.
  • Each of these initial signal processing steps is controlled by a combination of the two modules M2 and M3, these being configured so that at least one vocal signal processed by the criterion in question is received at the input of the module M2 and at the first input of the module M3 respectively, and that the output signal of the module M2 is then transmitted to the second input of the module M3.
  • the output signal of the module M3 then represents a signal representative of a division of the vocal signal into respective silence and speech time zones which is then transmitted to the other modules of the criterion in question.
  • a criterion stripped of the combination of these two modules also forms the subject of the present invention, provided that processing of the vocal signal carried out upstream of the criterion in question makes it possible to deliver a signal representative of a division of the vocal signal into respective silence and speech time zones in a manner substantially identical to that of said combination of the modules M2 and M3.
  • These criteria, and others, may be used individually so as to obtain a quality level of a vocal signal or of a vocal conversation depending on the criterion in question. These criteria, and others, may be used jointly to obtain various quality levels of a vocal signal or of a vocal conversation depending on the criteria in question, and can thus have in the end a set of parameters that define a certain vocal quality.
  • the quality level of a signal or of a vocal conversation according to one or more quality criteria may be measured by evaluating it over time and thus the change in the quality of a signal or of a vocal conversation over the course of time according to the quality criteria in question may thus be seen.
  • an additional step is added after implementation of a given criterion on the basis of one or more vocal signals as input, during which additional step a given module or a given combination of additional given modules, comprising, as input, at least the delivered signal representative of the quality level of the vocal signal according to the given quality criterion and delivering, as output, a signal representative of a diagnostic associated with the quality level according to the given quality criterion represented in the input signal is employed.
  • this additional step it is thus possible to automatically diagnose a vocal condition, according to the quality criterion in question, on the basis of the quality level of the vocal signal, so as to know whether the level is, for example, good, moderate or poor as regards the quality criterion in question.
  • a diagnosis is found after implementation of a transmission of at least one output signal of the quality criterion in question of a vocal signal to the input of a module M6, the stored categories of which are diagnostics associated respectively with quality level intervals according to the quality criterion in question, the output signal of the module M6 is then representative of a diagnostic for which the level interval that is associated with it comprises the quality level of the vocal signal.
  • a vocal tonicity diagnostic is found after implementation of a transmission of signals delivered by the vocal tonicity criterion C6 to a set of modules consisting of the modules M7, M8 and M10, the stored categories used during the comparison step over the course of the implementation of the module M10 are diagnostics delimited by representative quantities of given levels according to the vocal tonicity criterion C6, each quantity depending on an input sound category of the module, the vocal tonicity criterion C6 and the modules M7, M8 and M10 being configured so that the vocal signal is furthermore transmitted to the first respective inputs of the modules M7 and M8, the output signal of the module M3 of the vocal tonicity criterion C6 is furthermore transmitted to the second input of the module M7, the output signal of the module M7 is then transmitted to the second input of the module M8, the output signals of the module M8 and of the module M9 of the vocal tonicity criterion C6 are then respectively transmitted to the second and first inputs of the module M10, the output signal of the module M10 then being representative of a vocal tonicity diagnostic of the vocal signal.
  • a thresholding operation is thus carried out on the vocal tonicity, with threshold levels dependent on a pronounced sound, such as a vowel.
  • a diagnostic signal for a quality criterion of a vocal signal may then be stored in memory and/or transmitted to at least one display means capable of interpreting the vocal diagnostic signal level so as to visibly display the level of the diagnostic.
  • a quality level signal for at least one portion of at least one vocal signal according to a given quality criterion may be stored in memory and/or transmitted to at least one display means capable of interpreting the signal level so as to visibly display the quality level according to the quality criterion to which at least the portion of the vocal signal belongs.
  • a signal representative of a diagnostic of a given criterion is provided following the implementation of a module M1, of the criterion in question (here C), and of a diagnostic module M6 according to the given criterion.
  • the module M6 possesses three types of diagnostic such as, for example: good as 1, moderate as 2 and poor as 3.
  • the person who has emitted the vocal signal may be directed at O1 toward suitable exercises.
  • sensitive tasks T1a, T2a or T3a followed by respective vocal tasks T1b, T2b, T3b represent exercises provided according to whether the diagnostic issued gives, for example, a good, moderate or poor result, respectively.
  • This orientation O1 may, in one particular situation, be carried out automatically by associating with each stored diagnostic at least one proposal of vocal exercises tailored to the stored diagnostic.
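The diagnostic and orientation steps can be sketched as a simple thresholding of the criterion output followed by a lookup of the associated exercise proposal. The threshold values below are placeholders, and only the diagnostic codes and the task labels T1a/T1b, T2a/T2b, T3a/T3b come from the text.

```python
def diagnose(quality_level, thresholds=(0.3, 0.7)):
    """Map a criterion output onto one of the three diagnostics of module M6:
    1 (good), 2 (moderate) or 3 (poor).  Threshold values are placeholders."""
    low, high = thresholds
    if quality_level >= high:
        return 1
    if quality_level >= low:
        return 2
    return 3

# Orientation O1: each stored diagnostic is associated with a proposal of exercises,
# labelled here after the tasks T1a/T1b, T2a/T2b, T3a/T3b mentioned in the text.
EXERCISES = {1: ("T1a", "T1b"), 2: ("T2a", "T2b"), 3: ("T3a", "T3b")}
proposal = EXERCISES[diagnose(0.55)]
```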
  • the signal representative of the diagnostic provided on the basis of at least one portion of at least one vocal signal is then accompanied by the emission of a signal representative of the proposal of vocal exercises that is associated with the diagnostic provided.
  • the signal representative of the proposal of vocal exercises that is associated with the diagnostic provided is transmitted to at least one display means capable of interpreting the level of the signal so as to visibly display the proposal of vocal exercises that is associated with the diagnostic provided.
  • the progress made in the voice over the course of the exercises, on the basis of the quality criterion in question, may then be assessed at A, thus completing at 20 the training procedure.
  • Such vocal analyses may be carried out at one particular time or regularly, thus allowing individuals to be able to test, work on or further control their voices.
  • Diagnostics in the form of warnings in real time may advantageously be produced, so that individuals exercising their voices can be informed immediately or subsequently of any defect in their voices, and can try to correct this after the exercise or in real time.
  • a multi-criterion warning may be calculated by addition of the single-criterion warnings.
  • the analyses and/or the vocal exercises proposed may be carried out locally or remotely, using remote communication means such as the Internet, Minitel, telephone, etc. Examples of exercises tailored to diagnostics issued as output of the vocal analyses are presented below:
  • Example 1: vocal fatigue. A voice fatigued by excessively intensive use, repeated shouting, intensive consumption of tobacco, a psychological shock or a generalized state of fatigue will show increased breathiness in the vocal signal.
  • the voice is not pure, and it is in particular this greater or lesser amount of breath that will lead to several types of remedy according to the following diagnostics:
  • Example 2: vocal presence. The notion of presence of low frequencies in the voice is evoked here. Whatever the vocal register of the person, chest resonances are present. In contrast, a lack of bass resonance in the voice gives an impression of a thin or "green" voice.
  • Several remedies are "prescribed" according to the following diagnostics:
  • the exercises will promote the retention of this presence over the entire register and enrich it, with the speaker monitoring the posture of his body during the exercises.
  • Example 3: speech rate. The criterion evokes the speed of elocution.
  • When the elocution is too slow, the listener/interlocutor flags and becomes irritated. Reading exercises based on phrases or ends of phrases that are simple but repeated more and more quickly after warming up the voice will succeed in increasing the rate of elocution.

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Signal Processing (AREA)
  • Business, Economics & Management (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Multimedia (AREA)
  • Acoustics & Sound (AREA)
  • Data Mining & Analysis (AREA)
  • Quality & Reliability (AREA)
  • Entrepreneurship & Innovation (AREA)
  • Educational Administration (AREA)
  • Educational Technology (AREA)
  • General Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Telephone Function (AREA)
  • Telephonic Communication Services (AREA)

Abstract

Method of analyzing at least one vocal signal, characterized in that it is implemented by elementary signal processing operations managed by respective modules, each module being capable of converting at least one module input signal into a module output signal representative of a given characteristic of the module input signal, and in that it involves the use of a means for processing the signal from a given module or from a given combination of given modules that receive, as input, at least one vocal signal and deliver, as output, a signal representative of at least one quality level of the vocal signal according to a given quality criterion. This invention also relates to a voice control training method and to the protection of quality criteria of a vocal signal as defined in the present document.

Description

ANALYSIS OF THE VOCAL SIGNAL QUALITY ACCORDING TO QUALITY
CRITERIA
FIELD OF THE INVENTION
The present invention relates to a method of analyzing at least one sound signal, in particular for extracting characteristics therefrom.
The aim of the present invention is more particularly to analyze one or more voices taken by themselves or in conversation. BACKGROUND OF THE INVENTION
Over recent decades, technological developments have allowed the field of voice analysis to be advanced, especially by voice signal processing.
Thus, especially by means of increasingly high-performance digital processing, it is possible to isolate certain fundamental characteristics of the voice, such as the fundamental frequency, the harmonics, the partials, the timbre of the voice, the pitch of the voice, the sound volume of the voice, etc, as disclosed for instance in US 5,799,276. Certain methods transform voices by removing characteristics therefrom or by modifying characteristics thereof.
Other methods are suitable for carrying out voice recognition. Certain other methods make it possible to create voices by forming the associated vocal characteristics. Techniques therefore exist for controlling these vocal characteristics, which define a theoretical voice of an average individual.
However, in practice, a voice is not static and it changes according to many somewhat random parameters, such as the time of day, the weather, one's moods, emotions, health, lifestyle, etc. The need to control one's voice, whatever the circumstances, has become increasingly relevant, especially in certain fields in which the vocal instrument assumes a great importance, such as those of television actors, conference speakers, singers, etc.
The need to work on one's voice in order to optimize it, for example for the purpose of producing an effect on one's interlocutor so as to convince him, captivate him or move him, may also be of great use in certain situations. Being able to control these variable parameters may then also serve for implementing more sophisticated voice recognition procedures, useful especially in the field of security, or for carrying out operations on voices or vocal creations closer to reality. Thus, documents GB 2,345,183 and US 4,377,158 propose systems and methods for automatically retrieving vocal information included in a vocal signal.
Thus, US-2002/0010587 teaches us a system, a method and an article attempting to detect edginess in the voice.
Document WO 01 16,938 proposes a system, a method and an article that appear to be capable of detecting certain emotions in a voice.
Document US 6,182,044 discloses a system and a method that seem capable of detecting a vocal performance relative to a predetermined vocal model.
Document US 6,397,185 attempts to retrieve intonation and rhythm in a vocal signal. These techniques seem to define certain vocal criteria which represent complex and changing voice parameters and which can help to give an idea of the condition of a voice at a given instant.
However, these few parameters seem to be insufficient and too isolated to provide a satisfactory diagnosis of the condition of the voice, and in general of the quality of the voice at a given moment.
SUMMARY OF THE INVENTION
A first main objective of the present invention is to measure a quality level of a voice according to one or more voice quality criteria.
A second main objective of the present invention is to measure a quality level of a conversation between various voices according to one or more conversation quality criteria.
A third objective is to diagnose the condition of a voice according to the measured quality levels of a voice.
A fourth objective is to choose exercises tailored to the diagnostics provided. In particular to achieve these objectives, the invention provides a method of analyzing at least one vocal signal, characterized in that it is implemented by elementary signal processing operations managed by respective modules, each module being capable of converting at least one module input signal into a module output signal representative of a given characteristic of the module input signal, and in that it involves the use of a means for processing the signal from a given module or from a given combination of given modules that receive, as input, at least one vocal signal and deliver, as output, a signal representative of at least one quality level of the vocal signal according to a given quality criterion.
BRIEF DESCRIPTION OF THE DRAWINGS
Further aspects, objects and advantages of the present invention will become more clearly apparent upon reading the following detailed description of a preferred embodiment of the invention, given by way of non-limiting example and with reference to the appended drawings in which:
- figure 1 shows a list of elementary vocal signal processing modules according to the invention;
- figure 2 shows a list of criteria based on the quality of a vocal signal according to the invention;
- figure 3 shows a diagram of a modular configuration of a speech content criterion according to the invention;
- figure 4 shows a diagram of a modular configuration that can provide a diagnostic of the vocal state of a vocal signal according to a speech content criterion according to the invention;
- figure 5 shows a diagram of a modular configuration of a common long silence content criterion according to the invention;
- figure 6 shows a diagram of a modular configuration that can provide a diagnostic of the state of a conversation between vocal signals according to a long silence content criterion according to the invention;
- figure 7 shows a diagram of a modular configuration of a criterion based on the number of long silences of a given vocal signal according to the invention;
- figure 8 shows a diagram of a modular configuration capable of providing a diagnostic of the state of a conversation between vocal signals according to a criterion based on the number of long silences of a given vocal signal according to the invention;
- figure 9 shows a diagram of a modular configuration of a criterion based on the number of speech interruptions of a first signal according to the invention;
- figure 10 shows a diagram of a modular configuration that can provide a diagnostic of the state of a conversation between vocal signals according to a criterion based on the number of speech interruptions of a first signal according to the invention;
- figure 11 shows a diagram of a modular configuration of a speech rate criterion according to the invention;
- figure 12 shows a diagram of a modular configuration that can provide a diagnostic of a vocal signal according to a speech rate criterion according to the invention;
- figure 13 shows a diagram of a modular configuration of a vocal tonicity criterion according to the invention;
- figure 14 shows a diagram of a modular configuration that can provide a diagnostic of the vocal state of a vocal signal according to a vocal tonicity criterion according to the invention;
- figure 15 shows a diagram of a modular configuration of a vocal presence criterion according to the invention;
- figure 16 shows a diagram of a modular configuration that can provide a diagnostic of the vocal state of a vocal signal according to a vocal presence criterion according to the invention;
- figure 17 shows a diagram of a modular configuration of a vocal nasality criterion according to the invention;
- figure 18 shows a diagram of a modular configuration that can provide a diagnostic of a vocal signal according to a vocal nasality criterion according to the invention;
- figure 19 shows a diagram of a modular configuration of a voice correctness criterion according to the invention;
- figure 20 shows a diagram of a modular configuration that can provide a diagnostic of the vocal state of a vocal signal according to a voice correctness criterion;
- figure 21 shows a diagram of a modular configuration of a voice intonation criterion according to the invention;
- figure 22 shows a diagram of a modular configuration that can provide a diagnostic of the vocal state of a vocal signal according to a voice intonation criterion according to the invention; and
- figure 23 shows a voice control training method according to the invention.
DETAILED DESCRIPTION OF THE INVENTION
A sound signal is a continuous acoustic pressure wave propagating in time and in space, generated by a sound source.
A vocal signal is a sound signal emitted directly or indirectly by a human being or by an animal. For the purpose of the invention, the sound signals to be studied relate particularly to those emitted by a human being. The vocal source to be analyzed may be:
- vibrations of the vocal cords of one or more individuals, therefore emitting a voice directly; or
- the play-back of a voice recording; or
- a vocal signal obtained after an artificial vocal creation, that is to say based on non-living devices or instruments capable of creating human voices.
In the second case, the recording may be produced on any recording medium, such as an audio tape, a CD ROM, a hard disk, a floppy disk, etc. The recording format may be analog or digital, such as for example the WAV digital format.
In the case of a vocal source giving an analog vocal signal, the analog signal is denoted by S(t) and is a real signal being output continuously within the time period between 0 and T, measuring the acoustic pressure emitted by one or more vocal sources at each instant t.
This analog vocal signal may, for example, be received by an acoustic microphone which then converts the acoustic information into electrical information in order to be able consequently to carry out signal processing using electrical and/or electronic means such as electronic processors and memories.
The processing of the signal may then be carried out in an analog manner or digitally.
In the examples that will be described below, we will examine cases of digital signal analysis. However, the invention is not in any way limited to this type of analysis and can also extend to analog analyses of vocal signals.
To digitize an analog vocal signal, the technique widely employed is analog signal sampling, the samples being advantageously taken over time in a regular fashion, each time interval separating two consecutive signal sampling events being defined by a sampling period Te, a sampling frequency Fe being equal to 1/Te; the sampled signal, denoted by s, is then defined by: s(k) = S(k·Te), k representing a set of positive integers between 0 and K-1, K being an integer giving the number of sampled points, of temporal extent K·Te. The chosen sampling frequency within the context of the invention is preferably 8000 Hz or 11025 Hz in order to have a satisfactory resolution of a human voice.
The signal analysis of the present invention is usually carried out locally; analyses will therefore be preferentially carried out on signal portions that will be isolated within weighting windows. To isolate a portion of the signal, the signal is multiplied by a compact support function, which is, more precisely, zero outside the time interval of study, also called a weighting function w(k), k representing a set of positive integers between 0 and M-1 and M being an integer giving the number of points contained within the weighting window of temporal extent M·Te.
The instants of signal analysis are denoted by t_l, l representing a set of positive integers between 0 and L-1 and L being an integer giving the number of instants of analysis.
In the case of analyses regularly spaced apart, Ta denotes the period of analysis and Fa = 1/Ta denotes the frequency of analysis.
The number of points separating two successive instants of analysis is Δ=Ta/Te. Advantageously, Δ is less than or equal to M so as to have at least one analysis per weighting window.
The instant of analysis is preferably chosen as the middle of the weighting window; in this case, the instants are defined by: t_l = (lΔ + (M-1)/2)·Te. Such an analysis, called a short-term analysis, gives, on the basis of the sampled signal s(k), a series of temporal bounded-support signals, called frames, defined by: s_l(k) = w(k)·s(lΔ + k), k representing a set of positive integers between 0 and M-1; l representing a set of positive integers between 0 and L-1; M being the size of each frame; w being the form of the weighting window;
Δ being the shift between two successive frames (in terms of number of points); if Δ = M, each point of the signal s is in a single frame.
The frames are centered on the instants of analysis: t_l = (lΔ + (M-1)/2)·Te. An analysis may, for example, use the following analytical parameters:
- w(k) is of the Hanning function type, i.e. w(k) = ½·(1 - cos(2πk/M));
- the signal quantity in each frame is M·Te = 0.04 s, i.e. M = 441 points; and
- frame overlap duration: 0 s, i.e. Δ = M. Therefore, Ta = 0.04 s and Fa = 25 Hz are obtained.
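Purely by way of illustration (and not forming part of the claimed method), a short-term analysis of this kind could be sketched as follows in Python with NumPy, assuming a sampled signal s and the preferred parameters Fe = 11025 Hz, M = 441 and Δ = M; the function name short_term_frames and the random test signal are illustrative assumptions only:

```python
import numpy as np

def short_term_frames(s, M=441, delta=441):
    """Cut a sampled signal s into weighted frames s_l(k) = w(k) * s(l*delta + k)."""
    # Hanning weighting window w(k) = 0.5 * (1 - cos(2*pi*k / M))
    w = 0.5 * (1.0 - np.cos(2.0 * np.pi * np.arange(M) / M))
    L = (len(s) - M) // delta + 1                       # number of analysis instants
    return np.stack([w * s[l * delta : l * delta + M] for l in range(L)])

# Example: 2 s of an arbitrary signal sampled at Fe = 11025 Hz
Fe = 11025
s = np.random.randn(2 * Fe)
frames = short_term_frames(s)
print(frames.shape)   # (L, 441): one frame every Ta = M*Te = 0.04 s
```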
The digital signal is then processed and analyzed directly, or is recorded in an electrical or electronic memory to be analyzed later.
The analysis of a vocal signal does not refer only to temporal analysis of the vocal signal, but also to frequency analysis.
A short-term frequency analysis of the signal is advantageously carried out by applying, to the temporal frames, a Fourier transform, also called FFT. A frequency signal s_l at a given instant of receiving the vocal signal is then obtained:
s_l(n) = Σ_{k=0}^{M-1} e^{-j2πnk/N} s_l(k), n representing a set of positive integers between 0 and N-1;
N being an integer giving the number of points of the FFT; and s_l(n) representing the frequency signal analyzed at the frequency f_n, f_n = n·Fe/N. The frequency resolution, or frequency spacing, of the signal is given by the expression Fe/N.
If N is greater than the number of points of the frame M, the frame is advantageously supplemented with zeros until obtaining the N points needed for the calculation of s_l(n). The modulus |s_l(n)| represents the mean intensity over frame l of the frequency n·Fe/N, and constitutes the spectrum of the signal.
A logarithmic scale is then commonly employed to represent this spectrum in decibels, namely: 20·log10|s_l(n)|.
The parameters of the frequency analysis are preferably the following:
- N = 4096;
- Fe = 11025 Hz.
A frequency spacing Fe/N equal to 2.6 Hz is then obtained, which is a value small enough to make it possible to distinguish, within the spectrum of the similar vocal frequencies, a frequency of a human voice, which may possibly vary from about 70 to about 1100 Hz.
A representation in terms of gray levels of the signal, with the instants of analysis t_l plotted on the x-axis and the frequencies n·Fe/N plotted on the y-axis and the amplitudes in dB represented in gray levels will be called here a "spectrogram".
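A corresponding, purely illustrative sketch of this short-term frequency analysis applies a zero-padded FFT to each frame and converts the amplitudes to decibels; the helper name spectrogram_db and the eps regularization term are assumptions, not part of the original text:

```python
import numpy as np

def spectrogram_db(frames, N=4096, eps=1e-12):
    """Zero-pad each frame to N points, take the FFT and return amplitudes in dB."""
    # s_l(n) = sum_k exp(-j*2*pi*n*k/N) * s_l(k); only the first N/2 bins are kept
    spectra = np.fft.fft(frames, n=N, axis=1)[:, : N // 2]
    return 20.0 * np.log10(np.abs(spectra) + eps)   # 20*log10|s_l(n)|

# Frequency axis: f_n = n * Fe / N, with a spacing Fe/N of about 2.7 Hz for Fe = 11025 Hz
```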
The spectral signal and the temporal signal obtained directly from the original vocal emission then constitute the raw material on the basis of which signal analyses will be carried out in order to extract the desired characteristics therefrom.
The signal analysis methodology that will be used here is based on elementary signal processing steps controlled by respective modules.
A module, stored in memory, usually represents an algorithm for converting at least one input signal into an output signal representative of a given characteristic of the input signal.
An electrical or electronic device, such as a processor, is advantageously used in the signal analysis method to recover the vocal signals, carry out analytical calculations on the signals on the basis of the modules stored in memory and recover the signals representative of information coming from the vocal analysis calculations in order to store this information in memory and/or send it to a communication means capable of communicating this information to a person in a format comprehensible by that person, such as a graphics display format using a screen as medium.
Figure 1 shows a module identified by a number (e.g. M1) which will be adopted in the rest of the document. The description of the modules is of the input/output type, namely inputs on the left of the module and output on the right of the module.
In the following paragraphs, we will give a few modules advantageously used in a method according to the invention:
- digitized signal access module M1:
Whether the vocal signal was emitted in an analog fashion or was output from a digital recording, the analysis of the signal necessarily starts by using this module M1. This use of the module allows the vocal signal to be processed in order to have, as output from the module, a digital acoustic pressure signal, propagating over discretized time, characterized by a sampling frequency.
Advantageously, the discretized time has its values within the real interval [-1;1].
- background-noise and speech level estimation module M2, the use of which comprises the steps consisting in: receiving, as module input, a vocal signal; delivering, as module output, a signal representative of at least one maximum background noise level threshold and a minimum speech level threshold of the vocal signal received as module input. The minimum speech level threshold is generally found from the background noise level threshold increased by a certain value, which may be zero in certain cases. The sole signal estimate remaining to be made is then an estimate of the background noise.
The estimation of the background noise is a necessary step in order to be able to distinguish, in a vocal signal, "that which is heard from that which is not heard".
The expression "that which is heard" means here that which emerges sufficiently from the background noise.
In a first method of determining the background noise, the background noise is estimated from a recording without any voice.
This recording is advantageously used shortly before the start of the emission of the vocal signal that it is desired to analyze and under substantially the same conditions so that the background noise does not change significantly, and therefore so that the background noise data recorded is substantially identical to the background noise data of the vocal signal.
The recorded noise signal, denoted S_b(t), with a time parameter t that is between 0 and T, is advantageously digitized using the method described above, delivering a digital temporal signal s_b(k) and a digital frequency signal s_b,l(n). It should be noted that the background noise measurement time T must be long enough for the statistics employed to be meaningful.
The background noise, denoted by bdf(n), is advantageously estimated as a maximum envelope of the spectrum, frequency by frequency. The background noise bdf(n) is then especially a function of:
- the mean amplitude of the L frames of the spectrum at the frequency n·Fe/N, which is denoted by m_b(n) and given by:
m_b(n) = (1/L) · Σ_{l=0}^{L-1} |s_b,l(n)|;
- the standard deviation of the amplitude of the L frames of the spectrum at the frequency n·Fe/N, which is denoted by σ(n) and is given by:
σ²(n) = (1/L) · Σ_{l=0}^{L-1} (|s_b,l(n)| - m_b(n))².
The expression for the background noise at the frequency n·Fe/N is then advantageously given by:
bdf(n) = m_b(n) + a·σ(n),
a being a multiplying coefficient, which is to be chosen. It may in particular have to be linked to a certain threshold value of a Gaussian distribution.
For example, a multiplying coefficient of 2 is advantageously linked to a threshold of a Gaussian distribution in which 2.5% of the samples exceed this threshold.
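The first method of determining the background noise could be sketched as follows, assuming the magnitude spectra of the frames of a voice-free recording are available as a NumPy array of shape (L, number of frequency bins); the reading bdf(n) = m_b(n) + a·σ(n) of the reconstructed formula above, as well as the function name, are assumptions made only for illustration:

```python
import numpy as np

def estimate_background_noise(noise_frames_fft_mag, a=2.0):
    """Estimate bdf(n) = m_b(n) + a * sigma(n) from the magnitude spectra
    of a voice-free recording (array of shape (L, n_bins))."""
    m_b = noise_frames_fft_mag.mean(axis=0)    # mean amplitude per frequency bin
    sigma = noise_frames_fft_mag.std(axis=0)   # standard deviation per frequency bin
    return m_b + a * sigma                     # envelope of the noise spectrum
```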
In a second method of determining the background noise, the background noise is estimated directly on the recording of the vocal signal, and not on a separate recording without a voice, as in the case of the first method of determining the background noise.
To do this, a first range of the recording of the vocal signal contains a recording of the silence, as it had been done during the first method of determining the background noise, for a typical duration of a few seconds, followed directly by a recording of the signal containing the vocal information in a second range of the recording. A first step of determining the background noise consists in separating the silence range from the non-silence range.
A second step of determining the background noise is then identical to the first method of determining the background noise. - silence zone and speech zone segmentation module M3 the use of which comprises the steps consisting in:
> receiving, as a first module input, a vocal signal and, as a second module input, a signal representative of respective background noise and speech thresholds of a vocal signal; and > delivering, as module output, an output signal representative of a division of the input vocal signal into respective silence and speech time zones, the silence being at least partly defined by the background noise, the output signal having a given signal level for the silence zones and another given signal level for the speech zones . The output signal is advantageously a binary signal, for example with a 0 signal level assigned to the silence zones and a 1 signal level assigned to the speech zones.
Once the background noise has been determined, this module is therefore used to recognize the silence zones from the speech zones in the vocal signal.
The zones of the temporal signal that have an amplitude and/or an intensity above a defined threshold value or several defined threshold values are regarded as constituents of the vocal information.
The other zones of the temporal signal are regarded as silence zones in the vocal signal.
The module thus acts as a vocal signal filter, especially with reference to the background noise signal (thus representing a "silence" reference in the vocal signal) in order to distinguish the spoken sound from its noise sound, and thus separating the speech zones from the silence zones.
Analyses carried out after segmentation of the signal into speech and silence zones may thus be created and carried out, such as signal overlap, silence duration and speech duration analyses, or other analyses for identifying, for example, speech zones that would correspond in fact to noise zones, such as for example smacking of the lips, and noise zones that would correspond to speech zones.
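One possible, deliberately simplistic, reading of module M3 is sketched below: a frame is declared a speech zone as soon as some part of its spectrum emerges above the background noise threshold. The margin parameter and the function name segment_speech are illustrative assumptions:

```python
import numpy as np

def segment_speech(frames_fft_mag, bdf, margin_db=0.0):
    """Return a binary signal per frame: 1 for speech zones, 0 for silence zones.
    A frame counts as speech if any spectral bin exceeds the background noise
    threshold raised by margin_db decibels (simplistic illustration of M3)."""
    threshold = bdf * (10.0 ** (margin_db / 20.0))
    return np.any(frames_fft_mag > threshold, axis=1).astype(int)
```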
Thus, modules such as the following five modules may be used after module M3: - level occupancy module M4A, the use of which comprises the steps consisting in: receiving, as module input, a temporal signal divided amplitudewise into at least two levels; and
> delivering, as module output, an output signal representative of the temporal occupancy of the temporal signal in a given signal level.
Thus, it is possible to obtain, for example, a duration of a binary input signal sent at the 1 level relative to the total duration of the input signal.
If the 1 level corresponds to the speech zones of a vocal signal, the module output signal is then representative of the speech content in the vocal signal. - common level occupancy module M4B, the use of which comprises the steps consisting in: receiving, as at least two module inputs, at least two respective temporal signals, each divided amplitudewise into at least two levels; and
> delivering, as module output, an output signal representative of the simultaneous temporal occupancy of the temporal signals in a given signal level.
It is thus possible to obtain, for example, a time spent simultaneously by two binary input signals at the 0 level relative to the total duration of the input signals.
If the 0 level corresponds to the silence zones of the vocal signals, the module output signal is then representative of the amount of silence occupied in common by the vocal signals.
- module M4C for the number of long level intervals, the use of which comprises the steps consisting in: receiving, as at least two module inputs, at least two respective temporal signals, each divided amplitudewise into at least two levels; and delivering, as module output, an output signal representative of the number of long time intervals within a given signal level of at least one temporal signal, an interval becoming long, on the basis of a stored threshold interval value, after a time interval of at least one other temporal signal in a level other than the given level.
It is thus possible to obtain, for example, in the case in which there are two binary input signals, a number of long intervals of 0 level of a first signal preceded by a level 1 interval of the second signal.
If the 0 level corresponds to the silence zones and the 1 level corresponds to the speech zones of the two vocal signals, the module output signal is then representative of the number of long silence intervals of the first signal that follow the speech intervals of the second signal.
- module M4D for the number of level overlaps, the use of which comprises the steps consisting in: > receiving, as at least two module inputs, at least two respective temporal signals, each divided amplitudewise into at least two levels; and delivering, as module output, an output signal representative of the number of time intervals for which at least two signals have respectively the same given signal level, at least one of these signals not having this given level after the interval and at least one other of these signals not having this given level after the interval.
It is thus possible to obtain, for example, in the case in which there are two binary input signals, the recorded overlaps relating to the intervals of a first signal terminating when a 1 level interval of the second signal has started. If the 0 level corresponds to the silence zones and the 1 level corresponds to the speech zones of the two vocal signals, the module output signal is then representative of the number of speech overlaps of the first and second signals.
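The level-occupancy modules M4A and M4B described above lend themselves to a very short sketch on binary silence/speech signals; the helper names and the toy signals P1 and P2 below are illustrative assumptions, not the patented implementation:

```python
import numpy as np

def level_occupancy(binary_signal, level=1):
    """Module M4A sketch: fraction of time spent at `level` (e.g. speech content)."""
    return float(np.mean(np.asarray(binary_signal) == level))

def common_level_occupancy(signals, level=0):
    """Module M4B sketch: fraction of time all signals are simultaneously at `level`
    (e.g. silences common to the interlocutors of a conversation)."""
    stacked = np.vstack(signals)
    return float(np.mean(np.all(stacked == level, axis=0)))

# Example with two binary speech/silence signals P1 and P2
P1 = np.array([1, 1, 0, 0, 1, 0])
P2 = np.array([1, 0, 0, 1, 1, 0])
print(level_occupancy(P1, level=1))           # speech content of P1: 0.5
print(common_level_occupancy([P1, P2], 0))    # common silence content: 1/3
```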
- module M5 for segmentation of the steady zones, the use of which comprises the steps consisting in: > receiving, as a first module input, a vocal signal and, as a second module input, a signal representative of the division of a vocal signal into respective silence and speech time zones; and delivering, as module output, an output signal representative of a division of the input vocal signal into steady and nonsteady zones - a zone of the temporal vocal signal is steady if the portion of the signal that it contains is sufficiently distinct from the portions of the signal that are adjacent to the zone, and especially if there is a sufficient break between characteristics of the signal contained in the zone as zone output and/or input and characteristics of the portions of the signal that are adjacent to the zone, and such a break is sufficient if it is greater than a stored threshold break value, the output signal consisting of the vocal input signal with a given signal level replacing the silence zones and the nonsteady zones.
This module therefore identifies the steady zones of the signal by statistical estimation of the model break type.
The stored model may be an identification of a sound or of a voice pitch or the like.
This module is used in particular to identify phonemes in a vocal signal.
After the abovementioned modules have differentiated the speech zones from the silence zones of the vocal signal and possibly determined the behavior and durations of the various zones, those parts of the vocal signal corresponding to the speech zones may then be analyzed so as to determine the vocal quality of this signal.
This is carried out in particular by the following modules:
- sound pitch module M7, the use of which comprises the steps consisting in: receiving, as a first module input, a vocal signal and, as a second module input, a signal representative of a division of a vocal signal into respective silence and speech time zones; and delivering, as module output, an output signal representative of the respective local fundamental frequencies of each speech zone of the vocal signal. The sound pitch corresponds to the fundamental frequency perceived at each instant.
This module detects the pitch in the various time frames of each speech zone. The processing associated with this module advantageously takes place in two steps:
- instant-by-instant detection of the fundamental frequency and of its amplitude, preferably using a probabilistic method; and
- elimination of the points that include pitch characteristics but are not pitches. The first processing step comprises firstly detection of the partials, each partial being a sinusoidal temporal component of the vocal signal represented by spectral lines. It should be noted that the spectral lines are broad and may also possess secondary lobes as a result of convolutions of the temporal signal by the weighting function chosen for the analysis. Detection of the partials takes account of:
- the "background noise" data; and
- the spectrogram of the vocal signal.
The center of a partial is defined here by a strict local maximum of the spectrum which: - emerges sufficiently above the background noise;
- is sufficiently high relative to the highest partial of the spectrum; and
- is not masked by the other centers of the partials.
If such a maximum does not exist, the partial does not exist.
The start of the partial generally corresponds to the lowest local minimum to the left of the center of the partial within a size limit imposed by the width of the weighting window.
If no local minimum is found, the start is advantageously denoted as being the boundary point of the weighting window. The algorithm used in this module uses especially curve mask techniques to isolate the partials.
A peak of a partial is thus characterized by:
- a peak start index; - a peak middle index;
- a peak end index; and
- a peak mid-height.
The input data for the algorithm are:
- a signal x(n), n being an integer between 0 and N-1, constituting the amplitude of a spectrum of a frame of the temporal vocal signal, x(n) advantageously being expressed in decibels (i.e. x(n) = 20·log10|s_l(n)|);
- a reference base y(n), n being an integer between 0 and N-1, being a spectrum constituting a floor value for detecting the peaks of the partials, and taking at least partly the spectrum of the background noise, y(n) advantageously being expressed in decibels (i.e. y(n) = 20·log10 bdf(n)); and
- a mask z(n), n being an integer between 0 and N-1, initialized to minus infinity or to a value that is negative and high in absolute value, which takes into account the amplitude of the mask induced by each peak detected, z(n) advantageously being expressed in decibels. Advantageously, the vocal signal x(n) as module input is smoothed.
After the smoothing operation, various envelopes representing signals that can constitute the vocal signal are then sought.
Finally, the envelope having the maximum amplitude is determined, which will define the main lobe or peak of analysis weighting window, which will constitute a reference in the rest of the analysis.
The rest of the analysis depends in particular and advantageously on the following algorithm parameters: - [Fmin ; Fmax] which is a desired frequency interval of the partials bounded by a minimum frequency (Fmin) and a maximum frequency (Fmax); the interval is preferably chosen to represent the entire frequency band available;
- E, which is the minimum emergence of a peak above the reference signal y; the minimum emergence is preferably zero;
- D, which is a half-width, that is to say the distance separating the start (or the end) of the peak from its center; it is preferably set as the half-width of the main lobe of the FFT of the weighting window;
- A, which is the attenuation of the mask at the distance D from the center of the peak; the attenuation is preferably set by the attenuation of a secondary lobe with respect to the main lobe of the FFT of the weighting window increased by 5 dB;
- P, which is the multiplicative slope in dB/octave of the mask of each peak; it is preferably set as being the attenuation slope of the secondary lobes and thus depends in general on the weighting window; and - H, which is the minimum height of a peak relative to the highest peak; the minimum height is preferably set at 60 dB - a peak whose height deviates by more than 60 dB from the height of the main peak is therefore not adopted.
The steps carried out by the algorithm may be, for example, the following, in succession:
1. For each peak, a search is made from the left of the peak toward its right (for n varying from a value corresponding to Fmin to a value corresponding to Fmax):
1.1. If n is a strict local maximum of x which emerges sufficiently from the base (that is to say x(n) > y(n) + E) and from the mask (that is to say x(n) > z(n)), then:
1.1.1. n is adopted as being the middle of a peak;
1.1.2. the start of the peak is then sought, starting from the middle of the peak, without going beyond the half-width, for an integer j varying from n - 1 to n - D:
1.1.2.1. if j is a local minimum of the signal x, j is the start of the peak;
1.1.2.2. otherwise, if j is sufficiently attenuated relative to n (that is to say x(j) < x(n) + A), j is the start of the peak;
1.1.3. if the start of the peak has not been found, then it is set at a distance of a half-width from the middle of the peak (that is to say at j = n - D);
1.1.4. a symmetrical methodology is advantageously applied in order to find the end of the peak; its index is then denoted by k;
1.1.5. the mask is reset only away from the peak found, the new mask being the maximum between the old mask and the expected attenuation on the secondary lobes of the peak (partial) detected. This attenuation, equal to A at a distance D from the middle of the peak, possesses a slope of P (in dB per octave) and is symmetrical with respect to the middle of the peak; and
1.1.6. a new peak is sought (step 1.1) after the end of the peak detected (that is to say with n = k + 1);
1.2. Otherwise, a new peak is sought (by repeating step 1.1, with n = n + 1);
2. Finally, only the peaks that emerge from the final mask (which is the final signal z calculated), and the height of which (that is to say the value of the middle of the peak) is at most H from the highest peak, are adopted.
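A heavily simplified sketch of this partial detection is given below; it keeps only the emergence test and the height test of the algorithm and deliberately omits the mask, half-width and slope handling, so it is an illustration rather than the algorithm itself:

```python
import numpy as np

def detect_partials(x_db, y_db, E=0.0, H=60.0):
    """Simplified sketch: keep strict local maxima of the spectrum x (in dB) that
    emerge from the reference base y (background noise) by at least E dB and whose
    height is within H dB of the highest retained peak."""
    x_db, y_db = np.asarray(x_db), np.asarray(y_db)
    centers = [n for n in range(1, len(x_db) - 1)
               if x_db[n] > x_db[n - 1] and x_db[n] > x_db[n + 1]   # strict local maximum
               and x_db[n] > y_db[n] + E]                           # emerges from the base
    if not centers:
        return []
    top = max(x_db[n] for n in centers)
    return [n for n in centers if x_db[n] >= top - H]               # within H dB of the top
```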
The first processing phase is then based on a family of partials of a spectrum of the vocal signal, on the basis of which the module M7 carries out the following steps:
Extracted firstly from this family is a sufficiently energetic and populated subfamily, representative of the main harmonics of the human voice. To do this, the partials that emerge from the background noise by at least a value E1, typically equal to 5 dB, are selected. If this selection contains less than a defined minimum number of partials, typically equal to 3, or if the selection contains no partials emerging from the background noise by a value of at least E2, typically equal to 20 dB, then it is considered that the spectrum analyzed contains no pitch.
In the opposite case, it is this subfamily that is worked upon thereafter. An energy of the partials of the subfamily is then calculated.
Thus, for example, an energy threshold reference is set equal to 0 for the lowest partial and an energy ceiling reference is set equal to 1 for the highest partial, the height of the partial being found at the center of the partial, the respective energies of the other partials then lying between these two references.
For a partial to be considered as the partial corresponding to a fundamental frequency, taken at the center of the partial and denoted by f0, it must satisfy certain conditions.
These conditions are preferably the following: - the energy of the partial exceeds a threshold value, typically equal to 0.7 if the energies of the partials are considered to be between 0 and 1;
- the hypothetical fundamental frequency does not possess a subharmonic of rank 1; a subharmonic is of rank 1 if a partial containing f0/2 exists, the center of which is located less than a certain frequency difference from f0/2, typically equal to 3 Hz, and the energy of which differs from the energy of the partial of the hypothetical fundamental frequency by less than a certain energy difference, typically equal to 20 dB; and
- the hypothetical fundamental frequency does not possess a superharmonic of rank 1; a superharmonic is of rank 1 if a partial containing f0*2 exists, the center of which is located less than a certain frequency difference from f0*2, typically equal to 3 Hz, and the energy of which differs from the energy of the partial of the hypothetical fundamental frequency by less than a certain energy difference, typically equal to 20 dB.
If such a partial does exist, the first one (that is to say that representing the lowest frequency) is adopted and the fundamental is declared to be present. Next, for each partial of the total family of starting partials, its rank in the harmonics is then calculated (0 if no harmonic, k if k*f0 is contained in the partial).
Finally, the pitch is re-estimated by interpolation of the positions of the centers of the "harmonic" partials in the ranks of these harmonics. Firstly, the harmonics of rank below a certain value, typically 10, are selected. If the number of harmonics selected is less than a certain value, the pitch is not re-estimated. Otherwise, a re-estimation of the pitch is carried out.
This re-estimation of the pitch may, for example, be effected by assigning to f0 the value: f0 = a + b, where a and b are the coefficients of the following linear regression of the frequencies of the selected harmonics according to their rank: y_i = a·x_i + b + ε, with: y_i: frequency of the selected harmonic; x_i: rank of the selected harmonic; ε: width of a partial of rank k between frequencies [F1; F2], which therefore corresponds to the limit of the permitted deviation of the value of y relative to the theoretical value that would be found by applying the linear equation. The partial therefore contains the harmonic of rank k if k·a + b ∈ [F1; F2]. The ranks of the partials are then re-estimated.
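The re-estimation of the pitch by linear regression of the harmonic frequencies against their ranks can be sketched as follows (illustrative only; the function name and the example values are assumptions):

```python
import numpy as np

def reestimate_pitch(harmonic_freqs, harmonic_ranks):
    """Re-estimate f0 by linear regression of harmonic frequencies against ranks:
    y = a*x + b, then f0 = a + b (the frequency predicted for rank 1)."""
    x = np.asarray(harmonic_ranks, dtype=float)
    y = np.asarray(harmonic_freqs, dtype=float)
    a, b = np.polyfit(x, y, 1)     # least-squares fit of y = a*x + b
    return a + b

# Example: slightly noisy harmonics of a voice around 220 Hz
print(reestimate_pitch([221.0, 439.0, 662.0, 879.0], [1, 2, 3, 4]))  # ~220 Hz
```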
A second phase of the processing carried out by module M7 consists in removing overall nonconforming points and local nonconforming points.
Points are considered as being nonconforming relative to a predetermined norm, which may be an overall norm (that is to say over all the analysis windows) or a local norm (that is to say only over a single analysis window).
In an example establishing an overall norm, the mean m and the standard deviation σ of the pitches of the vocal signal, expressed in semitones relative to 440 Hz, that are obtained on a temporal family of spectra are calculated. These statistics are advantageously calculated after eliminating the X highest values and the Y lowest values, X and Y typically and respectively being equal to 0 and 1.
Next, an acceptance threshold is calculated: the accepted x values are then those for which |x - m| ≤ a·σ, a being a predetermined coefficient advantageously chosen depending on the type of sound that is expected of the signal, or depending on a more ad hoc distribution model than the Gaussian model; a is typically equal to 4. One solution consists in constituting hard thresholds, corresponding to sound pitches that cannot be attained by a human being, or cannot be obtained owing to the profile of the speaker/singer, or that cannot be obtained owing to the demand placed on the speaker/singer.
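The removal of the overall nonconforming points could be sketched as follows, assuming the pitches are already expressed in semitones; the trimming of the extreme values before computing the statistics follows the typical X = 0 and Y = 1 mentioned above, and the function name is an assumption:

```python
import numpy as np

def remove_global_outliers(pitches_semitones, a=4.0, drop_low=1, drop_high=0):
    """Keep only the pitch values within a standard deviations of the mean,
    the statistics being computed after dropping the Y lowest and X highest values."""
    p = np.sort(np.asarray(pitches_semitones, dtype=float))
    trimmed = p[drop_low : len(p) - drop_high]
    m, sigma = trimmed.mean(), trimmed.std()
    return [x for x in pitches_semitones if abs(x - m) <= a * sigma]
```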
Removal of the local nonconforming points makes it possible, for its part, to remove false pitch detections of the f0/2 or 2·f0 type.
To do this, a method proposed here consists in examining, in sliding time slots, the pitches detected. The nonconforming points are identified by comparing the scanning window with a left-hand window (located immediately to the left of the scanning window) and a right-hand window (located immediately to the right of the scanning window). In order for there to be a local nonconforming point, it is then necessary for:
- the left-hand window to contain sufficient detected pitch; and
- the left-hand window to be stable; and
- the right-hand window to contain sufficient detected pitch; and
- the right-hand window to be stable; and
- the central window value to be far from the left-hand and right-hand values.
The processing described above for detecting the pitch is satisfactory in that when the pitch is declared to be detected by the algorithm, it corresponds very often to a heard pitch.
It is also possible to furthermore carry out processing corresponding to a declaration of the instants (or frames) in which the pitch is almost assuredly absent. This makes it possible to optimize in particular the detection of notes. - harmonic energy distribution module, the use of which comprises the following steps consisting in:
> receiving, as a first module input, a vocal signal and, as a second module input, a signal representative of a division of a vocal signal into respective silence and speech time zones; and
> delivering, as module output, an output signal representative of an energy distribution according to the harmonics of the speech zones of the vocal signal.
This module identifies the energy distribution of each speech zone according to the various harmonics detected.
The mean energy of a speech zone of the vocal signal, dedicated to the harmonics, is the energy of the signal of the useful frequencies located within the harmonic partials, the useful frequencies of a speech zone being those of the parameterizable band [Fmin; Fmax]. The mean is advantageously calculated over substantially all the spectra having a pitch. The signal is thresholded to zero below the background noise. The energy is an L2 norm on a linear spectrum (abs(FFT)). More precisely:
- if |s_l(n)| and bdf(n) denote the respective amplitudes of the vocal signal and of the background noise of frame l at the frequency f_n; and
- if H_l denotes the union of the frequency intervals corresponding to the set of harmonic partials of frame l of the signal; then x_l(n), the component of the signal s emerging above the background noise, is defined by:
x_l(n) = |s_l(n)| if |s_l(n)| > bdf(n), and x_l(n) = 0 otherwise;
the total energy e(l) is defined by:
e(l) = Σ_{n : fn ∈ [Fmin; Fmax]} x_l(n)²;
the energy of the harmonics e_H(l) is defined by:
e_H(l) = Σ_{n : fn ∈ H_l ∩ [Fmin; Fmax]} x_l(n)².
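For one frame, the ratio e_H(l)/e(l) suggested by these definitions could be computed as in the following sketch, assuming the indices of the harmonic bins and of the useful band [Fmin; Fmax] are available; the function name and argument names are illustrative assumptions:

```python
import numpy as np

def harmonic_energy_ratio(frame_fft_mag, bdf, harmonic_bins, useful_bins):
    """Ratio of the energy of the harmonic partials e_H(l) to the total energy e(l),
    using the component x_l(n) emerging above the background noise."""
    x = np.where(frame_fft_mag > bdf, frame_fft_mag, 0.0)      # x_l(n), thresholded to zero
    e_total = np.sum(x[np.asarray(useful_bins)] ** 2)          # e(l), L2 norm on linear spectrum
    e_harm = np.sum(x[np.intersect1d(harmonic_bins, useful_bins)] ** 2)  # e_H(l)
    return e_harm / e_total if e_total > 0 else 0.0
```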
- sound volume module M18, the use of which comprises the steps consisting in: receiving, as module input, a vocal signal;
> delivering, as module output, an output signal representative of a temporal distribution of the sound volume of the vocal signal. This module calculates the local sound volume of the audio signal as input. Other improvements in the abovementioned modules and/or additions of other modules to the above list of modules may be made in order to provide characteristics essential to the subsequent analyses and thus improve the processing of the vocal signal.
These vocal characteristics are then analyzed, according to quality criteria of the incoming vocal signal, by modules according to at least one of the two following methods: - by calculating, from these characteristics, quantities representative of the quality levels of the vocal signal according to given quality criteria; or
- by comparing these input characteristics with given characteristics stored in memory and representative of given vocal models, the algorithm for the calculations of this type of analysis being contained in one or more modules. In the latter case, the use of a given module or of a given combination of given modules after receiving, as module input, a vocal signal and/or a signal after processing of the vocal signal, delivers a module output signal representative of a classification of at least a portion of the vocal signal in a given category of a given vocal criterion, according to the following steps: - reception of at least one portion of the signal or signals representative of at least one quantity;
- comparison of the quantity with at least one characteristic stored quantity of a given category threshold of at least one vocal signal according to a given vocal criterion and defining at least two regions, each region being associated with a category of the quality criterion;
- deduction of a category of the vocal criterion to which the quantity belongs; and - transmission of a signal representative of the category of the quality criterion provided, to which the vocal signal belongs.
It is possible, for example, to define as vocal quality criterion a sound pitch criterion, the stored categories then representing various sound pitches associated with configured frequency intervals representative of a set of pitches of an audible signal. Thus, having stored a series of reference pitches, defining voice pitch categories, it is then possible, by comparing the pitch of a vocal signal with them, to find the voice pitches contained in this signal.
For example, it is possible to have the following sound pitch categories: deep ([150; 250] Hz for example), medium ([275; 351] Hz for example), and medium-high ([351; 450] Hz for example), or else: bass, baritone, tenor, contralto, soprano, etc.
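Classification of a measured pitch into such categories amounts to simple thresholding, as in the sketch below, which reuses the purely illustrative interval values given above and is not a definitive implementation:

```python
def classify_pitch(f0_hz):
    """Classify a fundamental frequency into an example register category,
    using the illustrative intervals given in the text."""
    if 150.0 <= f0_hz <= 250.0:
        return "deep"
    if 275.0 <= f0_hz <= 351.0:
        return "medium"
    if 351.0 < f0_hz <= 450.0:
        return "medium-high"
    return "unclassified"

print(classify_pitch(320.0))   # "medium"
```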
In more particular processing, the input vocal signal is a sung voice from which it is intended to determine the notes emitted based on more complex criteria. A note of the sung signal is in particular identified from the pitch, its ends (start and end of the note) being located by detecting the breaks in the pitch curve. These breaks coincide with the local maxima of the modulus of the derivative (that is to say the high-slope points of inflexion). These breaks are combined with the natural boundaries between notes, such as long ranges of undetected pitch. It should be pointed out that notes sufficiently close together (in terms of time and pitch) are merged into a single note.
From the lists of notes stored and differentiated by these parameter types, it is possible to compare a sung note with a stored note so as to determine the correctness of the note sung with respect to the stored note model.
Likewise, it is possible to have a list of sets of notes, each set corresponding to a given vocal line or even to a given song, and then to compare the notes of the sung voice with these so as to determine the correctness of the vocal line sung with respect to the stored vocal line model.
Examples of modules involving a step of comparing a quantity of a vocal signal with at least one stored characteristic quantity representing a threshold between given categories according to a given vocal criterion are presented below:
— module M8 for classification in a given sound, the use of which comprises the steps consisting in: receiving, as a first module input, a vocal signal and, as a second module input, a signal representative of at least one local fundamental frequency corresponding to at least one respective speech zone of a vocal signal;
> comparing the vocal signal and the local fundamental frequency or frequencies with stored signal characteristics defining regions, each region being associated with a given sound category; deducing the sound category or categories to which the respective local fundamental frequency or frequencies of the input vocal signal belongs or belong; and; delivering, as module output, an output signal representative of the sound category deduced for each speech zone.
This module especially detects types of vowels present in the vocal signal, especially by means of the local pitch received on an input.
The various categories of vowels have been estimated by learning on the basis of examples of vowels pronounced with variable pitches.
The characteristics of these examples of vowels depend especially on the following two models: - a stored voice pitch model, already discussed earlier, also called a voice register, the voice pitch categories of which are defined by vocal frequency intervals;
- a model of voices, such as bright, nasal, stifled or deep voices, the characteristics of which especially include pitch levels and temporal envelope forms of the vocal signal representing a sound. Advantageously, the comparison step is carried out according to the following two main steps:
- comparison of the pitches of the input signal with stored pitches;
- deduction of the registers to which the respective pitches of the signal belong;
- comparison of the characteristics of the input signal with the stored voice models corresponding to the register deduced above; and
- deduction of a voice category corresponding to the deduced register and deduction likewise of a sound category. — module M20 for classification according to a given model (not shown in a figure), the use of which comprises the steps consisting in: receiving, as a first module input, a vocal signal and, as a second module input, a signal representative of at least one sound category of a vocal signal; > comparing the input vocal signal and the sound category or categories of the vocal signal with at least one stored signal quantity representing a leveled threshold, defining at least two regions, each region being associated with a given level depending on the given model type; deducing the level or levels to which the respective sound category or categories of the input vocal signal belongs or belong; and delivering, as module output, an output signal representative of the level or levels deduced from the vocal signal according to the given model.
— module M16 for calculating a voice pitch difference relative to a voice pitch model, the use of which comprises the steps consisting in: receiving, as module input, a signal representative of at least one fundamental frequency of a vocal signal;
> comparing the input fundamental frequency with a stored fundamental frequency model; deducing the frequency difference between the two fundamental frequencies; and
> delivering, as module output, an output signal representative of the deduced frequency difference between the two fundamental frequencies. This module calculates the separation between the input pitch of the module and a fixed pitch model.
The use of this module has already been discussed above.
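A difference between an input pitch and a stored pitch model, in the spirit of module M16, could for example be expressed in semitones, as in the following illustrative sketch (the function name and the 440 Hz example are assumptions):

```python
import numpy as np

def pitch_difference_semitones(f0_measured, f0_model):
    """Signed difference between the input pitch and a stored pitch model,
    expressed in semitones (12 semitones per octave)."""
    return 12.0 * np.log2(f0_measured / f0_model)

print(pitch_difference_semitones(466.16, 440.0))   # about +1 semitone
```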
- module M17 for calculating the difference in voice intonation relative to a voice intonation model, the use of which comprises the steps consisting in: > receiving, as module input, a signal representative of a temporal change of at least one fundamental frequency of a vocal signal; comparing the temporal change of the input fundamental frequency with a stored intonation model;
> deducing the difference between the two intonations; delivering, as module output, an output signal representative of the deduced difference between the two intonations.
This module calculates the separation between the input pitch of the module and a fixed intonation model.
— module M6 for classifying a quantity of a vocal signal, the use of which comprises the steps consisting in:
> receiving, as module input, a signal representative of at least one quantity of a vocal signal; comparing the input quantity with at least one stored quantity defining at least two regions, each region being associated with a given category of a given vocal criterion; deducing the category to which each quantity of the input vocal signal belongs; delivering, as module output, an output signal representative of the deduced category or categories to which the respective input quantity or quantities belong or belongs.
This module thresholds each quantity that is presented to it as input. - module M10 for classifying a quantity of a vocal signal according to an input parameter, the use of which comprises the steps consisting in: receiving, as a first module input, a signal representative of at least one quantity of a vocal signal and, as a second module input, a signal representative of at least one category of a parameter of a vocal signal; > comparing the input quantity with at least one stored quantity, defining at least two regions, each region being associated with a given category of a given vocal criterion, the value of each stored quantity being a function of the input parameter or parameters; deducing the category to which each quantity of the input vocal signal belongs;
> delivering, as module output, an output signal representative of the deduced category or categories to which the respective input quantity or quantities belongs or belong.
This module automatically thresholds an input quantity according to an input parameter.
The use of such modules, after the elementary signal processing steps controlled by the modules detailed above have been implemented, thus delivers, as output, a vocal signal quality level according to given models or quality criteria.
Thus, a quality criterion of a given vocal signal will be defined by a set of given modules connected together in a given combination and receiving, as input, at least one vocal signal and delivering, as output, a signal representative of a vocal signal quality level according to a quality criterion given by the combination of the modules.
Two broad categories of criteria may be defined: - the vocal quality criteria of the input vocal signal, which give a quality level of the emitted voice; with reference to figure 2, this category comprises the following criteria: vocal tonicity C6, vocal presence C7, vocal nasality C9, voice correctness C12 and voice intonation C13; - the quality criteria of a conversation, a conversation involving an interaction of a number of separate, preferably synchronized, vocal signals which give a quality level of the conversation; with reference to figure 2, this category comprises the following criteria: speech content of one of the vocal signals of the conversation Cl, content of long silences common to vocal signals of the conversation C2, number of long silences in one of the signals of the conversation C3, number of signal level overlaps between vocal signals of the conversation C4 and speech rate of one of the vocal signals of the conversation C5.
These various criteria are described one by one in the rest of this document:
- speech content criterion C1: this comprises, as shown in figure 3, the modules M2, M3 and M4A (the module M4A giving a temporal occupancy of a temporal signal within a fixed speech level), these being configured so that a vocal signal, after having been processed by a module M1, is received at the input of the module M2 and at the first input of the module M3, the output signals of the modules M2 and M3 are then sent to the second inputs of the modules M3 and M4A respectively, the output signal of the module M4A then being representative of the speech content in the vocal signal.
This criterion therefore makes it possible to obtain the speech time of the speaker relative to a signal duration.
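Chaining the illustrative helpers sketched earlier in this document reproduces, in miniature, the M2 -> M3 -> M4A chain of criterion C1; this is only a sketch of the modular composition under the assumptions stated with those helpers, not the claimed implementation:

```python
def speech_content_criterion(voice_frames_fft_mag, noise_frames_fft_mag):
    """Criterion C1 sketch: chain the module sketches M2 -> M3 -> M4A to obtain the
    speech content of an already digitized and framed vocal signal (module M1 being
    assumed to have produced the frames upstream)."""
    bdf = estimate_background_noise(noise_frames_fft_mag)      # module M2 sketch
    speech_mask = segment_speech(voice_frames_fft_mag, bdf)    # module M3 sketch
    return level_occupancy(speech_mask, level=1)               # module M4A sketch
```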
- common long-silence content criterion C2: this corresponds, as shown in figure 5, to a number n of modules M2, n modules M3 and a module M4B with n inputs (the module M4B giving a simultaneous temporal occupancy of n temporal signals within a fixed silence level), these being configured so that a number n of vocal signals (n is, in the example illustrated in figure 5, equal to 2 and the signals are labeled P1 and P2), after having each been processed by a module M1, are received, in the case of each of them, at a respective input of a module M2 and at a first input of a module M3 so that each module M2 or M3 receives only a single vocal signal, the output signal of each module M2 is then transmitted to the second input of the module M3 that has received the same vocal signal at its first input as that received by this module M2, each of the output signals of the modules M3 are then transmitted respectively to a single input of the module M4B so that each input of the module M4B receives only a single signal, the output signal of the module M4B then being representative of the content of long silences common to the n vocal signals.
This criterion makes it possible to obtain in particular the content of long silences common to n interlocutors in conversation. - criterion C3 for the number of long silences of a given vocal signal: this comprises, as shown in figure 7, two modules M2, two modules M3 and one module M4C having two inputs (the module M4C giving a number of long time intervals within a fixed silence level of a temporal signal), these being configured so that two vocal signals, after having each been processed by a module M1, are each received at a respective input of a module M2 and at a first input of a module M3 so that each module M2 or M3 receives only a single vocal signal, the output signal of each module M2 is then transmitted to the second input of the module M3 having received the same vocal signal at its first input as that received by this module M2, each of the output signals of the modules M3 are then sent respectively to a single input of the module M4C so that each input of the module M4C receives only a single signal, the output signal of the module M4C then being representative of the number of long silences of one of the two vocal signals received.
The criterion output number therefore represents the number of time intervals corresponding to a silence of a first interlocutor after an intervention by the second interlocutor.
- criterion C4 for the number of speech interruptions of a first signal: this comprises, as shown in figure 9, two modules M2, two modules M3 and one module M4D having two inputs (the module M4D giving a number of time intervals for which two signals have respectively the same fixed speech level), these being configured so that two vocal signals, after having each been processed by a module M1, are each received at a respective input of a module M2 and at a first input of a module M3 so that each module M2 or M3 receives only a single vocal signal, the output signal of each module M2 is then transmitted to the second input of the module M3 having received the same vocal signal at its first input as that received by this module M2, each of the output signals of the modules M3 are then respectively transmitted to a single input of the module M4D so that each input of the module M4D receives only a single signal, the output signal of the module M4D then being representative of the number of speech interruptions of one of the two vocal signals received. The criterion output number therefore represents the number of time intervals corresponding to an interruption of the speech of a first interlocutor by an intervention of the second interlocutor.
- speech rate criterion C5: this comprises, as shown in figure 11, the modules M2, M3 and M5 which are configured so that a vocal signal, after having been processed by a module M1, is received at the input of the module M2 and at the first respective inputs of the modules M3 and M5, the output signals of the modules M2 and M3 are then respectively transmitted to the second inputs of the modules M3 and M5, the output signal of the module M5 then being representative of the speech rate level in the vocal signal. This criterion therefore makes it possible to measure the speech rate of a speaker. This rate is expressed in a unit proportional to the number of phonemes pronounced by the speaker.
- vocal tonicity criterion C6: this comprises, as shown in figure 13, the modules M2, M3 and M9 that are configured so that a vocal signal, after having been processed by a module M1, is received at the input of the module M2 and at the respective first inputs of the modules M3 and M9, the output signals of the modules M2 and M3 are then respectively transmitted to the second inputs of the modules M3 and M9, the output signal of the module M9 then being representative of the level of vocal tonicity in the vocal signal. This criterion measures the tonicity of the voice of a speaker, this being inversely proportional to the vocal fatigue.
The vocal tonicity here is directly associated with the energy in the voice - it may also be representative of a breathiness level in the voice. A breath is recognized if the voice is not pure, that is to say if it also expends energy to generate background noise in addition to creating the desired sounds. It is especially by comparing the ratio of the energy of the vocal sound (that is to say the energy of the harmonic frequencies) to the energy of the nonvocal sound (that is to say the energy of the nonharmonic frequencies) that a vocal tonicity level can be determined.
To diagnose this level, it is necessary also to take account of the emitted sound, such as a vowel, a particular emitted vowel generating, of course, a larger or smaller number of nonharmonic frequencies than another particular vowel.
One way of carrying out such a diagnosis will be discussed later in this document.
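One possible concrete reading of this ratio, offered here only as a hedged sketch and not as the method actually claimed, is to compare the spectral energy close to the harmonics of the fundamental with the remaining energy. The fundamental frequency is assumed to be supplied by a pitch module (M7-like); the window, the band half-width and the test signals are illustrative.

```python
# Hedged sketch of a harmonic-to-nonharmonic energy ratio (vocal tonicity proxy).
import numpy as np

def tonicity_ratio(frame, sr, f0, half_band_hz=20.0):
    """Ratio of spectral energy near harmonics of f0 to energy everywhere else."""
    spectrum = np.abs(np.fft.rfft(frame * np.hanning(len(frame)))) ** 2
    freqs = np.fft.rfftfreq(len(frame), d=1.0 / sr)
    harmonic = np.zeros_like(spectrum, dtype=bool)
    k = 1
    while k * f0 < freqs[-1]:
        harmonic |= np.abs(freqs - k * f0) <= half_band_hz
        k += 1
    e_harm = spectrum[harmonic].sum()
    e_rest = spectrum[~harmonic].sum()
    return e_harm / e_rest if e_rest > 0 else float("inf")

if __name__ == "__main__":
    sr, f0 = 16000, 220.0
    t = np.arange(0, 0.05, 1 / sr)
    clean = sum(np.sin(2 * np.pi * k * f0 * t) / k for k in range(1, 6))
    breathy = clean + 0.8 * np.random.default_rng(0).standard_normal(t.size)
    # A breathy (noisier) voice should yield a lower ratio than a clean one.
    print(tonicity_ratio(clean, sr, f0) > tonicity_ratio(breathy, sr, f0))
```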
- vocal presence criterion C7: this comprises, as shown in figure 15, the modules M2, M3, M7, M8 and M11, the module M11 being a module M20 capable of classifying a vocal signal by level according to a given vocal presence model, these being configured so that a vocal signal, after having been processed by a module M1, is received at the input of the module M2 and at the first respective inputs of the modules M3, M7, M8 and M11, the output signals of the modules M2, M3, M7 and M8 are then transmitted respectively to the second inputs of the modules M3, M7, M8 and M11, the output signal of the module M11 then being representative of the vocal presence level in the vocal signal. This criterion measures the vocal presence of a speaker, that is to say a capability of a voice to hold the attention of its audience.
The vocal presence is determined in particular by detecting the low frequencies present in the signal.
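As a hedged illustration of this idea (the band limits and classification thresholds below are invented, not taken from the patent), one could estimate vocal presence from the fraction of spectral energy lying in a low-frequency band and classify it against stored thresholds in the manner of an M20 module.

```python
# Hedged sketch: low-frequency energy fraction as a rough "vocal presence" proxy.
import numpy as np

def low_frequency_fraction(frame, sr, band=(80.0, 300.0)):
    """Fraction of spectral energy inside an assumed chest-resonance band."""
    spectrum = np.abs(np.fft.rfft(frame * np.hanning(len(frame)))) ** 2
    freqs = np.fft.rfftfreq(len(frame), d=1.0 / sr)
    in_band = (freqs >= band[0]) & (freqs <= band[1])
    total = spectrum.sum()
    return float(spectrum[in_band].sum() / total) if total > 0 else 0.0

def presence_level(fraction):
    """M20-style classification of the fraction against invented thresholds."""
    if fraction < 0.15:
        return "little presence"
    if fraction < 0.40:
        return "moderate presence"
    return "strong presence"

if __name__ == "__main__":
    sr = 16000
    t = np.arange(0, 0.1, 1 / sr)
    chesty = np.sin(2 * np.pi * 120 * t) + 0.3 * np.sin(2 * np.pi * 900 * t)
    print(presence_level(low_frequency_fraction(chesty, sr)))
```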
- given voice model criterion: this comprises the modules M2, M3, M7, M8 and M20, the module M20 being capable of classifying a signal by level according to a given voice model, these being configured so that a vocal signal, after having been processed by a module M1, is received at the input of a module M2 and at the first respective inputs of the modules M3, M7, M8 and M20, the output signals of the modules M2, M3, M7 and M8 are then transmitted respectively to the second inputs of the modules M3, M7, M8 and M20, the output signal of the module M20 then being representative of a level of the voice model in the vocal signal. The given voice model is advantageously a vocal nasality.
We thus obtain the following criterion:
> vocal nasality criterion C9: this comprises, as shown in figure 17, a module M13, which is the module M20 capable of classifying a signal by level of vocal nasality.
This criterion measures the vocal nasality level of a speaker.
- voice correctness criterion C12: this comprises, as shown in figure 19, the modules M2, M3, M7 and M16, these being configured so that a vocal signal, after having been processed by a module M1, is received at the input of the module M2 and at the first respective inputs of the modules M3, M7 and M16, the output signals of the modules M2 and M3 are then respectively transmitted to the second inputs of the modules M3 and M7, the output signal of the module M7 is then transmitted to the input of the module M16, the output signal of the module M16 then being representative of a voice pitch difference in the vocal signal relative to a stored voice pitch model.
This criterion measures the correctness of the voice relative to a fixed model.
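For illustration only, the deviation computed by an M16-like comparison could be expressed in cents relative to the stored model; the cent unit and the per-zone averaging below are assumptions, not requirements of the patent.

```python
# Hedged sketch of an M16-like pitch comparison against a stored voice pitch model.
import math

def pitch_deviation_cents(measured_f0, model_f0):
    """Signed deviation, in cents, of each measured f0 from the corresponding model f0."""
    return [1200.0 * math.log2(m / ref) for m, ref in zip(measured_f0, model_f0)]

if __name__ == "__main__":
    model = [220.0, 247.0, 262.0]   # stored voice pitch model (Hz), illustrative
    sung = [223.0, 240.0, 262.0]    # measured local f0 per speech zone (Hz)
    devs = pitch_deviation_cents(sung, model)
    print([round(d, 1) for d in devs], "mean:", round(sum(devs) / len(devs), 1))
```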
- voice intonation criterion C13: this comprises, as shown in figure 21, the modules M2, M3, M7 and M17, these being configured so that a vocal signal, after having been processed by a module M1, is received at the input of the module M2 and at the first respective inputs of the modules M3, M7 and M17, the output signals of the modules M2 and M3 are then respectively transmitted to the second inputs of the modules M3 and M7, the output signal of the module M7 is then transmitted to the input of the module M17, the output signal of the module M17 then being representative of a difference in intonation in the vocal signal relative to a stored intonation model. This criterion measures the separation between the intonation of the voice of the speaker and that of a fixed model.
It should be noted that each of the criteria described above, C1, C2, C3, C4, C5, C6, C7, C9, C12 and C13, comprises at least one initial signal processing step. Each of these initial signal processing steps is controlled by a combination of the two modules M2 and M3, these being configured so that at least one vocal signal processed by the criterion in question is received at the input of the module M2 and at the first input of the module M3 respectively, and so that the output signal of the module M2 is then transmitted to the second input of the module M3. The output signal of the module M3 is then a signal representative of a division of the vocal signal into respective silence and speech time zones, which is transmitted to the other modules of the criterion in question.
A criterion from which the combination of these two modules has been removed also forms the subject of the present invention, provided that the processing of the vocal signal carried out upstream of the criterion in question makes it possible to deliver a signal representative of a division of the vocal signal into respective silence and speech time zones in a manner substantially identical to that of said combination of the modules M2 and M3.
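To make the role of this M2/M3 combination more concrete, the following sketch shows one plausible front end: thresholds estimated from a short-time energy contour (M2-like), then a frame-by-frame silence/speech labelling (M3-like). The percentile-based thresholds, the frame size and the test signal are assumptions for illustration, not values from the patent.

```python
# Hedged sketch of an M2/M3-style front end: threshold estimation then segmentation.
import numpy as np

FRAME = 320  # 20 ms at 16 kHz, an assumed frame size

def frame_energy(signal):
    """Mean squared amplitude per non-overlapping frame."""
    n = len(signal) // FRAME
    frames = np.reshape(signal[:n * FRAME], (n, FRAME))
    return np.mean(frames ** 2, axis=1)

def estimate_thresholds(energy):
    """M2-like: quietest frames approximate background noise, louder ones speech."""
    noise_ceiling = np.percentile(energy, 20)
    speech_floor = np.percentile(energy, 60)
    return noise_ceiling, speech_floor

def segment(energy, noise_ceiling, speech_floor):
    """M3-like: True = speech zone, False = silence zone (silence set by noise level)."""
    return energy >= max(speech_floor, 3.0 * noise_ceiling)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    sr = 16000
    t = np.arange(0, 1.0, 1 / sr)
    voice = np.where((t > 0.3) & (t < 0.7), np.sin(2 * np.pi * 200 * t), 0.0)
    signal = voice + 0.01 * rng.standard_normal(t.size)
    e = frame_energy(signal)
    mask = segment(e, *estimate_thresholds(e))
    print(f"{mask.mean():.2f} of the frames labelled as speech")
```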
These criteria, and others, may be used individually so as to obtain a quality level of a vocal signal or of a vocal conversation depending on the criterion in question. They may also be used jointly to obtain various quality levels of a vocal signal or of a vocal conversation depending on the criteria in question, thus yielding in the end a set of parameters that define a certain vocal quality.
More broadly, the quality level of a vocal signal or of a vocal conversation according to one or more quality criteria may be measured repeatedly over time, so that the change in that quality over the course of time according to the quality criteria in question can be observed.
In a preferred method according to the invention, an additional step is added after a given criterion has been applied to one or more input vocal signals. During this additional step, a given module or a given combination of additional given modules is employed, receiving, as input, at least the delivered signal representative of the quality level of the vocal signal according to the given quality criterion and delivering, as output, a signal representative of a diagnostic associated with the quality level according to the given quality criterion represented in the input signal.
By means of this additional step, it is thus possible to automatically diagnose a vocal condition, according to the quality criterion in question, on the basis of the quality level of the vocal signal, so as to know whether the level is, for example, good, moderate or poor as regards the quality criterion in question.
In one particular embodiment of modules, a diagnosis is obtained after transmission of at least one output signal of the quality criterion in question of a vocal signal to the input of a module M6, the stored categories of which are diagnostics associated respectively with quality level intervals according to the quality criterion in question. The output signal of the module M6 is then representative of the diagnostic whose associated level interval comprises the quality level of the vocal signal. By comparing the quality level of the signal with the stored levels, defining the stored level intervals, it is possible in the end to quantify a quality or a state of the vocal signal according to a scale of qualities or of states that is defined by these diagnostics and relates to the quality criterion in question.
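A minimal sketch of such an interval-based classification is given below; the bounds and labels are invented for illustration and merely stand in for the stored categories of an M6-like module.

```python
# Hedged sketch of an M6-style mapping from a quality level to a stored diagnostic.
from bisect import bisect_right

def diagnose(level, bounds=(0.3, 0.7), labels=("poor", "moderate", "good")):
    """Return the diagnostic whose stored level interval contains `level`."""
    return labels[bisect_right(bounds, level)]

if __name__ == "__main__":
    for level in (0.1, 0.5, 0.9):
        print(level, "->", diagnose(level))
```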
Referring to figures 4, 6, 8, 10, 12, 16, 18, 20 and 22, it is thus possible to have a diagnostic with regard to the quality of the vocal signal relating to the respective criteria for speech content C1, common long-silence content C2, number of long silences of a given vocal signal C3, number of speech interruptions of a first signal C4, speech rate C5, vocal presence C7, vocal nasality C9, voice correctness C12 and voice intonation C13.
In another particular embodiment of modules, with reference to figure 14, a vocal tonicity diagnostic is obtained after transmission of the signals delivered by the vocal tonicity criterion C6 to a set of modules consisting of the modules M7, M8 and M10. The stored categories used during the comparison step in the course of the implementation of the module M10 are diagnostics delimited by quantities representative of given levels according to the vocal tonicity criterion C6, each quantity depending on an input sound category of the module. The vocal tonicity criterion C6 and the modules M7, M8 and M10 are configured so that the vocal signal is furthermore transmitted to the first respective inputs of the modules M7 and M8, the output signal of the module M3 of the vocal tonicity criterion C6 is furthermore transmitted to the second input of the module M7, the output signal of the module M7 is then transmitted to the second input of the module M8, and the output signals of the module M8 and of the module M9 of the vocal tonicity criterion C6 are then respectively transmitted to the second and first inputs of the module M10, the output signal of the module M10 then being representative of a diagnostic associated with the vocal tonicity level of at least one portion of the vocal signal.
To produce a diagnostic for monitoring the vocal tonicity, a thresholding operation is thus carried out on the vocal tonicity, with threshold levels dependent on a pronounced sound, such as a vowel.
It is thus possible to rule on the tonal character, or on the contrary the tired character, of a voice.
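Only as a hedged illustration of such vowel-dependent thresholding (an M10-like step), the sketch below judges a tonicity ratio against thresholds that vary with the pronounced sound; every numeric bound and vowel category is invented.

```python
# Hedged sketch of vowel-dependent tonicity thresholding in the spirit of module M10.
THRESHOLDS = {          # sound category -> (very poor, poor, moderate) upper bounds
    "i": (1.0, 2.0, 4.0),
    "a": (0.8, 1.6, 3.0),
    "u": (0.6, 1.2, 2.5),
}
LABELS = ("very poor tonicity", "poor tonicity", "moderate tonicity", "good tonicity")

def tonicity_diagnostic(ratio, vowel):
    """Compare a harmonic/nonharmonic ratio with thresholds tied to the emitted vowel."""
    bounds = THRESHOLDS.get(vowel, THRESHOLDS["a"])
    for bound, label in zip(bounds, LABELS):
        if ratio < bound:
            return label
    return LABELS[-1]

if __name__ == "__main__":
    print(tonicity_diagnostic(1.5, "i"), "|", tonicity_diagnostic(1.5, "u"))
```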
A diagnostic signal for a quality criterion of a vocal signal may then be stored in memory and/or transmitted to at least one display means capable of interpreting the vocal diagnostic signal level so as to visibly display the level of the diagnostic.
Likewise, a quality level signal for at least one portion of at least one vocal signal according to a given quality criterion may be stored in memory and/or transmitted to at least one display means capable of interpreting the signal level so as to visibly display the quality level according to the quality criterion to which at least the portion of the vocal signal belongs.
It is also possible, and in the same manner, to monitor the time variation in the quality level of at least one portion of at least one vocal signal according to a given quality criterion.
In a more complete configuration, it is possible to display the time variation in the quality level of the vocal signal according to one or more given quality criteria by also having an indicator of the associated diagnostic, with for example gray levels associated with the respective various diagnostics. In this configuration in which the quality of the signal is defined by a certain number of criteria, it may then be envisioned to choose one or more particular processing operations suitable for correcting faults in the voice analyzed that are demonstrated by the diagnostics provided.
One method of training the voice is given here, as shown in figure 23, in which, after emission of a vocal signal at 10 and its digitization carried out by a module M0, a signal representative of a diagnostic of a given criterion is provided following the implementation of a module M1 for the criterion in question, here C, and of a diagnostic module M6 according to the given criterion. In this example, the module M6 possesses three types of diagnostic such as, for example: good as 1, moderate as 2 and poor as 3.
Depending on the result of the diagnostic, the person who has emitted the vocal signal may be directed at O1 toward suitable exercises.
Here, sensitive tasks T1a, T2a or T3a followed by respective vocal tasks T1b, T2b, T3b represent exercises provided according to whether the diagnostic issued gives, for example, a good, moderate or poor result, respectively.
This orientation O1 may, in one particular situation, be carried out automatically by associating with each stored diagnostic at least one proposal of vocal exercises tailored to the stored diagnostic. The signal representative of the diagnostic provided on the basis of at least one portion of at least one vocal signal is then accompanied by the emission of a signal representative of the proposal of vocal exercises that is associated with the diagnostic provided.
In the latter case, the signal representative of the proposal of vocal exercises that is associated with the diagnostic provided is transmitted to at least one display means capable of interpreting the level of the signal so as to visibly display the proposal of vocal exercises that is associated with the diagnostic provided.
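The automatic orientation O1 can be pictured, purely as a sketch with placeholder exercise texts and invented diagnostic codes, as a stored association between diagnostics and exercise proposals:

```python
# Hedged sketch of the O1 orientation: diagnostic code -> proposal of vocal exercises.
EXERCISES = {
    1: "sensitive task T1a followed by vocal task T1b (maintenance work)",
    2: "sensitive task T2a followed by vocal task T2b (targeted correction)",
    3: "sensitive task T3a followed by vocal task T3b (rest and gentle rebuilding)",
}

def orient(diagnostic_code):
    """Return (diagnostic code, exercise proposal) for display or storage."""
    return diagnostic_code, EXERCISES.get(diagnostic_code, "no proposal stored")

if __name__ == "__main__":
    print(orient(2))
```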
The progress made in the voice over the course of the exercises on the basis of the quality criterion in question may then be applied at A, thus completing at 20 the training procedure. Such vocal analyses may be carried out at one particular time or regularly, thus allowing individuals to be able to test, work on or further control their voices.
Diagnostics in the form of warnings in real time may advantageously be produced, so that individuals exercising their voices can be informed immediately or subsequently of any defect in their voices, and can try to correct this after the exercise or in real time.
In particular, it is possible to monitor, throughout the day, the vocal quality of a group of individuals. The vocal quality is measured by a diagnostic on voice samples taken at a rate matched to each individual. In addition to the detailed diagnostic with regard to each criterion, a multi-criterion warning may be calculated by adding together the single-criterion warnings.
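A trivial sketch of such an aggregation follows; the criterion names and the alert threshold are assumptions.

```python
# Hedged sketch: multi-criterion warning obtained by summing single-criterion warnings.
def multi_criterion_warning(single_warnings, alert_threshold=2):
    """single_warnings: dict criterion -> 0 (no warning) or 1 (warning)."""
    total = sum(single_warnings.values())
    return total, total >= alert_threshold

if __name__ == "__main__":
    warnings = {"C5 speech rate": 1, "C6 tonicity": 1, "C7 presence": 0}
    print(multi_criterion_warning(warnings))  # (2, True)
```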
In the same way, the analyses and/or the vocal exercises proposed may be carried out locally or remotely, using remote communication means such as the Internet, Minitel, telephone, etc. Examples of exercises tailored to diagnostics issued as output of the vocal analyses are presented below:
Example 1 : Vocal tonicity
The notion of vocal fatigue is evoked here. A voice fatigued by excessively intensive use, repeated shouting, intensive consumption of tobacco, a psychological shock or a generalized state of fatigue will show increased breathiness in the vocal signal. The voice is not pure, and it is in particular this greater or lesser amount of breath that will result in several types of remedy according to the following diagnostics:
- Moderate tonicity: slight fatigue.
An exercise, for better connecting breath and sound (use of a vowel of the "i" type), will suffice to correct this defect.
- Poor tonicity: real fatigue.
Low-sound-volume exercises, giving preference to incisive ("i" type) vowels over reduced intervals and reduced tessitura (few or no virtuoso exercises).
- Very poor tonicity: pathological fatigue.
The fact of continuing to speak or to sing would result in aphonia and the vocal cords must be rested.
Example 2: Vocal presence
The notion of presence of low frequencies in the voice is evoked here. Whatever the vocal register of the person, chest resonances are present. In contrast, a lack of bass resonance in the voice gives an impression of a thin or "green" voice. Several remedies are "prescribed" according to the following diagnostics:
- Little vocal presence. A specific task of relaxing the larynx and the tongue, within what is called a
"chest" register, will help these bass resonances to come back. The task will then be to maintain the presence of these bass resonances in the rest of the vocal register.
- Moderate vocal presence.
The exercises will promote the retention of this presence over the entire register and enrich it, the posture of the body being monitored during the exercises.
- Strong vocal presence.
Firstly, a check is made that the voice is not too "deep" or too "chesty" to the detriment of the "engagement" of the sound throughout the head.
Example 3: Speech rate
The criterion evokes the speed of elocution.
Several remedies are "prescribed" according to the following diagnostics:
- Too slow a rate.
The listener or interlocutor loses interest and becomes irritated. Reading exercises based on phrases or ends of phrases that are simple but repeated more and more quickly, after warming up the voice, will succeed in increasing the rate of elocution.
- Too rapid a rate.
It is proposed that the learner record his voice and listen to it. It is also proposed that he read a text at an imposed rate (metronome or karaoke type).
- Normal rate.
Take care that a constant rate, which could be wearying, does not become established, and check that the learner can slow down or speed up his rate at will.

Claims

1. A method of analyzing at least one vocal signal, characterized in that it is implemented by elementary signal processing operations managed by respective modules, each module being capable of converting at least one module input signal into a module output signal representative of a given characteristic of the module input signal, and in that it involves the use of a means for processing the signal from a given module or from a given combination of given modules that receive, as input, at least one vocal signal and deliver, as output, a signal representative of at least one quality level of the vocal signal according to a given quality criterion.
2. The method of analyzing at least one vocal signal as claimed in the preceding claim, characterized in that at least one quality criterion provided is a vocal quality criterion.
3. The method of analyzing at least one vocal signal as claimed in the preceding claim, characterized in that at least one vocal quality criterion is included in the following list: vocal tonicity, vocal presence, vocal nasality, voice correctness, voice intonation.
4. The method of analyzing at least one vocal signal as claimed in one of the preceding claims, characterized in that at least one quality criterion provided is a conversation quality criterion, a conversation involving an interaction of a plurality of separate vocal signals.
5. The method of analyzing at least one vocal signal as claimed in the preceding claim, characterized in that the vocal signals of the plurality of vocal signals of the conversation are synchronized.
6. The method of analyzing at least one vocal signal as claimed in either of the two preceding claims, characterized in that at least one conversation quality criterion is included in the following list: speech content of one of the vocal signals of the conversation, content of long silences common to vocal signals of the conversation, number of long silences in one of the signals of the conversation, number of signal level overlaps between vocal signals of the conversation, speech rate of one of the vocal signals of the conversation.
7. The method of analyzing at least one vocal signal as claimed in one of the preceding claims, characterized in that at least one of the modules capable of controlling elementary signal processing operations and implemented by the signal processing means belongs to the following list:
- background-noise and speech level estimation module, also denoted by M2, the use of which comprises the following steps: receiving, as module input, a vocal signal; delivering, as module output, a signal representative of at least one maximum background noise level threshold and a minimum speech level threshold of the vocal signal received as module input;
- silence zone and speech zone segmentation module, also denoted by M3, the use of which comprises the following steps: receiving, as a first module input, a vocal signal and, as a second module input, a signal representative of respective background noise and speech thresholds of a vocal signal; and delivering, as module output, an output signal representative of a division of the input vocal signal into respective silence and speech time zones, the silence being at least partly defined by the background noise, the output signal having a given signal level for the silence zones and another given signal level for the speech zones; - level occupancy module, also denoted by M4A, the use of which comprises the following steps:
> receiving, as module input, a temporal signal divided amplitudewise into at least two levels; and delivering, as module output, an output signal representative of the temporal occupancy of the temporal signal in a given signal level;
- common level occupancy module, also denoted by M4B, the use of which comprises the following steps: receiving, as at least two module inputs, at least two respective temporal signals, each divided amplitudewise into at least two levels; and delivering, as module output, an output signal representative of the simultaneous temporal occupancy of the temporal signals in a given signal level;
- module for the number of long level intervals, also denoted by M4C, the use of which comprises the following steps: receiving, as at least two module inputs, at least two respective temporal signals, each divided amplitudewise into at least two levels; and delivering, as module output, an output signal representative of the number of long time intervals within a given signal level of at least one temporal signal, an interval becoming long, on the basis of a stored threshold interval value, after a time interval of at least one other temporal signal in a level other than the given level;
- module for the number of level overlaps, also denoted by M4D, the use of which comprises the following steps: receiving, as at least two module inputs, at least two respective temporal signals, each divided amplitudewise into at least two levels; and
> delivering, as module output, an output signal representative of the number of time intervals for which at least two signals have respectively the same given signal level, at least one of these signals not having this given level after the interval and at least one other of these signals not having this given level after the interval;
— module for segmentation of the steady zone, also denoted by M5, the use of which comprises the following steps: receiving, as a first module input, a vocal signal and, as a second module input, a signal representative of the division of a vocal signal into respective silence and speech time zones; and
> delivering, as module output, an output signal representative of a division of the input vocal signal into steady and nonsteady zones - a zone of the vocal signal is steady if the portion of the signal that it contains is sufficiently distinct from the portions of the signal that are adjacent to the zone, and especially if there is a sufficient break between characteristics of the signal contained in the zone as zone output and/or input and characteristics of the portions of the signal that are adjacent to the zone, and such a break is sufficient if it is greater than a stored threshold break value, the output signal consisting of the vocal input signal with a given signal level replacing the silence zones and the nonsteady zones;
- sound pitch module, also denoted by M7, the use of which comprises the following steps: receiving, as a first module input, a vocal signal and, as a second module input, a signal representative of a division of a vocal signal into respective silence and speech time zones; and delivering, as module output, an output signal representative of the respective local fundamental frequencies of each speech zone of the vocal signal; - harmonic energy distribution module, also denoted by M9, the use of which comprises the following steps:
> receiving, as a first module input, a vocal signal and, as a second module input, a signal representative of a division of a vocal signal into respective silence and speech time zones; and > delivering, as module output, an output signal representative of an energy distribution according to the harmonics of the speech zones of the vocal signal;
- sound volume module, also denoted by M18, the use of which comprises the following steps: receiving, as module input, a vocal signal; and delivering, as module output, an output signal representative of a temporal distribution of the sound volume of the vocal signal.
8. The method of analyzing at least one vocal signal as claimed in one of the preceding claims, characterized in that it comprises the use, by the signal processing means, of a given module or of a given combination of given modules comprising, as input, at least one vocal signal and/or a signal after the vocal signal has been processed and delivering, as output, a signal representative of a classification of at least one portion of the vocal signal in a given category of a given vocal criterion, according to the following steps:
- reception of at least one portion of the signal or signals representative of at least one quantity;
- comparison of the quantity with at least one stored quantity characteristic of a given category threshold of at least one vocal signal according to a given vocal criterion and defining at least two regions, each region being associated with a category of the quality criterion;
- deduction of a category of the vocal criterion to which the quantity belongs; and
- emission of a signal representative of the category of the quality criterion provided, to which the vocal signal belongs.
9. The method of analyzing at least one vocal signal as claimed in the preceding claim, characterized in that at least one module capable of classifying at least a portion of at least one vocal signal in a category of a given vocal criterion is included in the following list:
— module for classification into a given sound, also denoted by M8, the use of which comprises the following steps: receiving, as a first module input, a vocal signal and, as a second module input, a signal representative of at least one local fundamental frequency corresponding to at least one respective speech zone of a vocal signal; comparing the vocal signal and the local fundamental frequency or frequencies with stored signal characteristics defining regions, each region being associated with a given sound category; deducing the sound category or categories to which the respective local fundamental frequency or frequencies of the input vocal signal belongs or belong; and;
> delivering, as module output, an output signal representative of the sound category deduced for each speech zone;
— module for classifying a given vocal nasality, also denoted by M13, the use of which comprises the following steps:
> receiving, as a first module input, a vocal signal and, as a second module input, a signal representative of at least one sound category of a vocal signal; comparing the input vocal signal and the sound category or categories of the vocal signal with at least one stored signal quantity representing a level threshold, defining at least two regions, each region being associated with a given vocal nasality level; deducing the level or levels to which the sound category or categories of the input vocal signal belongs or belong; and delivering, as module output, an output signal representative of the deduced vocal nasality level or levels of the vocal signal; — module for calculating the voice pitch difference relative to a voice pitch model, also denoted by M16, the use of which comprises the following steps: receiving, as module input, a signal representative of at least one fundamental frequency of a vocal signal; > comparing the input fundamental frequency with a stored fundamental frequency model;
> deducing the frequency difference between the two fundamental frequencies; and delivering, as module output, an output signal representative of the deduced frequency difference between the two fundamental frequencies;
- module for calculating the voice intonation difference relative to a voice intonation model, also denoted by M17, the use of which comprises the following steps: receiving, as module input, a signal representative of a temporal change of at least one fundamental frequency of a vocal signal; > comparing the temporal change of the input fundamental frequency with a stored intonation model; deducing the difference between the two intonations; delivering, as module output, an output signal representative of the deduced difference between the two intonations; — module for classifying a quantity of a vocal signal, also denoted by M6, the use of which comprises the following steps: receiving, as module input, a signal representative of at least one quantity of a vocal signal; comparing the input quantity with at least one stored quantity defining at least two regions, each region being associated with a given category of a given vocal criterion; deducing the category to which each quantity of the input vocal signal belongs; delivering, as module output, an output signal representative of the deduced category or categories to which the input quantity or quantities belongs or belong;
- module for classifying a quantity of a vocal signal according to an input category, also denoted by M10, the use of which comprises the following steps: receiving, as a first module input, a signal representative of at least one quantity of a vocal signal and, as a second module input, a signal representative of at least one parameter of a vocal signal; comparing the input quantity with at least one stored quantity, defining at least two regions, each region being associated with a given category of a given vocal criterion, the value of each stored quantity being a function of the input parameter or parameters;
> deducing the category to which each quantity of the input vocal signal belongs; > delivering, as module output, an output signal representative of the deduced category or categories to which the respective input quantity or quantities belongs or belong.
10. The method of analyzing at least one vocal signal as claimed in one of the preceding claims, characterized in that it comprises the use, by the signal processing means, of a quality criterion of a stored vocal signal, a quality criterion of a given vocal signal being defined by a set of given modules connected together in a given combination and receiving, as input, at least one vocal signal and delivering, as output, a signal representative of a vocal signal quality level according to a quality criterion given by the combination of the modules.
11. The method of analyzing at least one vocal signal as claimed in one of the preceding claims combined with claims 7 and 10, characterized in that it comprises the use, by the signal processing means, of a speech content criterion (C1) of a vocal signal, the speech content criterion (C1) comprising the module M4A, the module M4A giving a temporal occupancy of the temporal signal within a fixed speech level, configured so that a signal representative of a division of a vocal signal into respective silence and speech time zones is received at the input of the module M4A, the output signal of the module M4A then being representative of speech content in the vocal signal.
12. The method of analyzing at least one vocal signal as claimed in one of the preceding claims combined with claims 7 and 10, characterized in that it comprises the use, by the signal processing means, of a common long-silence content criterion (C2) for a number n of vocal signals, the common long-silence content criterion (C2) comprising a module M4B having n inputs, the module M4B giving a simultaneous temporal occupancy of n temporal signals within a fixed silence level, configured so that n signals representative respectively of n respective divisions of the n vocal signals into respective silence and speech time zones is received at the input of the module M4B so that each input of the module M4B receives only a single signal, the output signal of the module M4B then being representative of the content of long silences common to the n vocal signals.
13. The method of analyzing at least one vocal signal as claimed in one of the preceding claims combined with claims 7 and 10, characterized in that it comprises the use, by the signal processing means, of a criterion (C3) for the number of long silences of a given vocal signal of two vocal signals, the criterion (C3) for the number of long silences of a given vocal signal comprising a module M4C with two inputs, the module M4C giving a number of long time intervals within a fixed silence level of one of the temporal signals, configured so that two signals respectively representative of two respective divisions of the two vocal signals into respective silence and speech time zones is received at the input of the module M4C so that each input of the module M4C receives only a single signal, the output signal of the module M4C then being representative of the number of long silences of one of the two vocal signals received.
14. The method of analyzing at least one vocal signal according to one of the preceding claims combined with claims 7 and 10, characterized in that it comprises the use, by the signal processing means, of a criterion (C4) for the number of speech interruptions of a first signal of two vocal signals, the criterion (C4) for the speech interruptions of a first signal comprising a module M4D having two inputs, the module M4D giving a number of time intervals for which two signals have respectively the same fixed speech level, configured so that two signals respectively representative of two respective divisions of the two vocal signals into respective silence and speech time zones is received at the input of the module M4D so that each input of the module M4D receives only a single signal, the output signal of the module M4D then being representative of the number of speech interruptions of one of the two vocal signals received.
15. The method of analyzing at least one vocal signal as claimed in one of the preceding claims, combined with claims 7 and 10, characterized in that it comprises the use, by the signal processing means, of a speech rate criterion (C5) of a vocal signal, the speech rate criterion (C5) comprising the module M5 configured so that the vocal signal is received at the first input of the module M5 and so that a signal representative of a division of the vocal signal into respective silence and speech time zones is received at the second input of the module M5, the output signal of the module M5 then being representative of the speech rate level in the vocal signal.
16. The method of analyzing at least one vocal signal as claimed in one of the preceding claims combined with claims 7 and 10, characterized in that it comprises the use, by the signal processing means, of a vocal tonicity criterion (C6) of a vocal signal, the vocal tonicity criterion (C6) comprising the module M9 configured so that the vocal signal is received at the first input of the module M9 and so that a signal representative of a division of the vocal signal into respective silence and speech time zones is received at the second input of the module M9, the output signal of the module M9 then being representative of the level of vocal tonicity in the vocal signal.
17. The method of analyzing at least one vocal signal as claimed in one of the preceding claims combined with claims 7, 9 and 10, characterized in that it comprises the use, by the signal processing means, of a vocal presence criterion (C7) of a vocal signal, the vocal presence criterion (C7) comprising the modules M7, M8 and M20, the module M20 being capable of classifying a vocal signal by level according to a given vocal presence model, these being configured so that the vocal signal is received at the first respective inputs of the modules M7, M8 and M20, so that a signal representative of a division of the vocal signal into respective silence and speech time zones is received at the input of the module M7, and so that the output signals of the modules M7 and M8 are then transmitted respectively to the second inputs of the modules M8 and M20, the output signal of the module M20 then being representative of the vocal presence level in the vocal signal.
18. The method of analyzing at least one vocal signal as claimed in one of the preceding claims combined with claims 7, 9 and 10, characterized in that it comprises the use, by the signal processing means, of a vocal nasality criterion (C9) of a vocal signal, the given voice model criterion (C9) comprising the modules M7, M8 and M13 configured so that the vocal signal is received at the first respective inputs of the modules M7, M8 and M13, so that a signal representative of a division of the vocal signal into respective silence and speech time zones is received at the input of the module M7 and so that the output signals of the modules M7 and M8 are then transmitted respectively to the second inputs of the modules M8 and M13, the output signal of the module M13 then being representative of the voice model level in the vocal signal.
19. The method of analyzing at least one vocal signal as claimed in one of the preceding claims combined with claims 7, 9 and 10, characterized in that it comprises the use, by the signal processing means, of a voice correctness criterion (C12) of a vocal signal, the voice correctness criterion (C12) comprising the modules M7 and M16, these being configured so that the vocal signal is received at the first respective inputs of the modules M7 and M16, so that a signal representative of a division of the vocal signal into respective silence and speech time zones is received at the input of the module M7 and so that the output signal of the module M7 is then transmitted to the input of the module M16, the output signal of the module M16 then being representative of a voice pitch difference in the vocal signal relative to a stored voice pitch model.
20. The method of analyzing at least one vocal signal as claimed in one of the preceding claims combined with claims 7, 9 and 10, characterized in that it comprises the use, by the signal processing means, of a voice intonation criterion (C13) of a vocal signal, the voice intonation criterion (C13) comprising the modules M7 and M17, these being configured so that the vocal signal is received at the first respective inputs of the modules M7 and M17, so that a signal representative of a division of the vocal signal into respective silence and speech time zones is received at the input of the module M7 and so that the output signal of the module M7 is then transmitted to the input of the module M17, the output signal of the module M17 then being representative of a difference in intonation in the vocal signal relative to a stored intonation model.
21. The method of analyzing at least one vocal signal as claimed in one of claims 10 to 20, characterized in that it comprises the use, by the signal processing means, of a criterion, furthermore involving at least one initial operation of processing a vocal signal, each initial processing operation being controlled by a combination of the two modules M2 and M3 which are configured so that at least one vocal signal processed by the criterion is received at the input of the module M2 and at the first input of the module M3 respectively, so that the output signal of the module M2 is then transmitted to the second input of the module M3 and so that the output signal of the module M3 representing a signal representative of a division of the vocal signal into respective silence and speech time zones is then transmitted to one or more other modules of the criterion.
22. The method of analyzing at least one vocal signal as claimed in one of the preceding claims, characterized in that it furthermore comprises the use, by the signal processing means, of a given module or a given combination of given modules, comprising, as input, at least the delivered signal representative of a quality level of the vocal signal according to a given quality criterion and delivering, as output, a signal representative of a diagnostic associated with the quality level according to the given quality criterion represented in the input signal.
23. The method of analyzing at least one vocal signal as claimed in the preceding claim and one of claims 10 to 15 or 17 to 20, optionally combined with claim 21, characterized in that a diagnostic is provided after implementation, by the signal processing means, of a transmission of a signal of a quality level of at least one vocal signal according to a given quality criterion to a module M6, the stored categories of which are diagnostics associated respectively with quality level intervals according to the quality criterion in question, the output signal of the module M6 is then representative of a diagnostic for which the level interval that is associated with it comprises the quality level of the vocal signal.
24. The method of analyzing at least one vocal signal as claimed in claims 7, 9 and 16, optionally combined with claim 21, characterized in that it furthermore comprises the implementation, by the signal processing means, of a transmission of signals delivered by the vocal tonicity criterion (C6) to a set of modules consisting of the modules M7, M8 and M10, the stored categories used during the comparison step over the course of the implementation of the module M10 are diagnostics delimited by representative quantities of given levels of vocal tonicity, each quantity depending on an input sound category of the module, the vocal tonicity criterion (C6) and the modules M7, M8 and M10 being configured so that the vocal signal is furthermore transmitted to the first respective inputs of the modules M7 and M8, the signal representative of a division of the vocal signal into respective silence and speech time zones transmitted to the vocal tonicity criterion (C6) is furthermore transmitted to the second input of the module M7, the output signal of the module M7 is then transmitted to the second input of the module M8, the output signals of the module M8 and of the module M9 of the vocal tonicity criterion C6 are then respectively transmitted to the second and first inputs of the module M10, the output signal of the module M10 then being representative of a diagnostic associated with the vocal tonicity level of at least one portion of the vocal signal.
25. The method of analyzing at least one vocal signal as claimed in one of the three preceding claims, characterized in that a diagnostic signal is transmitted to a means of storing the diagnostic, in order to be stored therein and/or is transmitted to a display means capable of interpreting the vocal diagnostic signal level so as to visibly display the level of the diagnostic.
26. The method of analyzing at least one vocal signal as claimed in the preceding claim, characterized in that the diagnostic signal transmitted to the display means is capable of generating a particularly visible display if the diagnostic signal has a certain signal level, this particularly visible display acting as a warning.
27. The method of analyzing at least one vocal signal as claimed in one of the three preceding claims, characterized in that at least one quality level signal of at least one portion of at least one vocal signal according to a given quality criterion is transmitted to a means of storing the quality level of a vocal signal in order to be stored therein and/or is transmitted to a display means capable of interpreting the level of the signal so as to visibly display the quality level according to the quality criterion to which the vocal signal belongs.
28. The method of analyzing at least one vocal signal as claimed in the preceding claim, characterized in that the display means allows the variation in time of the quality level of at least one portion of at least one vocal signal according to a given quality criterion to be displayed.
29. A voice control training method, characterized in that it comprises a method of analyzing at least one vocal signal as claimed in one of claims 1 to 21 combined with one of claims 22 to 24, in that each stored diagnostic is associated with at least one proposal of vocal exercises that are suited to the stored diagnostic, and in that the signal representative of the diagnostic provided on the basis of at least one portion of at least one vocal signal is accompanied by the emission of a signal representative of the proposal of vocal exercises that is associated with the diagnostic provided.
30. The voice control training method as claimed in the preceding claim, characterized in that the signal representative of the proposal of vocal exercises that is associated with the diagnostic provided is transmitted to a display means capable of interpreting the level of the signal so as to visibly display this proposal of vocal exercises that is associated with the diagnostic provided.
31. A method of vocal analysis using a speech content criterion (C1) of a vocal signal carried out by a signal processing device in accordance with a method as claimed in claims 7 and 10, characterized in that it comprises the modules M2, M3 and M4A, one signal input capable of receiving a vocal signal being connected to the input of the module M2 and to the first input of the module M3, the outputs of the modules M2 and M3 being connected to the second inputs of the modules M3 and M4A, respectively.
32. A method of vocal analysis using a common long-silence content criterion (C2) of a vocal signal carried out by a signal processing device in accordance with a method as claimed in claims 7 and 10, characterized in that it comprises n modules M2, n modules M3 and a module M4B having n inputs, the n signal inputs each capable of receiving a vocal signal being each connected to a respective input of a module M2 and to a first input of a module M3 so that each module M2 or M3 can receive only a single vocal signal, the output of each module M2 being connected to the second input of the module M3 that has been able to receive the same vocal signal at its first input as that received by this module M2, each of the outputs of the modules M3 being respectively connected to a single input of the module M4B so that each input of the module M4B can receive only a single signal.
33. A method of vocal analysis using a criterion (C3) for the number of long silences of a given vocal signal carried out by a signal processing device in accordance with a method as claimed in claims 7 and 10, characterized in that it comprises two modules M2, two modules M3 and one module M4C having two inputs, two signal inputs capable of each receiving a vocal signal being each connected to a respective input of a module M2 and to a first input of a module M3 so that each module M2 or M3 can receive only a single vocal signal, the output of each module M2 being connected to the second input of the module M3 that has been able to receive the same vocal signal at its first input as that received by this module M2, each of the outputs of the modules M3 being respectively connected to a single input of the module M4C so that each input of the module M4C can receive only a single vocal signal.
34. A method of vocal analysis using a criterion (C4) for the number of speech interruptions of a first signal carried out by a signal processing device in accordance with a method as claimed in claims 7 and 10, characterized in that it comprises two modules M2, two modules M3 and one module M4D having two inputs, two signal inputs each capable of receiving a vocal signal being each connected to a respective input of a module M2 and to a first input of the module M3 so that each module M2 or M3 can receive only a single vocal signal, the output of each module M2 being connected to the second input of the module M3 that has been able to receive the same vocal signal at its first input as that received by this module M2, each of the outputs of the modules M3 being respectively connected to a single input of the module M4D so that each input of the module M4D can receive only a single signal.
35. A method of vocal analysis using a speech rate criterion (C5) carried out by a signal processing device in accordance with a method as claimed in claims 7 and 10, characterized in that it comprises the modules M2, M3 and M5, one signal input capable of receiving a vocal signal being connected to the input of the module M2 and to the first respective inputs of the modules M3 and M5, the outputs of the modules M2 and M3 being connected to the second inputs of the modules M3 and M5, respectively.
36. A method of vocal analysis using a vocal tonicity criterion (C6) carried out by a signal processing device in accordance with a method as claimed in claims 7 and 10, characterized in that it comprises the modules M2, M3 and M9, one signal input capable of receiving a vocal signal being connected to the input of the module M2 and to the first respective inputs of the modules M3 and M9, the outputs of the modules M2 and M3 being connected to the second inputs of the modules M3 and M9, respectively.
37. A method of vocal analysis using a vocal presence criterion (C7) carried out by a signal processing device in accordance with a method as claimed in claims 7, 9 and
10, characterized in that it comprises the modules M2, M3, M7, M8 and M20, the module M20 being capable of classifying a vocal signal per level according to a given vocal presence model, one signal input capable of receiving a vocal signal being connected to the input of the module M2 and to the first respective inputs of the modules M3, M7, M8 and M20, the outputs of the modules M2, M3, M7 and M8 being connected to the second inputs of the modules M3, M7, M8 and M20, respectively.
38. A method of vocal analysis using a vocal nasality criterion (C9) carried out by a signal processing device in accordance with a method as claimed in claims 7, 9 and 10, characterized in that it comprises the modules M2, M3, M7, M8 and M13, one signal input capable of receiving a vocal signal being connected to the input of the module M2 and to the first respective inputs of the modules M3, M7, M8 and M13, the outputs of the modules M2, M3, M7 and M8 being connected to the second inputs of the modules M3, M7, M8 and M13, respectively.
39. A method of vocal analysis using a voice correctness criterion (C12) carried out by a signal processing device in accordance with a method as claimed in claims 7, 9 and 10, characterized in that it comprises the modules M2, M3, M7 and M16, one signal input capable of receiving a vocal signal being connected to the input of the module M2 and to the first respective inputs of the modules M3, M7 and M16, the outputs of the modules M2 and M3 being connected to the second inputs of the modules M3 and M7, respectively, and the output of the module M7 being connected to the input of the module M16.
40. A method of vocal analysis using a voice intonation criterion (C13) carried out by a signal processing device in accordance with a method as claimed in claims 7, 9 and 10, characterized in that it comprises the modules M2, M3, M7 and M17, one signal input capable of receiving a vocal signal being connected to the input of the module M2 and to the first respective inputs of the modules M3, M7 and M17, the outputs of the modules M2 and M3 being connected to the second inputs of the modules M3 and M7, respectively, and the output of the module M7 being connected to the input of the module M17.
41. A method of vocal analysis using a set of modules for diagnosing a vocal tonicity level carried out by a signal processing device in accordance with a method as claimed in claims 7 and 9, characterized in that it comprises a vocal tonicity criterion
(C6) in accordance with claim 36 and modules M7, M8 and M10, the categories stored and used in the comparison step during the use of the module M10 are diagnostics defined by quantities representative of given vocal tonicity levels, each quantity depending on an input sound category of the module, the signal input (capable of receiving a vocal signal) of the vocal tonicity criterion (C6) being furthermore connected to the input and to the first respective inputs of the modules M7 and M8, the output of the module M3 of the vocal tonicity criterion (C6) being furthermore connected to the second input of the module M7, the output of the module M7 being connected to the second input of the module M8, and the outputs of the module M8 and of the module M9 of the vocal tonicity criterion (C6) being connected to the second and first inputs of the module M10, respectively.
PCT/IB2003/006355 2002-11-27 2003-11-27 Analysis of the vocal signal quality according to quality criteria WO2004049303A1 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
AU2003288475A AU2003288475A1 (en) 2002-11-27 2003-11-27 Analysis of the vocal signal quality according to quality criteria

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
FR0214865A FR2847706B1 (en) 2002-11-27 2002-11-27 ANALYSIS OF THE QUALITY OF VOICE SIGNAL ACCORDING TO QUALITY CRITERIA
FR0214865 2002-11-27

Publications (1)

Publication Number Publication Date
WO2004049303A1 true WO2004049303A1 (en) 2004-06-10

Family

ID=32241659

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/IB2003/006355 WO2004049303A1 (en) 2002-11-27 2003-11-27 Analysis of the vocal signal quality according to quality criteria

Country Status (3)

Country Link
AU (1) AU2003288475A1 (en)
FR (1) FR2847706B1 (en)
WO (1) WO2004049303A1 (en)

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US4377158A (en) * 1979-05-02 1983-03-22 Ernest H. Friedman Method and monitor for voice fluency
GB2345183A (en) * 1998-12-23 2000-06-28 Canon Res Ct Europe Ltd Monitoring speech presentation
US6397185B1 (en) * 1999-03-29 2002-05-28 Betteraccent, Llc Language independent suprasegmental pronunciation tutoring system and methods

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
WANG K ET AL: "AUDITORY ANALYSIS OF SPECTRO-TEMPORAL INFORMATION IN ACOUSTIC SIGNALS", IEEE ENGINEERING IN MEDICINE AND BIOLOGY MAGAZINE, IEEE INC. NEW YORK, US, vol. 14, no. 2, 1 March 1995 (1995-03-01), pages 186 - 194, XP000505069, ISSN: 0739-5175 *

Also Published As

Publication number Publication date
FR2847706B1 (en) 2005-05-20
FR2847706A1 (en) 2004-05-28
AU2003288475A1 (en) 2004-06-18

Similar Documents

Publication Publication Date Title
US9165562B1 (en) Processing audio signals with adaptive time or frequency resolution
EP2549475B1 (en) Segmenting audio signals into auditory events
EP1393300B1 (en) Segmenting audio signals into auditory events
Jensen Timbre models of musical sounds
USRE43406E1 (en) Method and device for speech analysis
Mion et al. Score-independent audio features for description of music expression
US20060165239A1 (en) Method for determining acoustic features of acoustic signals for the analysis of unknown acoustic signals and for modifying sound generation
Van Zijl et al. The sound of emotion: The effect of performers’ experienced emotions on auditory performance characteristics
JP2008516288A (en) Extraction of melody that is the basis of audio signal
Seppänen et al. Prosody-based classification of emotions in spoken finnish.
US5522013A (en) Method for speaker recognition using a lossless tube model of the speaker&#39;s
WO2004049303A1 (en) Analysis of the vocal signal quality according to quality criteria
Półrolniczak et al. Analysis of the signal of singing using the vibrato parameter in the context of choir singers
JP2003057108A (en) Sound evaluation method and system therefor
Daikoku Shaping the Epochal Individuality and Generality: The Temporal Dynamics of Uncertainty and Prediction Error in Musical Improvisation
Chétry Computer models for musical instrument identification
McPherson The Perception of Harmonic Sounds
Stöter Separation and Count Estimation for Audio Sources Overlapping in Time and Frequency
US20110153316A1 (en) Acoustic Perceptual Analysis and Synthesis System
Półrolniczak et al. Analysis of the dependencies between parameters of the voice at the context of the succession of sung vowels
Park Musical Instrument Extraction through Timbre Classification
JP3899122B6 (en) Method and apparatus for spoken interactive language teaching
JP3899122B2 (en) Method and apparatus for spoken interactive language teaching
CN117746901A (en) Deep learning-based primary and secondary school performance scoring method and system
Puterbaugh Between location and place: A view of timbre through auditory models and sonopoietic space

Legal Events

Date Code Title Description
AK Designated states

Kind code of ref document: A1

Designated state(s): AE AG AL AM AT AU AZ BA BB BG BR BW BY BZ CA CH CN CO CR CU CZ DE DK DM DZ EC EE EG ES FI GB GD GE GH GM HR HU ID IL IN IS JP KE KG KP KR KZ LC LK LR LS LT LU LV MA MD MG MK MN MW MX MZ NI NO NZ OM PG PH PL PT RO RU SC SD SE SG SK SL SY TJ TM TN TR TT TZ UA UG US UZ VC VN YU ZA ZM ZW

AL Designated countries for regional patents

Kind code of ref document: A1

Designated state(s): BW GH GM KE LS MW MZ SD SL SZ TZ UG ZM ZW AM AZ BY KG KZ MD RU TJ TM AT BE BG CH CY CZ DE DK EE ES FI FR GB GR HU IE IT LU MC NL PT RO SE SI SK TR BF BJ CF CG CI CM GA GN GQ GW ML MR NE SN TD TG

121 Ep: the epo has been informed by wipo that ep was designated in this application
122 Ep: pct application non-entry in european phase
NENP Non-entry into the national phase

Ref country code: JP

WWW Wipo information: withdrawn in national office

Country of ref document: JP