WO2001086633A1 - Detection d'activite vocale et d'extremite de mot - Google Patents

Detection d'activite vocale et d'extremite de mot

Info

Publication number
WO2001086633A1
Authority
WO
WIPO (PCT)
Prior art keywords
vad
epd
value
frame
frames
Prior art date
Application number
PCT/IT2001/000221
Other languages
English (en)
Inventor
Francesco Beritelli
Original Assignee
Multimedia Technologies Institute - Mti S.R.L.
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Multimedia Technologies Institute - Mti S.R.L. filed Critical Multimedia Technologies Institute - Mti S.R.L.
Priority to AU58752/01A priority Critical patent/AU5875201A/en
Publication of WO2001086633A1 publication Critical patent/WO2001086633A1/fr

Classifications

    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/78Detection of presence or absence of voice signals
    • G10L25/87Detection of discrete points within a voice signal
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/27Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique
    • G10L25/30Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique using neural networks

Definitions

  • This invention relates to a low-complexity method for detection of voice activity that is insensitive to environmental noise, as well as to a method for segmentation of isolated words utilising said voice activity detection method, and to related apparatuses.
  • VAD Voice Activity Detection
  • a normal telephone conversation includes periods of voice activity (about 40% on average), during which at least one speaker is speaking, and periods with no voice activity (about 60% on average), characterised by silence or by the presence of environmental noise only, during which both speakers are listening to one another or are pausing between isolated words or within a single word.
  • the voice activity algorithm operates on voice segments (frames) having a duration of 10-20 ms.
  • a VAD device has various interesting application scenarios, the two main ones being voice coding and voice recognition.
  • In the first case, a VAD device is utilised:
  • to achieve discontinuous transmission (DTX), namely an operation mode in which transmission is disabled during all pause periods of a conversation, with the noticeable benefit that the channel band requirement is reduced and the transmission capability of a communication system is increased (radio mobile systems, satellite links, voice communications over the Internet); and
  • to lower the storage capacity requirement and consequently the costs of voice storage systems (telephone answering machines, voice files, etc.).
  • a VAD device as applied to radio mobile systems, such as GSM or UMTS, makes it possible to reduce both co-channel interference, thereby increasing system capacity in terms of the number of users enabled to access a base station, and the power consumption of the mobile terminal, thereby increasing battery life.
  • a VAD device represents the first stage in a classification of variable bit rate (VBR) voice coders, in which a particular coding model is associated to each considered phonetic class, so that the bit rate dynamically matches the local characteristics of the transmitted voice signal.
  • VBR variable bit rate
  • a VAD device forms the first processing stage typically present in the word end point detector or EPD, as described by Lawrence Rabiner in "Applications of voice processing to telecommunications", Proceedings of the IEEE, vol. 82, no. 2, February 1994, pages 199-227.
  • the word boundaries or end points are the two starting and ending times of a word spoken by a speaker.
  • the limitations of a conventional VAD device are its sensitivity to environmental noise and a processing load that makes real-time execution particularly complex.
  • a simple comparison of the measured energy value to a threshold value is no longer sufficient, because many noise frames would be construed as voice frames and many voice frames would be erroneously construed as noise frames.
  • when environmental noise is superimposed on the voice, the phonetic content of the latter is altered, thereby noticeably complicating correct identification of the speech segments with respect to the pure environmental noise.
  • VAD voice activity frames construed as silence or environmental noise and vice versa
  • in voice coding, such errors entail the insertion of holes into the conversation and consequently a certain degradation in the quality perceived by the user.
  • in voice recognition, an erroneously operating VAD device entails erroneous identification of the word boundaries, which strongly influences performance in terms of recognition rate.
  • Figure 1 (appearing in Lawrence Rabiner and Biing-Hwang Juang, "Fundamentals of Speech Recognition", Prentice-Hall International, Inc., April 1993) shows a graph of total accuracy, expressed as a percentage, in recognition of digits as a function of the end point position variation, expressed in milliseconds. It is apparent that small errors in the estimation of the word boundaries often cause relatively significant degradations. By way of example, an error of ±30 ms (±3 frames) in the start point evaluation causes an accuracy decrease of 2%. As the end points move further from the manual mark, the recognition accuracy further decreases. Even if the results so obtained undoubtedly depend on the implemented recognition system, Figure 1 significantly evidences the strict connection existing between the EPD process and the performance of the recognition process.
  • a further aspect to be considered is related to the computational simplicity, which even now often represents a limitation particularly for applications in mobile apparatuses.
  • VAD devices generally operate according to so-called "threshold" algorithms, namely algorithms based upon decision criteria which utilise fixed values (in the case of fixed threshold algorithms) or values that vary as a function of the local behaviour of the signal (in the case of adaptive threshold algorithms).
  • the evaluation of the local signal characteristics is based upon whether suitably chosen parameters exceed said threshold values or not, thereby determining a decision about the nature of the signal itself.
  • a binary item of information known as a "flag" returns the result of this decision in terms of presence or absence of the voice signal.
  • VAD Voice Activity Detection
  • the ETSI GSM VAD device is substantially an energy detector based upon an adaptive threshold mechanism. It receives at its input port voice frames of 20 ms and, for each of them, it should be capable of establishing whether only background noise or active speech is involved.
  • the threshold level should be sufficiently higher than the noise level, in order to prevent noise from being identified as voice, but it should not be much higher, in order to prevent low energy voice segments from being identified as noise.
  • a suitable threshold location, therefore, is essential for good operation of this VAD device.
  • the input voice signal is filtered by means of an adaptive analysis filter, whose coefficients are computed starting from the autocorrelation coefficients of the input signal, averaged over four consecutive frames. This averaging step makes it possible to carry out a filtering operation aimed at lowering the noise content superimposed on the voice. As a result, a more reliable voice/noise discrimination is obtained.
  • both the threshold value and the adaptive filter coefficients are updated only when no speech is present, that is, only during periods in which only noise is present; during speech-containing periods, the values updated in the previous noise-only period remain valid.
  • the ETSI GSM VAD device provides for an additional fixed threshold located at a very low level, such that any signal having a level lower than said threshold is considered as background noise.
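The adaptive-threshold scheme described in the preceding points can be sketched as follows; the floor, margin and smoothing constants are illustrative assumptions, not the values of the ETSI GSM standard, and the adaptive analysis filter is omitted:

```python
def gsm_like_vad(frame_energies, floor=1e-4, margin=4.0, alpha=0.9):
    """Energy VAD with an adaptive threshold updated only on noise frames.

    floor  -- fixed low threshold: anything below it is background noise
    margin -- multiplicative gap kept between noise estimate and threshold
    alpha  -- smoothing factor for the noise-energy estimate
    (all three constants are illustrative, not the ETSI GSM values)
    """
    noise_est = frame_energies[0]  # assume the first frame is noise
    flags = []
    for e in frame_energies:
        threshold = max(noise_est * margin, floor)
        speech = e > threshold
        if not speech:  # adapt only while no speech is present
            noise_est = alpha * noise_est + (1 - alpha) * e
        flags.append(1 if speech else 0)
    return flags
```

Because the noise estimate is frozen during speech, a long utterance cannot drag the threshold up and swallow its own tail, which is the point of noise-only updating.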
  • the ITU-T G.729 Annex B VAD device, standardised for the 8 kbit/s ITU-T G.729 coder, enables a selection between two coding modes, according to whether the considered frame is an activity or an inactivity frame.
  • the decision of the VAD device is taken on frames having a duration of 10 ms.
  • the input signal is filtered by a high - pass filter having a cut-off frequency of 140 Hz in order to eliminate any undesired low frequency components.
  • the parameters needed to carry out the classification are extracted frame by frame, in particular the energy Ef in the broad band 0 to 4 kHz, the energy El in the low band 0 to 1 kHz, the zero crossing rate ZCR and a set S of 12 Line Spectral Frequencies (LSF).
  • the activity decision is taken only according to the energy parameter.
  • When the energy parameter is higher than 15 dB, the decision is in favour of activity; otherwise it is in favour of inactivity.
  • an initialisation stage is present in respect of the long term average values that will be used, for all frames subsequent to the initial ones, to compute the following four differential parameters: the differential energy ΔEf in the whole band; the differential energy ΔEl in the low band; the spectral distortion ΔS; and the differential zero crossing rate ΔZCR.
  • These parameters represent the difference between the effective value of a parameter and its average value computed in adaptive mode in the last noise frames.
  • the differential parameters are generated according to the following formulas:
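The formulas themselves have not survived in this copy of the text; from the definition in the preceding point (current value minus an adaptive average over recent noise frames), they can be reconstructed as a sketch, with the spectral distortion simplified to a sum of squared LSF differences (an assumption, not the exact G.729B measure):

```python
def differential_params(Ef, El, ZCR, S, means):
    """Differential parameters of a G.729B-style matching stage:
    each is the current value minus its running noise-frame average.
    `means` holds (Ef_bar, El_bar, ZCR_bar, S_bar); the spectral
    distortion is simplified here to a sum of squared LSF differences
    (an assumption -- the exact G.729B measure is not reproduced)."""
    Ef_bar, El_bar, ZCR_bar, S_bar = means
    dEf = Ef - Ef_bar
    dEl = El - El_bar
    dZCR = ZCR - ZCR_bar
    dS = sum((s - sb) ** 2 for s, sb in zip(S, S_bar))
    return dEf, dEl, dZCR, dS
```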
  • the subsequent stage is a matching stage in which an initial decision in respect of activity is taken by considering different regions in the space of the above four differential parameters.
  • the activity decision is given by the combination of the decision regions, while the non - activity decision is simply given by the complementary region.
  • ΔPi and ΔPk are two of the four differential parameters and a and b are suitable constants. If none of the above inequalities is fulfilled, the Boolean flag of the VAD device is set equal to 0 (silence or environmental noise). Otherwise, if at least one of the inequalities is fulfilled, the decision flag is set equal to 1 (voice).
  • the initial decision is filtered by means of a levelling or smoothing block, on the basis of the two previous frames, in order to avoid abrupt changes between activity and non-activity conditions.
  • the last block is related to the updating function for the average values. This updating operation should be effected only in respect of non-activity frames.
  • the decision that the frames are activity or non-activity frames is taken by a secondary VAD device that enables or disables the average value updating function.
  • An improved implementation of the G.729 VAD device was recently proposed by F. Beritelli, S. Casale and A. Cavallaro in "A Robust Voice Activity Detector for Wireless Communications Using Soft Computing", IEEE Journal on Selected Areas in Communications (JSAC), Special Issue on Signal Processing for Wireless Communications, Vol. 16, No. 9, Dec. 1998.
  • a new approach in respect of the matching stage based upon a set of rules of "fuzzy" logic, suitably obtained after a training stage, has been proposed.
  • a system of six rules receives as input the four differential parameters and produces as output, frame by frame, a continuous value in the range of 0 to 1.
  • by comparing this value to a threshold, the VAD device decision can be obtained.
  • the fuzzy VAD device offers better performances, since it has a lower number of misclassifications.
  • the voice signal is firstly processed by a module that measures a set of parameters. Subsequently, the boundaries of the word to be recognised are established by means of a threshold decision mechanism.
  • the energy and ZCR functions are subsequently computed for the whole input signal over frames of 10 ms duration.
  • the execution begins by locating, starting from the first frame, the point at which the energy profile exceeds both the lower and the upper energy thresholds, noting that it must not descend below the lower energy threshold before having exceeded the upper one. Such point, as identified by the lower threshold, is provisionally marked as the initial end point of the word.
  • the provisional final end point of the word is located in the same way.
  • the algorithm proceeds by examining the periods of 250 ms duration that precede the provisional initial end point and follow the final one.
  • in each such period, the number of frames in which the ZCR figure exceeds its threshold level is counted. If this number is equal to or higher than three, the definitive initial end point is displaced back to the last frame index at which the ZCR figure exceeds the threshold level; otherwise, it remains unaltered. The same procedure applies to the final end point.
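The ZCR refinement of the provisional start point can be sketched as follows; with 10 ms frames, a 250 ms period corresponds to 25 frames, and the text's "last index" is read here as the earliest exceeding frame in the window (an assumption):

```python
def refine_start_point(zcr, start, zcr_threshold, window=25):
    """Rabiner-Sambur-style refinement: examine the `window` frames that
    precede the provisional start point; if the ZCR exceeds its threshold
    in three or more of them, move the start back to the earliest such
    frame, otherwise leave the start point unchanged."""
    lo = max(0, start - window)
    crossings = [i for i in range(lo, start) if zcr[i] > zcr_threshold]
    if len(crossings) >= 3:
        return crossings[0]
    return start
```

The same routine, mirrored forward in time, would refine the final end point.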
  • the Rabiner and Sambur method has its main limitation precisely in the presence of stationary and non-stationary background noises, since it cannot satisfactorily discriminate the voice signal when the SNR figure is lower than 30 dB.
  • an initial end point is obtained by observing the M preceding frames, in the case that Ep > T holds (a·M) times in those M frames, where 0 < a < 1.
  • a final end point is obtained by observing the M preceding frames, in the case that Ep < T holds (a·M) times in those M frames.
  • This algorithm also automatically establishes the threshold value T during the first frames of 300 ms duration of initial silence.
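The Tsao and Gray decision rule just described can be sketched as follows; the frame measure Ep, the threshold T and the default constants M and a are placeholders:

```python
def fraction_rule_start(Ep, T, M=10, a=0.7):
    """Return the index of the first frame qualifying as an initial end
    point under the Tsao-Gray-style rule: a frame qualifies when, among
    the M frames that precede it, Ep > T held at least a*M times
    (0 < a < 1).  Returns None when no frame ever qualifies."""
    need = a * M
    for i in range(M, len(Ep)):
        if sum(1 for e in Ep[i - M:i] if e > T) >= need:
            return i
    return None
```

The mirrored condition (Ep < T at least a·M times) would locate the final end point.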
  • In the absence of noise, the Tsao and Gray method, even though it utilises a single parameter, operates acceptably even in the case of weak fricative sounds.
  • TF time - frequency
  • the logarithm of the r.m.s. value of the full-band energy is computed and then normalised and filtered.
  • the final TF parameter is obtained after filtering the sum of the two energy functions.
  • a noise adaptive threshold is computed starting from the first frames of the input signal and then the begin of the first vowel and the end of the last one (fundamental limits) are established by comparing the TF parameter to the above mentioned adaptive threshold.
  • a trimming procedure that also utilises the ZCR figure is applied by reversely running a fixed distance of 100 ms starting from the begin of the first vowel and of 150 ms starting from the end of the last vowel.
  • the EPD Fuzzy method comprises a first processing stage, represented by the above mentioned VAD Fuzzy method, followed by a post-processing stage that suitably processes the output of the VAD Fuzzy method in order to establish the word boundaries.
  • a median filter of the seventh order eliminates any abrupt variations in the fuzzy output; subsequently, the end points are obtained from the intersection of the filter output with a fixed threshold.
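The smoothing-and-intersection step can be sketched as follows; the threshold value used here is an illustrative assumption:

```python
def median_filter7(x):
    """Seventh-order median filter: each output sample is the median of
    a 7-sample window centred on the input sample (edges are clamped)."""
    n = len(x)
    out = []
    for i in range(n):
        lo, hi = max(0, i - 3), min(n, i + 4)
        w = sorted(x[lo:hi])
        out.append(w[len(w) // 2])
    return out

def end_points(fuzzy_out, threshold=0.5):
    """Word boundaries taken as the first and last frame where the
    smoothed fuzzy output exceeds the threshold (threshold value is an
    assumption -- the figure printed in the source is illegible)."""
    smooth = median_filter7(fuzzy_out)
    above = [i for i, v in enumerate(smooth) if v > threshold]
    return (above[0], above[-1]) if above else None
```

The median filter removes isolated spikes shorter than four frames, which is exactly the kind of abrupt variation the text describes.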
  • the performances of the EPD Fuzzy method are better than those of the previously analysed EPD algorithms, when either the environmental noise type or the SNR figure is varied.
  • the EPD Fuzzy method appears to be robust in respect of level variations of the signal.
  • Nevertheless, the EPD Fuzzy method has been found not to be fully satisfactory in respect of the processing load it introduces and of its robustness in the presence of noise.
  • a first step aimed at acquiring the voice signal divided into segments or frames having a time duration d not longer than 40 milliseconds (ms), more preferably in the range of 10 to 20 ms, still more preferably equal to 10 ms,
  • Y min corresponds to a silence frame and Y max corresponds to a voice activity frame.
  • the three parameters, namely the energy differential over the whole band ΔEf, the energy differential over the band 0 to 1 kHz ΔEl, and the zero crossing rate differential ΔZCR, can be computed, for each frame, in the second computation step, and said third neural network processing step is based upon said three parameters ΔEf, ΔEl and ΔZCR.
  • said neural network can be trained by means of the "Delta Learning Rule" and the voice signal employed for said training procedure can be a clean or noiseless voice signal and/or further audio signals obtained by adding babble noise, car noise, traffic noise and white noise, respectively, to said clean signal, with a SNR equal to 20 dB and 10 dB and possibly also with a SNR equal to 0 dB.
  • said neural network includes a perceptron having three inputs, an output and nine nodes in an intermediate stage, in which, still more preferably, the relationships between the inputs and the intermediate outputs as well as between the latter and the network output are linear relationships, still more preferably with pre-established and constant coefficients.
  • the VAD method can comprise, after the third neural network processing step, a further step for comparing the output values Y of the neural network to a threshold value, preferably given by the arithmetic mean of Ymin and Ymax.
  • an apparatus for detection of voice activity or VAD comprising one or more units for processing the voice signal, characterised in that it further comprises a neural network for receiving at its input port the data processed by said processing units, and in that it carries out the VAD method according to this invention.
  • the VAD apparatus can further comprise a high-pass filter arranged upstream of said processing units and/or a final comparison unit for comparing the output signal of the neural network to a threshold value.
  • It is also specific subject matter of this invention to provide an apparatus for segmentation of isolated words or EPD apparatus comprising one or more units for establishing the word boundaries, characterised in that it further comprises an apparatus for detection of voice activity or VAD apparatus connected upstream of said word boundary establishing units, and in that it carries out an EPD method according to this invention.
  • Figure 1 is a graph of the total accuracy, expressed as a percentage, of digit recognition, as a function of the variation, expressed in milliseconds, of the end point positions,
  • FIG. 2 is a block diagram of a preferred embodiment of the neural VAD apparatus according to this invention.
  • FIG. 3 is a block diagram of a preferred embodiment of the EPD apparatus according to this invention.
  • the spectral distortion parameter ΔS has not been considered, due to the fact that its complexity is higher by at least one order of magnitude than that of the other three parameters.
  • the above parameters have been computed over frames of 10 ms duration for a set of words comprising the Italian numerals (zero, one, two, three, four, five, six, seven, eight, nine) and some commands (delete, call, no, yes, OK, record, check) voiced by four speakers (two male and two female).
  • the starting database, which consists of 68 words, has been scaled to a level equal to -15.86 dBm0, with four different noise types, specifically babble noise, traffic noise, car noise and white noise, digitally added thereto, at three different SNR figures of 20 dB, 10 dB and 0 dB, respectively.
  • Since the VAD device should be able to discriminate between silence or pause frames and voice activity frames, it has been decided to consider exclusively, for each word, the twenty frames straddling each of the initial and final end points, the latter being detected by an ideal mark.
  • each scenario is formed by a number of 2720 vectors of 37 components equally distributed in the two above mentioned classes of voice activity and no-voice activity. Ordering of the various parameters has been carried out for each scenario. Table I shows the parameters related to the first eight ordered positions in each of said thirteen scenarios.
  • the preferred embodiment of the method for detection of voice activity according to this invention provides for the VAD method to be based only on the three parameters ΔEf, ΔEl and ΔZCR.
  • the Inventor has devised a non-linear matching technique that turns out to be particularly robust in respect of the various noise types superimposed on the voice in a telephone conversation carried out in noisy environments.
  • Such technique is based upon use of a suitably trained neural network which forms the matching block.
  • This block receives the individual parameters at its input port and outputs, frame by frame, a value in the range of 0 (pause) to 1 (voice activity), which is subsequently compared to a threshold value equal to 0.5 for the final decision.
  • the neural network includes a perceptron having three inputs (parameters ΔEf, ΔEl and ΔZCR), an output (in the range between 0 and 1) and nine nodes in an intermediate stage.
  • the network is trained by means of the above mentioned “Delta Learning Algorithm”.
  • the matching block is trained by means of the parameters retrieved from the "clean" signal and by adding "babble", "traffic” and “white” noises thereto, with signal-to-noise ratios of 20, 10 and 0 dB.
  • the training operation is carried out in such a way that the neural network is adapted to furnish, as a result of a given input value assembly, an output in the value range [0 - 1], in which values tending to 1 indicate a presence of voice activity and values tending to 0 indicate an absence of voice activity.
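A forward pass through such a 3-input, 9-node, 1-output network can be sketched as follows; the weights are random placeholders (the trained coefficients are not reproduced in this text), and the sigmoid squashing of the output is an assumption made here to keep Y inside [0, 1]:

```python
import math
import random

def mlp_vad_output(params, W1, b1, W2, b2):
    """Forward pass of a 3-input, 9-hidden-node, 1-output perceptron.
    params = (dEf, dEl, dZCR); returns Y in (0, 1) (sigmoid assumed)."""
    hidden = [sum(w * p for w, p in zip(row, params)) + b
              for row, b in zip(W1, b1)]          # 9 linear hidden nodes
    y = sum(w * h for w, h in zip(W2, hidden)) + b2
    return 1.0 / (1.0 + math.exp(-y))             # squash to (0, 1)

# Placeholder weights -- a real device would use the trained values.
random.seed(0)
W1 = [[random.uniform(-1, 1) for _ in range(3)] for _ in range(9)]
b1 = [random.uniform(-1, 1) for _ in range(9)]
W2 = [random.uniform(-1, 1) for _ in range(9)]
b2 = 0.0

Y = mlp_vad_output((1.2, 0.8, -0.3), W1, b1, W2, b2)
```

Values of Y tending to 1 would indicate voice activity and values tending to 0 its absence, matching the training target described above.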
  • Upon completion of the training stage, the values of the coefficients Wij, bj, Wj and B are fixed.
  • the block diagram of the neural VAD device operates in such a way that the voice signal 1 is initially filtered by a high-pass filter 2 having a cut-off frequency in the range of 130 to 170 Hz, preferably a frequency near to 150 Hz, in order to eliminate all low frequency noise components included in the concerned signal.
  • the concerned signal is subsequently transferred to a first, a second and a third processing unit 3, 4 and 5, respectively, which, for each frame, compute the energy Ef in the broad band 0 to 4 kHz, the energy El in the low band 0 to 1 kHz and the zero crossing rate ZCR.
  • the above three processing units 3, 4 and 5 perform said computation following the same mechanism as adopted in the ITU-T G.729 VAD device.
  • a fourth processing unit 6 receives the computed values Ef, El and ZCR and computes their respective average values, preferably adaptively, according to the process adopted in the ITU-T G.729 VAD device.
  • the computed values Ef and their average values are applied to a fifth processing unit 7 which computes the differential values ΔEf.
  • the computed values El and their average values are applied to a sixth processing unit 8 which computes the differential values ΔEl.
  • the computed values ZCR and their average values are applied to a seventh processing unit 9 which computes the differential values ΔZCR.
  • the computed differential values ΔEf, ΔEl and ΔZCR are applied to said neural network 10, which furnishes frame by frame an output value Y in the range of 0 to 1.
  • a comparison unit 11 compares the output values Y to a threshold value and furnishes a Boolean decision value or flag D classifying the signal as a voice activity signal or a non-voice activity signal.
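End to end, the per-frame flow through units 3 to 11 can be sketched as follows; the high-pass filter and the low-band filtering are outside the sketch, the neural network is passed in as a callable, and the 0.5 decision threshold matches the arithmetic mean of Ymin = 0 and Ymax = 1:

```python
def zero_crossing_rate(frame):
    """Fraction of adjacent sample pairs whose signs differ."""
    return sum(1 for a, b in zip(frame, frame[1:]) if a * b < 0) / (len(frame) - 1)

def vad_frame(frame, low_band, averages, network, alpha=0.95):
    """One frame through units 3-11.  `low_band` is the 0-1 kHz filtered
    frame (the filter itself is outside this sketch), `averages` is a
    mutable [Ef_bar, El_bar, ZCR_bar] list, `network` maps the three
    differentials to Y in [0, 1]; alpha is an assumed smoothing factor."""
    Ef = sum(s * s for s in frame)          # broad-band energy (unit 3)
    El = sum(s * s for s in low_band)       # low-band energy   (unit 4)
    zcr = zero_crossing_rate(frame)         # unit 5
    dEf, dEl, dZCR = Ef - averages[0], El - averages[1], zcr - averages[2]
    Y = network(dEf, dEl, dZCR)             # unit 10
    flag = 1 if Y > 0.5 else 0              # unit 11
    if flag == 0:                           # unit 6: adapt only on noise
        averages[0] = alpha * averages[0] + (1 - alpha) * Ef
        averages[1] = alpha * averages[1] + (1 - alpha) * El
        averages[2] = alpha * averages[2] + (1 - alpha) * zcr
    return flag, Y
```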
  • a post-processing unit can be provided to eliminate all errors caused by evaluating some voice segments as noise.
  • said post-processing unit is based upon the so-called "hangover mechanism".
  • the Inventor has devised an EPD device for detection of the word boundaries, in order to effect segmentation of isolated words, by adding a post-processing stage to the output of the neural network.
  • the output Y of the neural network 10 of the VAD device is processed in a smoothing unit 12, preferably implemented as a median filter of the seventh order, aimed at reducing any abrupt variation in the concerned signal.
  • a provisional marking unit 13 analyses the shape of the signal V emitted by said smoothing unit 12 on a frame-by-frame basis.
  • the value Vi relating to the i-th frame is compared to the value S1 of a fixed threshold: the provisional point P'I at which a word begins is coarsely established as the final end point of a window comprising N2 frames, in which the relation Vi > S1 applies to at least a pre-established number N1 of frames, where N2 > N1.
  • This criterion makes it possible to prevent noise peaks, which generally have a limited time duration, from being construed as words.
  • the provisional point P'F at which a word ends is coarsely established as the initial end point of a window comprising N4 frames, in which the relation Vi < S1 applies to at least a pre-established number N3 of frames, where N4 > N3.
  • the threshold value for establishing the coarse point P'F can be different from the threshold value utilised to establish the coarse point P'I, even if it is anyway in the range of 0.1 to 0.5 (in the general case, in the range of [Ymin + 0.1*(Ymax - Ymin)] to [Ymin + 0.5*(Ymax - Ymin)]).
  • Said trimming unit 14 utilises a further threshold S2 > S1 and analyses the sign of the first derivative of the signal V output by said smoothing unit 12, in order to ascertain any slope change.
  • the value of S2 is in the range of 0.5 to 0.9 (in the general case, in the range of [Ymin + 0.5*(Ymax - Ymin)] to [Ymin + 0.9*(Ymax - Ymin)]), and still more preferably S2 = 0.6 (in the general case, [Ymin + 0.6*(Ymax - Ymin)]).
  • the initial point PI is established, in a window immediately preceding the coarse initial end point P'I of the word and including a pre-established number NI of frames, as the point nearest to P'I where Vi < S2 or where the derivative of Vi changes its sign. If neither of said events occurs, then the point is taken as PI = P'I - NI frames.
  • NI is preferably equal to 10, in order to reduce the computation load needed in the voice recognition stage, particularly in real time applications.
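The trimming of the coarse start point can be sketched as follows; the symbols follow the text, while the default window length and threshold are placeholders:

```python
def trim_start(V, p_coarse, S2=0.6, NI=10):
    """Search the NI frames before the coarse start P'I for the point
    nearest to P'I where V drops below S2 or where the first derivative
    of V changes sign; fall back to P'I - NI when neither event occurs."""
    lo = max(0, p_coarse - NI)
    for i in range(p_coarse - 1, lo - 1, -1):       # walk backwards
        below = V[i] < S2
        slope_change = i > 0 and (V[i] - V[i - 1]) * (V[i + 1] - V[i]) < 0
        if below or slope_change:
            return i
    return lo
```

A mirrored forward search from P'F, with its own threshold, would trim the final end point in the same way.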
  • the threshold value to be utilised for establishing the final point PF can be different from the value utilised for establishing the initial point PI, even if it is still included in the range of 0.5 to 0.9 (in the general case, in the range of [Ymin + 0.5*(Ymax - Ymin)] to [Ymin + 0.9*(Ymax - Ymin)]).
  • a further final check step is carried out on the results of the performed recognition procedure: all words having a duration shorter than a minimum time interval Pmin, preferably not longer than 300 ms and still more preferably equal to 100 ms, as well as all words having a duration longer than a maximum time interval Pmax, preferably not shorter than 1 second and still more preferably equal to 2 seconds, are discarded.
  • Pmin minimum time interval
  • P max preferably not shorter than 1 second and still more preferably equal to 2 seconds
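With 10 ms frames, the final duration check amounts to discarding candidate words outside [Pmin, Pmax]; a sketch using the preferred 100 ms and 2 s values:

```python
def check_duration(words, frame_ms=10, p_min_ms=100, p_max_ms=2000):
    """Keep only candidate words whose duration, converted from frames
    to milliseconds, lies between p_min_ms and p_max_ms (bounds taken
    from the preferred values in the text)."""
    kept = []
    for start, end in words:
        dur_ms = (end - start + 1) * frame_ms
        if p_min_ms <= dur_ms <= p_max_ms:
            kept.append((start, end))
    return kept
```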
  • the concerned apparatus comprises a finite state machine aimed at distinguishing the various operating conditions of the EPD detector, as well as a number of storage or buffer registers that make it possible to store the results relating to the latest frames, for establishing said coarse marking and then said trimming stages.
  • This automatic apparatus is characterised by the following four states: A. initial or non-activity state, B. activity begin or false alarm state,
  • the above mentioned state machine starts from an initial state A and runs through its various states to establish the frames P'I and P'F connected with said coarse marking, while analysis of the buffer contents makes it possible to carry out said trimming step after the coarse initial and final points P'I and P'F, respectively, have been located.
  • the fourth processing unit 6 of the VAD section of the EPD detector carries out a simplified computation of the average values of the parameters Ef, El and ZCR.
  • the average value of each parameter is established by analysing an initial portion of the signal, preferably having a duration in the range of 100 to 500 ms. Still more preferably, only an initial portion of the signal having a net duration of 300 ms is considered, starting from the time at which the user switches on the apparatus (or the method is started). This is justified by the assumption that such initial portion of the signal represents the environmental noise that adds to the subsequently voiced word.
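With 10 ms frames, the simplified computation of the average values over an initial 300 ms noise-only portion can be sketched as:

```python
def initial_averages(frames, low_frames, n_init=30):
    """Estimate the averages of Ef, El and ZCR from the first n_init
    frames (30 frames of 10 ms = 300 ms), assumed to contain only the
    environmental noise preceding the voiced word."""
    def zcr(f):
        return sum(1 for a, b in zip(f, f[1:]) if a * b < 0) / (len(f) - 1)
    head, low_head = frames[:n_init], low_frames[:n_init]
    Ef_bar = sum(sum(s * s for s in f) for f in head) / len(head)
    El_bar = sum(sum(s * s for s in f) for f in low_head) / len(low_head)
    ZCR_bar = sum(zcr(f) for f in head) / len(head)
    return Ef_bar, El_bar, ZCR_bar
```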
  • a maximum time period Δt, preferably in the range of 2 to 6 seconds, can be provided between the time point at which the method or the apparatus is started and the beginning of the concerned word, such that, upon expiry of this time period, the method or the apparatus returns to the initial stage, thereby indicating that no voice command has been recorded and consequently recognised.
  • WTE Weighted Total Error
  • SFA Start Front Advance
  • SFC Start Front Clipping
  • EFC End Front Clipping
  • EFD End Front Delay, defined as the number of frames by which the final point of the automatic marking is delayed with respect to the manual one.
  • a further embodiment of the method for detection of voice activity according to this invention provides for a VAD device based upon the three parameters ΔEf, ΔEl and ΔZCR as well as on the second and fifth cepstral coefficients, c2 and c5, respectively.
  • said neural network 10 is trained by means of parameters extracted from the "clean" signal having "babble” noise, “car” noise, “traffic” noise and “white” noise, respectively, added thereto, with a SNR equal to 20 dB and 10 dB or by means of parameters exclusively extracted from the "clean" signal.
  • the method for segmentation of isolated words appears to be particularly robust in respect of environmental noise and has the further advantage of a reduced computational complexity, in view of the reduced number and simplicity of the parameters utilised therein, as well as of the simplicity of the post-processing and matching algorithms, which enables the end points of a word to be obtained reliably.
  • the EPD recognition method according to this invention utilises fixed thresholds and is of the so-called "forward" type, that is, it analyses the word only in the forward direction, so that it does not require long buffers to store the signal to be processed.


Abstract

The invention concerns a method for detecting voice activity in a voice signal, in particular for telephone applications, comprising the following steps: acquiring the voice signal (1) divided into segments or frames of duration d; computing, for each frame, at least three of the following five parameters: full-band energy difference ΔEf, energy difference ΔEl over the band between 0 and 1 kHz, zero-crossing-rate difference ΔZCR, second cepstral coefficient c2 and fifth cepstral coefficient c5; neural network processing to provide, for each frame and on the basis of at least three of the five parameters, an output value Y in the range defined by a minimum value Ymin and a maximum value Ymax, with Ymin < Ymax. The invention also concerns a voice activity detection apparatus for this method, a method for segmenting isolated words, or end-point detection method, comprising the steps of the voice activity detection method, and a corresponding apparatus for this end-point detection method.
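As a rough illustration of the parameter-extraction step summarised in the abstract, the sketch below computes two of the five per-frame quantities, the full-band energy difference ΔEf and the zero-crossing-rate difference ΔZCR, as differences against averages taken over an initial noise-only portion of the signal. The frame duration, the 300 ms initialisation window and all function names are assumptions made for illustration; the band-limited energy difference ΔEl, the cepstral coefficients c2 and c5, and the neural network producing the output value Y are omitted.

```python
import numpy as np

FRAME_MS = 10   # assumed frame duration d, in milliseconds
INIT_MS = 300   # assumed initial noise-only portion used for the averages

def frame_energy_db(frame):
    # Full-band energy of one frame, in dB (small offset avoids log(0))
    return 10.0 * np.log10(np.sum(frame.astype(float) ** 2) + 1e-12)

def zero_crossing_rate(frame):
    # Fraction of sample-to-sample sign changes within the frame
    return np.mean(np.abs(np.diff(np.signbit(frame).astype(int))))

def differential_parameters(signal, fs):
    """Return (ΔEf, ΔZCR) for each frame of `signal` sampled at `fs` Hz."""
    n = int(fs * FRAME_MS / 1000)
    frames = [signal[i:i + n] for i in range(0, len(signal) - n + 1, n)]
    n_init = max(1, INIT_MS // FRAME_MS)
    # Averages over the initial portion, assumed to be environmental noise
    e_avg = np.mean([frame_energy_db(f) for f in frames[:n_init]])
    z_avg = np.mean([zero_crossing_rate(f) for f in frames[:n_init]])
    # Per-frame differences with respect to the noise averages
    return [(frame_energy_db(f) - e_avg, zero_crossing_rate(f) - z_avg)
            for f in frames]
```

On a signal made of 0.5 s of low-level noise followed by a louder tone, the tone frames show a large positive ΔEf while the initial noise frames cluster around zero, which is the behaviour the detector exploits.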
PCT/IT2001/000221 2000-05-10 2001-05-08 Detection d'activite vocale et d'extremite de mot WO2001086633A1 (fr)

Priority Applications (1)

Application Number Priority Date Filing Date Title
AU58752/01A AU5875201A (en) 2000-05-10 2001-05-08 Voice activity detection and end-point detection

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
ITRM2000A000248 2000-05-10
IT2000RM000248A IT1315917B1 (it) 2000-05-10 2000-05-10 Metodo di rivelazione di attivita' vocale e metodo per la segmentazione di parole isolate, e relativi apparati.

Publications (1)

Publication Number Publication Date
WO2001086633A1 (fr) 2001-11-15

Family

ID=11454720

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/IT2001/000221 WO2001086633A1 (fr) 2000-05-10 2001-05-08 Detection d'activite vocale et d'extremite de mot

Country Status (3)

Country Link
AU (1) AU5875201A (fr)
IT (1) IT1315917B1 (fr)
WO (1) WO2001086633A1 (fr)


Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105719642A (zh) * 2016-02-29 2016-06-29 黄博 连续长语音识别方法及系统、硬件设备

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
BERITELLI F: "A robust endpoint detector based on differential parameters and fuzzy pattern recognition", PROCEEDINGS OF ICSP'98: FOURTH INTERNATIONAL CONFERENCE ON SIGNAL PROCESSING, BEIJING, CHINA, vol. 1, 12 October 1998 (1998-10-12) - 16 October 1998 (1998-10-16), IEEE, Piscataway, NJ, USA, pages 601 - 604, XP002173614, ISBN: 0-7803-4325-5 *
GHISELLI-CRIPPA T ET AL: "A FAST NEURAL NET TRAINING ALGORITHM AND ITS APPLICATION TO VOICED-UNVOICED-SILENCE CLASSIFICATION OF SPEECH", IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH & SIGNAL PROCESSING (ICASSP '91), 14 May 1991 (1991-05-14), IEEE, New York, NY, USA, pages 441 - 444, XP000245262, ISBN: 0-7803-0003-3 *

Cited By (40)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP1504440A2 (fr) * 2002-05-14 2005-02-09 Thinkengine Networks Inc. Detection d'activite vocale
EP1504440A4 (fr) * 2002-05-14 2006-02-08 Thinkengine Networks Inc Detection d'activite vocale
WO2005119649A1 (fr) * 2004-05-25 2005-12-15 Nokia Corporation Systeme et procede de detection de murmures confus
US8788265B2 (en) 2004-05-25 2014-07-22 Nokia Solutions And Networks Oy System and method for babble noise detection
US8468131B2 (en) 2006-06-29 2013-06-18 Avaya Canada Corp. Connecting devices in a peer-to-peer network with a service provider
US8218529B2 (en) 2006-07-07 2012-07-10 Avaya Canada Corp. Device for and method of terminating a VoIP call
US7680657B2 (en) 2006-08-15 2010-03-16 Microsoft Corporation Auto segmentation based partitioning and clustering approach to robust endpointing
US8843369B1 (en) 2013-12-27 2014-09-23 Google Inc. Speech endpointing based on voice profile
US11004441B2 (en) 2014-04-23 2021-05-11 Google Llc Speech endpointing based on word comparisons
US10140975B2 (en) 2014-04-23 2018-11-27 Google Llc Speech endpointing based on word comparisons
US9607613B2 (en) 2014-04-23 2017-03-28 Google Inc. Speech endpointing based on word comparisons
US11636846B2 (en) 2014-04-23 2023-04-25 Google Llc Speech endpointing based on word comparisons
US10546576B2 (en) 2014-04-23 2020-01-28 Google Llc Speech endpointing based on word comparisons
WO2016008311A1 (fr) * 2014-07-18 2016-01-21 华为技术有限公司 Procédé et dispositif pour détecter un signal audio selon une énergie de domaine fréquentiel
US10339956B2 (en) 2014-07-18 2019-07-02 Huawei Technologies Co., Ltd. Method and apparatus for detecting audio signal according to frequency domain energy
JP2017530409A (ja) * 2014-09-26 2017-10-12 サイファ,エルエルシー ランニング範囲正規化を利用したニューラルネットワーク音声活動検出
EP3198592A4 (fr) * 2014-09-26 2018-05-16 Cypher, LLC Détection d'activité vocale de réseau neuronal employant une normalisation de plage d'exécution
US11062696B2 (en) 2015-10-19 2021-07-13 Google Llc Speech endpointing
US10269341B2 (en) 2015-10-19 2019-04-23 Google Llc Speech endpointing
US11710477B2 (en) 2015-10-19 2023-07-25 Google Llc Speech endpointing
CN106486136A (zh) * 2016-11-18 2017-03-08 腾讯科技(深圳)有限公司 一种声音识别方法、装置及语音交互方法
CN106611598A (zh) * 2016-12-28 2017-05-03 上海智臻智能网络科技股份有限公司 一种vad动态参数调整方法和装置
CN106611598B (zh) * 2016-12-28 2019-08-02 上海智臻智能网络科技股份有限公司 一种vad动态参数调整方法和装置
CN108428448A (zh) * 2017-02-13 2018-08-21 芋头科技(杭州)有限公司 一种语音端点检测方法及语音识别方法
US10593352B2 (en) 2017-06-06 2020-03-17 Google Llc End of query detection
US11551709B2 (en) 2017-06-06 2023-01-10 Google Llc End of query detection
US10929754B2 (en) 2017-06-06 2021-02-23 Google Llc Unified endpointer using multitask and multidomain learning
US11676625B2 (en) 2017-06-06 2023-06-13 Google Llc Unified endpointer using multitask and multidomain learning
CN108648769A (zh) * 2018-04-20 2018-10-12 百度在线网络技术(北京)有限公司 语音活性检测方法、装置及设备
US20190385636A1 (en) * 2018-06-13 2019-12-19 Baidu Online Network Technology (Beijing) Co., Ltd. Voice activity detection method and apparatus
US10937448B2 (en) 2018-06-13 2021-03-02 Baidu Online Network Technology (Beijing) Co., Ltd. Voice activity detection method and apparatus
CN108877778B (zh) * 2018-06-13 2019-09-17 百度在线网络技术(北京)有限公司 语音端点检测方法及设备
CN108877778A (zh) * 2018-06-13 2018-11-23 百度在线网络技术(北京)有限公司 语音端点检测方法及设备
WO2020192009A1 (fr) * 2019-03-25 2020-10-01 平安科技(深圳)有限公司 Procédé de détection de silence reposant sur un réseau neuronal, et dispositif terminal et support
CN111028858B (zh) * 2019-12-31 2022-02-18 云知声智能科技股份有限公司 一种人声起止时间检测方法及装置
CN111028858A (zh) * 2019-12-31 2020-04-17 云知声智能科技股份有限公司 一种人声起止时间检测方法及装置
CN112652296A (zh) * 2020-12-23 2021-04-13 北京华宇信息技术有限公司 流式语音端点检测方法、装置及设备
CN112652296B (zh) * 2020-12-23 2023-07-04 北京华宇信息技术有限公司 流式语音端点检测方法、装置及设备
CN113284496B (zh) * 2021-07-22 2021-10-12 广州小鹏汽车科技有限公司 语音控制方法、语音控制系统、车辆、服务器和存储介质
CN113284496A (zh) * 2021-07-22 2021-08-20 广州小鹏汽车科技有限公司 语音控制方法、语音控制系统、车辆、服务器和存储介质

Also Published As

Publication number Publication date
IT1315917B1 (it) 2003-03-26
ITRM20000248A0 (it) 2000-05-10
ITRM20000248A1 (it) 2001-11-10
AU5875201A (en) 2001-11-20

Similar Documents

Publication Publication Date Title
WO2001086633A1 (fr) Detection d'activite vocale et d'extremite de mot
EP1766615B1 (fr) Systeme et procede pour extension de largeur de bande artificielle amelioree
US8554560B2 (en) Voice activity detection
JP3197155B2 (ja) ディジタル音声コーダにおける音声信号ピッチ周期の推定および分類のための方法および装置
AU763409B2 (en) Complex signal activity detection for improved speech/noise classification of an audio signal
EP1159732B1 Recherche de point final d'un discours parle dans un signal bruyant
KR100653932B1 (ko) 음성부호화 방법, 음성부호화 장치와 그를 포함하는 이동통신기기 및 그를 포함하는 셀 방식 전화네트워크
EP0548054B1 Dispositif de détection de la présence d'un signal de parole
US5812965A (en) Process and device for creating comfort noise in a digital speech transmission system
US6993481B2 (en) Detection of speech activity using feature model adaptation
US8050415B2 (en) Method and apparatus for detecting audio signals
US5937375A (en) Voice-presence/absence discriminator having highly reliable lead portion detection
US7359856B2 (en) Speech detection system in an audio signal in noisy surrounding
JPH08505715A (ja) 定常的信号と非定常的信号との識別
EP1312075B1 (fr) Procede de classification robuste avec bruit en codage vocal
EP0653091B1 (fr) Discrimination entre des signaux stationnaires et non stationnaires
KR100925256B1 (ko) 음성 및 음악을 실시간으로 분류하는 방법
US7254532B2 (en) Method for making a voice activity decision
JPH0449952B2 (fr)
Glotin et al. Test of several external posterior weighting functions for multiband full combination ASR
Gilg et al. Methodology for the design of a robust voice activity detector for speech enhancement
Martin et al. Robust speech/non-speech detection using LDA applied to MFCC for continuous speech recognition
Karpov et al. Combining Voice Activity Detection Algorithms by Decision Fusion
NZ286953A (en) Speech encoder/decoder: discriminating between speech and background sound

Legal Events

Date Code Title Description
AK Designated states

Kind code of ref document: A1

Designated state(s): AE AG AL AM AT AU AZ BA BB BG BR BY BZ CA CH CN CO CR CU CZ DE DK DM DZ EE ES FI GB GD GE GH GM HR HU ID IL IN IS JP KE KG KP KR KZ LC LK LR LS LT LU LV MA MD MG MK MN MW MX MZ NO NZ PL PT RO RU SD SE SG SI SK SL TJ TM TR TT TZ UA UG US UZ VN YU ZA ZW

AL Designated countries for regional patents

Kind code of ref document: A1

Designated state(s): GH GM KE LS MW MZ SD SL SZ TZ UG ZW AM AZ BY KG KZ MD RU TJ TM AT BE CH CY DE DK ES FI FR GB GR IE IT LU MC NL PT SE TR BF BJ CF CG CI CM GA GN GW ML MR NE SN TD TG

DFPE Request for preliminary examination filed prior to expiration of 19th month from priority date (pct application filed before 20040101)
121 Ep: the epo has been informed by wipo that ep was designated in this application
REG Reference to national code

Ref country code: DE

Ref legal event code: 8642

122 Ep: pct application non-entry in european phase
NENP Non-entry into the national phase

Ref country code: JP