CA1127764A

CA1127764A - Speech recognition system

Info

Publication number: CA1127764A
Application number: CA318,538A
Authority: CA
Inventors: Hiroaki Sakoe
Original assignee: Nippon Electric Co Ltd
Current assignee: NEC Corp
Priority date: 1977-12-27
Filing date: 1978-12-22
Publication date: 1982-07-13

Abstract

In a speech recognition system for programming or commanding computers and other machines, there is described a recognition system which is less susceptible to breathing and other background noises. Speech-like signal durations are detected from input signal waves from all sources. The waveforms of the input signals are analyzed and recognition parameters are extracted. A spectral-change detecting unit detects the magnitude of short-time spectral changes from the speech-like signal durations, determining thereby whether the speech-like signal is speech. A recognition unit recognizes speech patterns from signals within the speech-like signal durations and rejects those signals which the spectral-change detecting unit indicates do not contain speech. The parameter extracting means and the spectral-change detecting unit receive controlling signals from a control unit.

Description

This invention relates to an improvement in a speech recognition system.
Speech recognition systems are useful not only as data input means for computers but also as command means for various machines. They are actually used, for instance, as a means to feed routing information into automatic package sorting machines or as a means to feed inspection data into computers used in automobile factories and elsewhere, as described by Thomas B. Martin in his article entitled "Practical Applications of Voice Input Machines", published in the Proceedings of the IEEE, Vol. 64, No. 4, April issue, 1976, pp. 487-501 (Reference 1).
One of the advantages resulting from the use of a speech recognition system is that the operator can input information while simultaneously doing something else with his hands and/or feet. If a wireless microphone is used, he can operate the system even while walking around. These advantages, which are not possible with manually operated conventional data input means, such as typewriters, are unique to speech recognition systems.
However, if one tries to orally input data while doing something else, breathing noises inevitably become louder and begin to adversely affect the function of the speech recognition system. In a speech recognition system, speech signal durations are detected by monitoring the amplitude levels of a sound signal which is picked up and converted into electrical signals by a microphone. Durations of sound signals where the amplitude levels are higher than a predetermined threshold value are detected as speech signal durations to be recognized and treated as such. However, when breathing noise is so loud that its amplitude level exceeds the threshold value, it will be detected as a speech signal duration. Accordingly, it is desirable that breathing noises erroneously detected be rejected as not being a real speech signal. There are also instances where other types of noise, including common background noises, are mistaken for real speech signals having some meaning (hereinafter referred to simply as "speech signal" or "speech"), and such noise signals may also cause faulty operation of the speech recognition system.
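The amplitude-threshold detection described above can be illustrated with a minimal sketch. The function name `detect_sls_durations`, the per-frame amplitude list and the numeric threshold are hypothetical and chosen only for illustration; the sketch merely shows why a loud breath is flagged exactly like a word when amplitude is the only criterion.

```python
def detect_sls_durations(amplitudes, threshold):
    """Return (start, end) frame index pairs where the amplitude exceeds threshold.

    `end` is exclusive. Hypothetical helper: the decision is purely
    amplitude-based, so loud breathing noise is flagged just like speech.
    """
    durations = []
    start = None
    for i, level in enumerate(amplitudes):
        if level > threshold and start is None:
            start = i                      # rising edge: a duration begins
        elif level <= threshold and start is not None:
            durations.append((start, i))   # falling edge: the duration ends
            start = None
    if start is not None:                  # still above threshold at the end
        durations.append((start, len(amplitudes)))
    return durations


# Frames 1-2 could be a word and frames 4-5 a loud breath; both are detected.
print(detect_sls_durations([0.1, 0.9, 0.8, 0.1, 0.6, 0.7, 0.1], 0.5))
```

Nothing in the returned intervals distinguishes speech from noise, which is the gap the embodiments below are meant to close.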
An object of the present invention is to provide a speech recognition system which is less vulnerable to breathing or background noise.
The present system comprises a speech detecting unit for detecting speech-like signal durations from input signal waves; means for analyzing the waveforms of said input signal waves to extract recognition parameters; a spectral-change detecting unit for detecting the magnitude of short-time spectral changes in said speech-like signal durations and deciding whether or not the integrated value of a short-time spectral change exceeds a threshold value to thereby determine whether or not each speech-like signal duration is speech and to deliver its output signal if the threshold value is exceeded; a recognition unit which recognizes on the basis of said recognition parameters speech patterns from signals fed from within said speech-like signal durations and which rejects some recognition results on the basis of the output signal received from said spectral-change detecting unit which indicates that said speech-like signal duration does not include speech; and a control unit for supplying control signals to said parameter extracting means and spectral-change detecting unit.
The present system alternatively comprises a speech detecting unit for detecting speech-like signal durations from input signal waves; means for analyzing the waveforms of said input signal waves to extract recognition parameters; a voiced-speech duration detecting unit for determining whether or not voiced-speech durations are present in each speech-like signal duration and to thereby deliver the determination results as confirmation signals; and a recognition unit which recognizes, on the basis of said recognition parameters, speech patterns from signals input within said speech-like signal durations and which rejects such recognition result on the basis of the confirmation signal received from said voiced-speech duration detecting unit.
The present invention will be described in detail in conjunction with the accompanying drawing in which:
Figure 1 is a block diagram of a first embodiment of this invention;
Figure 2 illustrates in further detail a part of the first embodiment;
Figure 3 is a time chart illustrating the operation of the circuit of Figure 2;
Figures 4 and 5 are block diagrams of other examples of a spectral-change detecting unit of Figure 1;
Figure 6 is a graphic representation of the spectrum pattern of voiced speech;
Figure 7 is a graphic representation of a typical short-time autocorrelation function of voiced speech;
Figure 8 is a block diagram of a second embodiment of this invention;
Figure 9 shows in detail a block diagram of a voiced-speech duration detecting unit;
Figure 10 is a waveform illustrating the operation of the peak detecting unit of Figure 9;
Figures 11 (a) and (b) are a series of waveforms illustrating the recognizing operation of the second embodiment;
Figure 12 is a more detailed drawing of the peak detecting unit shown in Figure 9; and
Figures 13 and 14 are flow charts of the operation of the recognition unit used in the first and second embodiments.
A first embodiment of this invention will now be described, which uses spectrum information as a reference for determining whether a given input signal is speech or noise.
Referring to Figure 1, the present system is composed of a speech detecting unit 20 for detecting speech-like signal (referred to as "SLS" hereunder) durations from input signal waves, means 10 for analyzing the waveforms of said input signal waves to extract recognition parameters, a recognition unit 30 for recognizing speech signals in the detected SLS durations, means 40 for determining whether or not each of the SLS durations is a speech signal duration by examining the variation of short-time spectrum information in said SLS duration and by supplying the determination result to said recognition unit, and a pulse generator 5 for supplying control signals to said means 10 and 40.
Short-time spectrum information mentioned above is not limited to what short-time spectrum is usually considered to include, but also includes such parameters as short-time autocorrelation coefficients, short-time spectrum or linear predictive coefficients which are similar to short-time spectrum coefficients, and other parameters such as formant frequencies which are closely related to short-time spectrum.
Meaningful words such as geographical or personal names are combinations of consonants and vowels, and their short-time spectra usually manifest significant temporal changes. In contrast, breathing noises are generated by the friction of air within the respiratory system, and the short-time spectral change of their duration is comparatively small. Background noises including the sounds of motors, welding or winds have relatively constant spectra. Therefore, the adverse effect of breathing and background noises can be effectively eliminated by the present speech recognition system which comprises built-in means to determine short-time spectral changes.
Returning to Figure 1, an input analog sound signal s is subjected to, for instance, 10-channel short-time frequency analysis by the spectrum analyzer 10. An example of such an analyzer would be one composed of 10 band pass filters, 10 rectifiers, 10 smoothing filter circuits, a multiplexer and an analog-to-digital converter so that the spectrum envelope of the input signal s can be described in 10 digital parameters. Now, the result of short-time spectral analysis at the point of time i is represented by a 10-dimensional vector
$$a_i = (a_{1i}, a_{2i}, \ldots, a_{ni}, \ldots, a_{10i}) \tag{1}$$
Digital vector signals like Equation (1) are supplied by the spectrum analyzer 10 at predetermined frame intervals (of 10 milliseconds, for instance). The analyzer 10 may be the one illustrated in Figure 1 of the article entitled "Real-Time Recognition of Spoken Words" by Louis C.W. Pols, published in IEEE Transactions on Computers, Vol. C-20, No. 9, September issue, 1971 (Reference 2).
Thus, the result of the spectral analysis is multiplexed by said multiplexer synchronously with a frame sync pulse φ1 generated by the control unit 5 of Figure 1, and is converted into a digital signal by the analog-to-digital converter. Incidentally, in every drawing hereinafter referred to, thick lines represent the paths for 12-bit parallel binary signals and thin lines represent either analog signals or one-bit binary signals. Signal paths and signals may at times be represented by the same terms.
The analog input signal s is fed to the speech detecting unit 20, which is of the type described in U.S. Patent 3,712,959 which issued to Ettore Fariello on January 23, 1973 (Reference 3). The unit 20 picks out as an SLS duration a time segment that can be regarded as a sound signal by determining the amplitude, zero-crossing and other characteristics of the input signal s. The digital output signal p of the unit 20 is supposed to have a value "1" within the SLS duration and a value "0" otherwise. The time indicator i is hereinafter counted on the premise that it becomes equal to "1" at the starting point of the speech-like signal duration. As a result, signals in the SLS duration are a time series of the vector $a_i$ represented by:
$$A = a_1, a_2, \ldots, a_i, \ldots, a_I \tag{2}$$
where I stands for the length of this SLS duration. Hereinafter, the signals represented by Equation (2) will be called the input pattern A.
The recognition unit 30 functions to recognize the signals in the SLS duration designated by the signal p from the detecting unit 20, i.e., the input pattern A of Equation (2), and determines the word represented by the pattern. Though many different recognition principles have been proposed for this recognition unit 30, any of them is applicable to this invention. One conceivable example is the known pattern matching method. By this method, a set of words to be recognized is determined in advance, and the individual words, described in suitable parameters, are stored as reference patterns. As an SLS duration is detected and input, it is described in the same parameters, thereby forming an input pattern. This input pattern is subjected to a pattern matching operation, i.e., compared with said reference patterns, and the reference pattern closest to the input pattern is selected so that the input pattern can be identified with the word represented by the selected reference pattern. The recognition result is given as an output signal n. Said recognition unit 30 can be composed in the same way as the MINICOMPUTER section in Figure 5 on page 492 of Reference 1.
The spectral-change detecting unit 40 calculates the amount of temporal change in the short-time spectrum signal supplied by the spectrum analyzer 10, i.e., the vector $a_i$ in Equation (2). If the total transition in the SLS duration, i.e., while the value of i varies from 1 to I, is greater than a predetermined threshold value, this SLS duration is regarded as speech and a detection signal q is generated.
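In symbols, the decision made by the spectral-change detecting unit 40 might be restated as follows. The notation is an assumption introduced here for illustration: $d_i$ denotes the unit-time transition at frame i, D the total transition over the SLS duration, and $\theta$ the predetermined threshold; the norm stands for whichever vector distance is adopted.

```latex
d_i = \lVert a_i - a_{i-1} \rVert , \qquad
D = \sum_{i=2}^{I} d_i , \qquad
q = \begin{cases} 1 & \text{if } D > \theta \\ 0 & \text{otherwise} \end{cases}
```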
Next, the details of the spectral-change detecting unit 40 are described referring to a time chart shown in Figure 3. The vector $a_i$ output from the spectrum analyzer 10 synchronously with the frame sync pulse φ1 is stored in a first register 41. Simultaneously, the vector $a_{i-1}$ which had been stored in the first register 41 until immediately before is shifted to a second register 42. A vector-to-vector distance calculating unit 43 calculates, as unit-time transition, the distance between the vector $a_i$ stored by the first register 41 and the vector $a_{i-1}$ stored by the second register 42, and gives a resultant signal d as output. Though different definitions of the distance may be given, the Euclidean distance is used herein. This unit-time transition d is integrated by the integrator 44 synchronously with a pulse signal φ2 which has the same frequency as said frame sync pulse φ1 and is behind it in phase.
As soon as the signal p of said speech detecting unit 20 has changed from "0" to "1", i.e., at the starting time of an SLS duration, a pulse generating circuit 46 generates a starting pulse p1. The integrator 44 is reset to "0" in response to the starting pulse p1. Next, the distance d between the vectors $a_i$ and $a_{i-1}$ in the SLS duration, i.e., the short-time spectral change, is integrated by the integrator 44. When the SLS duration has been terminated, i.e., when said signal p has changed from "1" to "0", the pulse generating circuit 46 generates an ending pulse p2. The integrated value D retained by said integrator 44 is compared at a comparator circuit 45 with a predetermined threshold value θ. It is supposed that when D is greater than θ, the output signal k takes on the value "1", and when D is smaller than or equal to θ, k is held at "0".
The logical product of this signal k and said ending pulse p2 is calculated at an AND gate 47, and is supplied as the detection signal q to the minicomputer unit built into the recognition unit 30. Incorporated into the minicomputer unit included within recognition unit 30 is a program which functions to output the recognition result signal n when triggered by the input of the detection signal q, as indicated in Figure 13. The program also functions to reject the recognition result when no pulse q is input, thereby preventing the output of a signal n.
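The register, distance, integrator and comparator chain described above can be mimicked in software. The following is only a sketch under stated assumptions: the function names are hypothetical, frames are plain lists of numbers standing in for the 10-dimensional vectors, and the threshold value in the usage lines is arbitrary.

```python
import math


def euclidean(u, v):
    """Euclidean distance between two equal-length vectors (role of unit 43)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(u, v)))


def integrated_spectral_change(frames):
    """Sum of frame-to-frame distances over one SLS duration (role of integrator 44)."""
    total = 0.0
    previous = None                  # plays the role of the second register 42
    for frame in frames:             # `frame` plays the role of the first register 41
        if previous is not None:
            total += euclidean(frame, previous)
        previous = frame
    return total


def is_speech(frames, threshold):
    """Comparator 45: accept the duration only if the total change exceeds threshold."""
    return integrated_spectral_change(frames) > threshold


steady = [[1.0, 1.0], [1.0, 1.0], [1.0, 1.0]]      # noise-like, constant spectrum
changing = [[0.0, 0.0], [3.0, 4.0], [0.0, 0.0]]    # speech-like, moving spectrum
print(is_speech(steady, 5.0), is_speech(changing, 5.0))
```

In the hardware the result k is gated by the ending pulse before reaching the recognition unit; here the function simply returns the comparison once the whole duration has been seen.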
Thus, a speech recognition system which is non-responsive to breathing or background noise is achieved based on the short-time spectral change in the SLS duration.
In the above-mentioned embodiment, since the spectral change is evaluated in terms of the time-integrated value of the unit-time transition d, the spectral change tends to increase with the length of the SLS duration. The short-time spectrum of breathing or background noise is always changing, though only slightly, on a unit-time basis. Therefore, if breathing or background noise continues for a long time, the integrated value of its unit-time transitions d tends to become greater and the noise may be mistaken for speech.
Referring to Figure 4 which illustrates another structure of the spectral-change detecting unit free from said shortcoming, the first register 41, second register 42, vector-to-vector distance calculating unit 43, integrator 44, comparator circuit 45, pulse generating circuit 46 and AND gate 47 function similarly to their respective counterparts in Figure 2. An additional counter 48 is reset by the starting pulse p1 output from said pulse generating circuit 46, and then counts the frame sync pulses φ1. As a result, at the time when an SLS duration is terminated and the ending pulse p2 is generated, a quantity ℓ, proportional to the time length of the SLS duration, is stored in the counter 48. The integrated value D of the unit-time transitions d stored in the integrator 44 is divided by the time length signal ℓ counted by the counter 48. The average transition D' thus obtained is supplied to the comparator circuit 45 to be compared with the threshold value θ.
In this way, a spectral change averaged over a period of time is used as the basis of distinction, thereby preventing continuing breathing or background noises from being mistaken for speech. Incidentally, said calculating unit 43 is composed of a subtractor to obtain the difference between the input vectors $a_i$ and $a_{i-1}$ and an integrator to integrate the absolute value of this difference.
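The effect of normalizing by the duration can be seen in a small worked example. The symbol $\ell$ for the frame count and all numbers below are illustrative assumptions, not values taken from the disclosure: a long stretch of nearly constant noise is compared with a short word.

```latex
D' = \frac{D}{\ell}
\qquad
\begin{aligned}
\text{noise:}\quad & \ell = 500,\ d \approx 0.2 &&\Rightarrow\ D \approx 100,\quad D' \approx 0.2 \\
\text{word:}\quad  & \ell = 50,\ \ d \approx 1.5 &&\Rightarrow\ D \approx 75,\ \quad D' \approx 1.5
\end{aligned}
```

With these assumed figures the integrated value D ranks the noise above the word, whereas the average D' separates them clearly in the right direction.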
However, if spectral changes are detected in terms of the average of unit-time transitions over a long period of time, changes which are relatively great in particular parts are averaged with others, resulting in a failure of the speech detection. For example, although the spectrum changes relatively significantly between /i:/ and /z/ in the word "ease" [i:z], the overall spectral change is not so great because the major part of the speech duration consists of the sustained vowel /i:/. Consequently, the speech may be rejected.
Figure 5 shows another spectral-change detecting unit 40 designed to overcome this problem. In the figure, the structural elements 41 through 49 are the same as their respective counterparts in Figure 4. The unit-time spectral transition d is given to an additional comparator circuit 50 to be compared with a predetermined threshold value. An output signal of the circuit 50 is "1" whenever the transition d is greater than this threshold value; otherwise it is "0". A set-reset type flip-flop 51 is reset at the beginning of an SLS duration by a starting pulse p1 generated by the pulse generating circuit 46. Whenever said transition d exceeds the threshold value, the output of the circuit 50 becomes "1" and consequently the flip-flop 51 is set, giving an output signal k' of "1". This signal k' is led by way of an OR circuit 52 to the AND gate 47. Therefore, if in an SLS duration there is even one point of time where the spectral change is great, a detection signal q is generated. Thus, spectral change detection with high sensitivity can be achieved by the use of the circuit of Figure 5.
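A minimal software sketch of the Figure 5 decision, assuming that the averaged comparison of Figure 4 and the per-frame peak comparison are combined by the OR circuit 52. The function name, both threshold values and the sample transitions are hypothetical.

```python
def detect_speech(transitions, avg_threshold, peak_threshold):
    """Return True if an SLS duration should be accepted as speech.

    transitions: unit-time transitions d for the frames of one SLS duration.
    avg_threshold: threshold for the average transition (comparator 45).
    peak_threshold: threshold for any single transition (comparator 50).
    """
    if not transitions:
        return False
    peak_flag = False                    # flip-flop 51, reset at the start pulse
    total = 0.0                          # integrator 44
    for d in transitions:
        total += d
        if d > peak_threshold:           # comparator 50 sets the flip-flop
            peak_flag = True
    average = total / len(transitions)   # division by the count of counter 48
    avg_flag = average > avg_threshold   # output k of comparator 45
    return avg_flag or peak_flag         # OR circuit 52


# A long vowel with one sharp transition, as in "ease": the average is low,
# but the single large transition is enough to accept the duration.
ease_like = [0.1] * 20 + [2.0] + [0.1] * 5
noise_like = [0.1] * 26
print(detect_speech(ease_like, 0.5, 1.0), detect_speech(noise_like, 0.5, 1.0))
```

The peak flag latches for the rest of the duration, which mirrors the set-reset behaviour of the flip-flop rather than re-evaluating each frame independently.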
The above description of the present invention with reference to an embodiment thereof is not intended to limit the applicable range of this invention. In particular, it is possible to improve the detecting performance by inserting a low pass filter having a suitable time constant between the vector-to-vector distance calculating unit 43 and the threshold value circuit 50 of Figure 5. Also, it is obvious that a combination of the spectral-change detecting units of Figures 2 and 5 can be effectively used to comprise this invention. Although the examples of Figures 3 through 5 are composed of digital circuits, analog circuits can as well be used to achieve the same function.
A second embodiment of the present invention will now be described which distinguishes between speech and noise by determining whether or not an SLS duration detected by the speech detecting unit 20 includes a voiced component. Voiced sounds here refer to sounds generated by excitation of the vocal tract by the oscillating wave of the vocal cords, and the usual vowels and nasal phonemes are all voiced sounds. In contrast, unvoiced sounds are excited by the friction or plosion of air flow in the vocal tract and do not accompany the vibration of the vocal cords.
Because all meaningful words, such as numerals, geographical names and personal names, contain vowels, any usual vocal signal contains a voiced duration. On the other hand, breathing noise generated by the friction of air flow in the mouth or nostrils is essentially unvoiced. Accordingly, by detecting the presence or absence of a voiced speech duration in an SLS duration, it is possible to determine whether the duration is speech or breathing noise. The spectrum of a voiced sound excited by the vibration of the vocal cords has a harmonic structure. As illustrated in Figure 6, the structure has the frequency of vocal cord vibration as its fundamental frequency (known as pitch frequency). In connection with this fact, the short-time autocorrelation function of a voiced sound, as indicated in Figure 7, has a relatively high peak corresponding to the period of vocal cord vibration, or the pitch period. The spectrum of usual indoor noise, in which correlation is almost absent, has no harmonic structure, and no conspicuous peak is observed in its autocorrelation function. Thus, non-correlated background noise closely resembles unvoiced sounds, and accordingly, can be distinguished from meaningful sounds. Background noise does sometimes include sounds having a harmonic structure such as those resulting from the revolution of a motor. Such components of background noise can be distinguished from speech to some extent. The pitch frequencies of normal human voices are known to be within the range of 100 Hz to 350 Hz. Therefore, if the pitch frequency of an input signal lies outside of this range, it is different from a voiced sound in the usual sense of the term.
Figure 8 is a block diagram of a second embodiment of this invention based on the above-explained principle. In Figure 8, the same reference numerals represent the same structural elements as in Figure 1, respectively. Segments of the analog input wave s whose amplitude levels are higher than a predetermined threshold level are detected by the speech detecting unit 20 as SLS durations. For each of these SLS durations, the digital detection signal p is set to "1", and is reset to "0" in response to the termination of the duration.
The recognition system 2, which includes the spectrum analyzer 10 and recognition unit 30, recognizes the SLS durations in said signal wave s, i.e., the segments where the detection signal p is "1", and determines the recognition results. A voiced-speech duration detecting unit 3 detects the presence or absence of a voiced sound in each of the SLS durations in the signal wave s, i.e., the segments where the detection signal p is "1", and gives a confirmation signal q' of "1" or "0" corresponding to the presence or absence of a voiced signal.
Figure 9 is a block diagram of the detecting unit 3 based on the autocorrelation method, one of various detecting methods. Immediately after the detection signal p of the detecting unit 20 has risen from "0" to "1", a flip-flop 34 is reset to "0". The input signal wave s is low-passed by an analog low-pass filter 31 having a cut-off frequency of 350 Hz, and supplied to an autocorrelator 32. This autocorrelator 32 may be composed of the analyzer illustrated in Figure 8 of the article by M.R. Schroeder entitled "Vocoders: Analysis and Synthesis of Speech", published in the Proceedings of the IEEE, Vol. 54, No. 5 (May issue, 1966) (Reference 4). This autocorrelator 32 calculates the short-time autocorrelation function of the input signal. The short-time autocorrelation function is defined as follows:
$$\psi(t, \tau) = \int_{t-\Delta}^{t} s(t')\, s(t' - \tau)\, dt' \tag{3}$$
where s(t) represents the value of the input signal wave at the time t; $\Delta$, the integrated length of time; and $\tau$, the delay time. Since the pitch frequency normally ranges between 100 Hz (hertz) and 350 Hz, the pitch period is from 2.9 to 10 milliseconds. Accordingly, the autocorrelation function of Equation (3) is calculated with respect to the delay time $\tau$ within the range of
$$2.9 \le \tau \le 10 \ \text{(milliseconds)} \tag{4}$$
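A discrete-time sketch of the search over Equations (3) and (4) is given below. Several choices are assumptions made for illustration only: the signal is sampled rather than analog, the autocorrelation is normalized by the frame energy so that a dimensionless threshold can be used (the disclosure compares the unnormalized value with a fixed threshold), and the names `short_time_autocorr`, `has_pitch_peak` and `threshold_ratio` are hypothetical.

```python
import math


def short_time_autocorr(samples, lag):
    """Discrete counterpart of Equation (3): sum of s[n] * s[n - lag] over the frame."""
    return sum(samples[n] * samples[n - lag] for n in range(lag, len(samples)))


def has_pitch_peak(samples, sample_rate, threshold_ratio=0.5):
    """Return True if the autocorrelation peaks at a lag inside the pitch range.

    Lags are limited to the 100 Hz - 350 Hz pitch range of Equation (4).
    Normalizing by the zero-lag energy is an assumption of this sketch.
    """
    energy = short_time_autocorr(samples, 0)
    if energy == 0:
        return False
    min_lag = math.ceil(sample_rate / 350.0)    # about 2.9 ms
    max_lag = math.floor(sample_rate / 100.0)   # 10 ms
    for lag in range(min_lag, max_lag + 1):
        if short_time_autocorr(samples, lag) / energy > threshold_ratio:
            return True
    return False


# A 200 Hz tone sampled at 8 kHz has a 5 ms period, inside the pitch range.
tone = [math.sin(2 * math.pi * 200 * n / 8000) for n in range(400)]
print(has_pitch_peak(tone, 8000))
```

Restricting the lag range is what lets periodic sounds outside the 100 Hz to 350 Hz band be treated differently from voiced speech; the low-pass filter 31 that precedes the autocorrelator is not modelled here.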

This short-time autocorrelation function is calculated at every point of time t (at a sample point of time in actual practice), and is fed as a signal x to a peak detecting unit 33. A multiplexer is built into the output section of the autocorrelator 32 which scans the autocorrelation function $\psi(t, \tau)$ with respect to the delay time $\tau$ and outputs its result. Therefore, the input signal x of the peak detecting unit 33 has the waveform of Figure 10.
Referring to Figure 12, which shows the detailed structure of the peak detecting unit 33, the input signal x is divided by resistors R1 and R2, and supplied to an AND gate 120 as a signal y. The resistances of the resistors R1 and R2 are so set that the level of the signal y may become equal to the turn-on threshold value of the AND gate 120 when the level of the input signal x becomes equal to θ. A pulse series signal v is input through a signal line 121, and a signal which becomes "1" within the range defined in Equation (4) is given through a signal line 122. The circuit of Figure 12 allows a pulse to be supplied as a peak detection signal m only when the delay time $\tau$ is within the range defined in Equation (4) and the input signal x has exceeded the threshold value θ. Therefore, if the threshold value θ is appropriately set, a pulse signal is supplied as the peak detection signal m only when the input sound signal s is a voiced sound. The flip-flop 34 of Figure 9 is set to "1" by this pulse. Thus, if at least one voiced part is present in an SLS duration, the output of the flip-flop 34, namely the confirmation signal q', assumes the value "1". Conversely, if said short-time autocorrelation function $\psi(t, \tau)$ does not exceed the threshold value θ, the pulse of the peak detection signal m is not generated and consequently the flip-flop 34 remains at "0". Therefore, when an SLS duration is terminated and the detection signal p assumes the value "0", if a voiced part is present in this SLS duration, the flip-flop 34 is in a set state and the confirmation signal q' is "1". If no voiced component is present, the flip-flop 34 remains reset and the confirmation signal q' remains "0".
The program shown in Figure 14 is loaded into the minicomputer used in the recognition unit 30 of Figure 8. The minicomputer receives the confirmation signal q' after an SLS duration is terminated and said detection signal p becomes "0". If q' is "1", this SLS duration is judged to be speech, and the results of this recognition are fed to a signal line n.
If q' is "0", this SLS duration is judged to be noise such as breathing noise, and a rejecting code is fed to the signal line n.
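The interplay of the detection signal p, the peak pulses m, the flip-flop 34 and the accept-or-reject program can be sketched as follows. The per-frame lists, the function name `gate_results` and the `REJECT` marker are hypothetical stand-ins; the actual rejecting code and the program of Figure 14 are not specified at this level of detail.

```python
REJECT = "REJECT"  # hypothetical rejecting code


def gate_results(p, m, results):
    """Emit one output per SLS duration: the recognition result or REJECT.

    p: per-frame detection signal (0 or 1).
    m: per-frame peak detection pulses (0 or 1).
    results: recognition results, one per SLS duration, in order of occurrence.
    A duration still open at the end of the input produces no output.
    """
    outputs = []
    flip_flop = 0                            # flip-flop 34
    duration_index = 0
    previous_p = 0
    for p_now, m_now in zip(p, m):
        if p_now == 1 and previous_p == 0:
            flip_flop = 0                    # reset at the rising edge of p
        if p_now == 1 and m_now == 1:
            flip_flop = 1                    # set by a peak pulse (voiced part found)
        if p_now == 0 and previous_p == 1:   # falling edge: the duration has ended
            outputs.append(results[duration_index] if flip_flop else REJECT)
            duration_index += 1
        previous_p = p_now
    return outputs


p = [0, 1, 1, 1, 0, 0, 1, 1, 0]
m = [0, 0, 1, 0, 0, 0, 0, 0, 0]
print(gate_results(p, m, ["seven", "three"]))
```

In this run the first duration contains a peak pulse and its result is passed on, while the second contains none and is replaced by the rejecting marker.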
For the timed interrelationship between the detection signal p, the operation of the recognition system and the confirmation signal q', two processing procedures are adopted as illustrated in Figures 11 (a) and (b).
In accordance with the time chart of Figure 11 (a), the recognition unit performs the recognizing procedure as soon as an SLS duration is detected and the detection signal p is "1". Soon after the speech-like signal duration is terminated and the recognizing procedure is completed (the confirmation signal q' turns "1"), the recognition result is fed into the signal line n.
Referring to the time chart of Figure 11 (b), the recognizing procedure is performed only in the event that the confirmation signal q' is "1" when an SLS duration is terminated and the detection signal p turns "0".
The recognition result is given as soon as the recognizing procedure is completed. If the confirmation signal q' is "0" at the end of an SLS duration, no recognizing procedure is performed.
The procedure of Figure 11 (a) has the advantage of being able to be performed by a low-speed operation circuit because it permits a comparatively long time to be taken for the recognizing procedure.
The procedure of Figure 11 (b), which is allowed little time for the recognizing procedure when the recognition result has to be supplied promptly after the termination of a voice duration, requires a recognition unit composed of a high-speed operation circuit. However, when the confirmation signal q' is "0", i.e., a given SLS duration has been determined not to be speech, there is no need to operate the recognition unit. This permits such blank periods of the recognition unit to be utilized commonly by more than one recognition system.
Although the present invention has hitherto been described with reference to the embodiments, the structure of the voiced-speech duration detecting unit 3 is not limited to that shown in Figure 9. It is possible to distinguish between voiced and unvoiced sounds, for instance, depending on the ratio between the spectrum of a range where a pitch is present (100 Hz - 350 Hz) and that of a whole range (100 Hz - 6,000 Hz, for example).
This method contributes to simplification of a speech recognizing system using a filter bank analysis. The recognition result given in a speech-like signal duration without speech can be outputted either by giving no signal at all or by giving a rejecting code.
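The band-ratio alternative mentioned above could look roughly like this when filter-bank outputs are already available. The function name, the band layout and the ratio threshold of 0.3 are illustrative assumptions; the disclosure gives only the frequency ranges.

```python
def is_voiced_by_band_ratio(band_energies, band_edges_hz, ratio_threshold=0.3):
    """Judge voicing from the share of energy lying in the 100 Hz - 350 Hz band.

    band_energies: energy measured in each filter-bank channel.
    band_edges_hz: (low, high) edge frequencies of each channel, in hertz.
    ratio_threshold: hypothetical decision threshold on the energy ratio.
    """
    pitch_energy = 0.0
    total_energy = 0.0
    for energy, (low, high) in zip(band_energies, band_edges_hz):
        total_energy += energy
        if low >= 100 and high <= 350:   # channel lies inside the pitch range
            pitch_energy += energy
    if total_energy == 0:
        return False
    return pitch_energy / total_energy > ratio_threshold


edges = [(100, 350), (350, 1000), (1000, 6000)]
print(is_voiced_by_band_ratio([5.0, 3.0, 2.0], edges))   # strong low band
print(is_voiced_by_band_ratio([0.5, 3.0, 6.5], edges))   # mostly high-frequency energy
```

Because it reuses channel energies that a filter-bank analyzer produces anyway, no separate autocorrelator is needed in this variant.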


Claims

THE EMBODIMENTS OF THE INVENTION IN WHICH AN EXCLUSIVE
PROPERTY OR PRIVILEGE IS CLAIMED ARE DEFINED AS FOLLOWS:

1. A speech recognition system comprising: a speech detecting unit for detecting speech-like signal durations from input signal waves; means for analyzing the waveforms of said input signal waves to extract recognition parameters; a spectral-change detecting unit for detecting the magnitude of short-time spectral changes in said speech-like signal durations and deciding whether or not the integrated value of a short-time spectral change exceeds a threshold value to thereby determine whether or not each speech-like signal duration is speech and to deliver its output signal if the threshold value is exceeded; a recognition unit which recognizes on the basis of said recognition parameters speech patterns from signals fed from within said speech-like signal durations and which rejects some recognition results on the basis of the output signal received from said spectral-change detecting unit which indicates that said speech-like signal duration does not include speech; and a control unit for supplying control signals to said parameter extracting means and spectral-change detecting unit.

2. A speech recognition system comprising: a speech detecting unit for detecting speech-like signal durations from input signal waves; means for analyzing the waveforms of said input signal waves to extract recognition parameters; a voiced-speech duration detecting unit for determining whether or not voiced-speech durations are present in each speech-like signal duration and to thereby deliver the determination results as confirmation signals; and a recognition unit which recognizes, on the basis of said recognition parameters, speech patterns from signals input within said speech-like signal durations and which rejects such recognition result on the basis of the confirmation signal received from said voiced-speech duration detecting unit.