WO2005029463A9 - A method for recovering target speech based on speech segment detection under a stationary noise - Google Patents

A method for recovering target speech based on speech segment detection under a stationary noise

Info

Publication number
WO2005029463A9
Authority
WO
WIPO (PCT)
Prior art keywords
noise
speech
estimated
spectrum series
target speech
Prior art date
Application number
PCT/JP2004/012899
Other languages
French (fr)
Other versions
WO2005029463A1 (en)
Inventor
Hiromu Gotanda
Keiichi Kaneda
Takeshi Koya
Original Assignee
Kitakyushu Foundation
Univ Kinki
Hiromu Gotanda
Keiichi Kaneda
Takeshi Koya
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Kitakyushu Foundation, Univ Kinki, Hiromu Gotanda, Keiichi Kaneda, Takeshi Koya filed Critical Kitakyushu Foundation
Priority to US10/570,808 priority Critical patent/US7533017B2/en
Publication of WO2005029463A1 publication Critical patent/WO2005029463A1/en
Publication of WO2005029463A9 publication Critical patent/WO2005029463A9/en

Classifications

    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00: Processing of the speech or voice signal to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/02: Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L21/0208: Noise filtering
    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/78: Detection of presence or absence of voice signals

Definitions

  • the detection judgment criteria define the speech segment as a frame-number range where the total sum F is greater than the threshold value β and the noise segment as a frame-number range where the total sum F is less than or equal to the threshold value β. Accordingly, a speech segment detection function, which is a two-valued function for selecting either the speech segment or the noise segment depending on the threshold value β, can be defined. By use of this function, components falling in the speech segment can be easily extracted.
  • the method for recovering target speech based on speech segment detection under a stationary noise comprises: the first step of receiving target speech emitted from a sound source and a noise emitted from another sound source and forming mixed signals at a first microphone and at a second microphone, which are provided at separate locations, performing the Fourier transform of the mixed signals from the time domain to the frequency domain, and extracting estimated spectra Y* and Y corresponding to the target speech and the noise by use of the Independent Component Analysis; the second step of separating the estimated spectra Y* into an estimated spectrum series group y* in which the noise is removed and an estimated spectrum series group y in which the noise remains by applying separation judgment criteria based on the kurtosis of the amplitude distribution of each of the estimated spectrum series in Y*; the third step of detecting a speech segment and a noise segment in the time domain of the total sum F of all the estimated spectrum series in y* by applying detection judgment criteria based on a predetermined threshold value β that is determined by the maximum value of F
  • a target speech recovering apparatus 10, which employs a method for recovering target speech based on speech segment detection under a stationary noise according to the first and second embodiments of the present invention, comprises two sound sources 11 and 12 (one of which is a target speech source and the other is a noise source, although they are not identified), a first microphone 13 and a second microphone 14, which are provided at separate locations for receiving mixed signals transmitted from the two sound sources, a first amplifier 15 and a second amplifier 16 for amplifying the mixed signals received at the microphones 13 and 14 respectively, a recovering apparatus body 17 for separating the target speech and the noise from the mixed signals entered through the amplifiers 15 and 16 and outputting recovered signals of the target speech and the noise, a recovered signal amplifier 18 for amplifying the recovered signals outputted from the recovering apparatus body 17, and a loudspeaker 19 for outputting the amplified recovered signals.
  • Equation (1): when the signals from the sound sources 11 and 12 are convoluted, it is difficult to separate the signals s1(t) and s2(t) from the mixed signals x1(t) and x2(t) in the time domain. Therefore, the mixed signals x1(t) and x2(t) are divided into short time intervals (frames) and are transformed from the time domain to the frequency domain for each frame as in Equation (2):
  • the spectrum v11(ω,k) generated at the node 1 represents the signal spectrum s2(ω,k) transmitted from the sound source 12 and observed at the first microphone 13
  • the spectrum v12(ω,k) generated at the node 1 represents the signal spectrum s2(ω,k) transmitted from the sound source 12 and observed at the second microphone 14
  • the spectrum v21(ω,k) generated at the node 2 represents the signal spectrum s1(ω,k) transmitted from the sound source 11 and observed at the first microphone 13
  • the spectrum v22(ω,k) generated at the node 2 represents the signal spectrum s1(ω,k) transmitted from the sound source 11 and observed at the second microphone 14.
  • FIG. 6 shows the amplitude distribution of the estimated spectrum series in FIG. 4; and FIG. 7 shows the amplitude distribution of the estimated spectrum series in FIG. 5.
  • the frame-number range characterizing speech varies from an estimated spectrum series to an estimated spectrum series in y*.
  • the frame-number range characterizing the speech can be clearly defined.
  • An example of the total sum F of all the estimated spectrum series in y* is shown in FIG. 8, where the amplitude is normalized by its maximum value, so that the peak value in FIG. 8 is 1.
  • a visual inspection on the waveform of the target speech signal recovered from the estimated spectra Y* was carried out to visually determine the start and end points of the speech segment.
  • the comparison between the two methods revealed that the start point of the speech segment determined according to the present method deviated by -2.71 msec (with a standard deviation of 13.49 msec) from the start point determined by the visual inspection; and the end point of the speech segment determined according to the present method deviated by -4.96 msec (with a standard deviation

Abstract

Method for recovering target speech by extracting signal components falling in a speech segment, which is determined based on separated signals obtained through the Independent Component Analysis, thereby minimizing the residual noise in the recovered target speech. The present method comprises: the first step of receiving target speech emitted from a sound source and a noise emitted from another sound source and extracting estimated spectra Y* corresponding to the target speech by use of the Independent Component Analysis; the second step of separating from the estimated spectra Y* an estimated spectrum series group y* in which the noise is removed by applying separation judgment criteria based on the kurtosis of the amplitude distribution of each of the estimated spectrum series in Y*; the third step of detecting a speech segment and a noise segment of the total sum F of all the estimated spectrum series in y* by applying detection judgment criteria based on a predetermined threshold value β that is determined by the maximum value of F; and the fourth step of extracting components falling in the speech segment from the estimated spectra Y* to generate a recovered spectrum group of the target speech for recovering the target speech.

Description

DESCRIPTION A METHOD FOR RECOVERING TARGET SPEECH BASED ON SPEECH SEGMENT DETECTION UNDER A STATIONARY NOISE
CROSS REFERENCE TO RELATED APPLICATIONS This application claims priority under 35 U.S.C. 119 based upon Japanese
Patent Application No. 2003-314247, filed on September 5, 2003. The entire disclosure of the aforesaid application is incorporated herein by reference.
BACKGROUND OF THE INVENTION 1. Field of the Invention The present invention relates to a method for recovering target speech based on speech segment detection under a stationary noise by extracting signal components falling in a speech segment, which is determined based on separated signals obtained through the Independent Component Analysis (ICA), thereby minimizing the residual noise in the recovered target speech.
2. Description of the Related Art Recently, speech recognition technology has improved significantly, and speech recognition engines with extremely high recognition capability are available for ideal environments, i.e., environments with no surrounding noise. However, it is still difficult to attain a desirable recognition rate in household environments or offices where there are sounds of daily activities and the like. In order to take advantage of the inherent capability of the speech recognition engine in such environments, preprocessing is needed to remove noises from the mixed signals and pass only the target speech, such as a speaker's speech, to the engine. In this respect, the ICA and other speech emphasizing methods have been widely utilized and various algorithms have been proposed. (For example, see the following five references: 1. "An Information Maximization Approach to Blind Separation and Blind Deconvolution", by A. J. Bell and T. J. Sejnowski, Neural Computation, USA, MIT Press, June 1995, Vol. 7, No. 6, pp. 1129-1159; 2. "Natural Gradient Works Efficiently in Learning", by S. Amari, Neural Computation, USA, MIT
Press, February 1998, Vol. 10, No. 2, pp. 254-276; 3. "Independent Component Analysis Using an Extended Infomax Algorithm for Mixed Sub-Gaussian and Super-Gaussian Sources", by T. W. Lee, M. Girolami, and T. J. Sejnowski, Neural Computation, USA, MIT Press, February 1999, Vol. 11, No. 2, pp. 417-441; 4. "Fast and Robust Fixed-Point Algorithms for Independent Component Analysis", by A. Hyvarinen, IEEE Trans. Neural Networks, USA, IEEE, June 1999, Vol. 10, No. 3, pp. 626-634; and 5. "Independent Component Analysis: Algorithms and Applications", by A. Hyvarinen and E. Oja, Neural Networks, USA, Pergamon Press, June 2000, Vol. 13, No. 4-5, pp. 411-430.) Among these algorithms, the ICA is a method for separating noises from speech on the assumption that the sound sources are statistically independent. Although the ICA is capable of separating noises from speech well under ideal conditions without reverberation, its separation ability greatly degrades under real-life conditions with strong reverberation, because the reverberation leaves residual noise in the separated signals.
SUMMARY OF THE INVENTION In view of the above situations, the objective of the present invention is to provide a method for recovering target speech from signals received in a real-life environment. Based on the separated signals obtained through the ICA, a speech segment and a noise segment are defined. Thereafter signal components falling in the speech segment are extracted so as to minimize the residual noise in the recovered target speech. According to a first aspect of the present invention, the method for recovering target speech based on speech segment detection under a stationary noise comprises: the first step of receiving target speech emitted from a sound source and a noise emitted from another sound source and forming mixed signals at a first microphone and at a second microphone, which are provided at separate locations, performing the Fourier transform of the mixed signals from the time domain to the frequency domain, and extracting estimated spectra Y* and Y corresponding to the target speech and the noise by use of the Independent Component Analysis; the second step of separating the estimated spectra Y* into an estimated spectrum series group y* in which the noise is removed and an estimated spectrum series group y in which the noise remains by applying separation judgment criteria based on the kurtosis of the amplitude distribution of each of the estimated spectrum series in Y*; the third step of detecting a speech segment and a noise segment in the frame-number domain of the total sum F of all the estimated spectrum series in y* by applying detection judgment criteria based on a predetermined threshold value β that is determined by the maximum value of F; and the fourth step of extracting components falling in the speech segment from each of the estimated spectrum series in Y* to generate a recovered spectrum group of the target speech, and performing the inverse Fourier transform of the recovered spectrum group from the frequency domain to the time
domain to generate a recovered signal of the target speech. The target speech and noise signals received at the first and second microphones are mixed and convoluted. By transforming the signals from the time domain to the frequency domain, the convoluted mixing can be treated as instant mixing, making the separation procedure relatively easy. In addition, the sound sources are considered to be statistically independent; thus, the ICA can be employed. Since split spectra obtained through the ICA contain scaling ambiguity and permutation at each frequency, it is necessary to solve these problems first in order to extract the estimated spectra Y* and Y corresponding to the target speech and the noise respectively. Even after that, the estimated spectra Y* at some frequencies still contain the noise. There is a well known difference in statistical characteristics between speech and a noise in the time domain. That is, the amplitude distribution of speech has a high kurtosis with a high probability of occurrence around 0, whereas the amplitude distribution of a noise has a low kurtosis. The same characteristics are expected to be observed even after performing the Fourier transform of the speech and noise signals from the time domain to the frequency domain. At each frequency, a plurality of components form a spectrum series according to the frame number used for discretization. Therefore, by examining the kurtosis of the amplitude distribution of the estimated spectrum series in Y* at one frequency, it can be judged that, if the kurtosis is high, the noise is well removed at the frequency; and if the kurtosis is low, the noise still remains at the frequency. Consequently, each spectrum series in Y* can be assigned to either the estimated spectrum series group y* or y. Since the frequency components of a speech signal vary with time, the frame-number range characterizing speech varies from an estimated spectrum series to
an estimated spectrum series in y*. By taking the summation of all the estimated spectrum series in y* at each frame number to form the total sum F, and by specifying a threshold value β depending on the maximum value of F, the speech segment and the noise segment can be clearly defined in the frame-number domain. Therefore, noise components are practically non-existent in the recovered spectrum group, which is generated by extracting components falling in the speech segment from the estimated spectra Y*. The target speech is thus obtained by performing the inverse Fourier transform of the recovered spectrum group from the frequency domain to the time domain. It is preferable that the detection judgment criteria define the speech segment as a frame-number range where the total sum F is greater than the threshold value β and the noise segment as a frame-number range where the total sum F is less than or equal to the threshold value β. Accordingly, a speech segment detection function, which is a two-valued function for selecting either the speech segment or the noise segment depending on the threshold value β, can be defined. By use of this function, components falling in the speech segment can be easily extracted.
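The detection procedure described above (total sum F over the noise-removed group y*, a threshold β tied to the maximum of F, and the resulting two-valued detection function) can be sketched in a few lines. This is a minimal illustration, not the patent's implementation; the function name `detect_speech_segment` and the ratio used to set β are assumptions.

```python
import numpy as np

# Minimal sketch; `detect_speech_segment` and the ratio used to set the
# threshold beta are illustrative assumptions, not the patent's values.
def detect_speech_segment(y_star, beta_ratio=0.1):
    """Two-valued speech segment detection function over frame numbers.

    y_star: array of shape (num_series, num_frames) holding the amplitudes
    of the estimated spectrum series in the noise-removed group y*.
    """
    F = np.abs(y_star).sum(axis=0)      # total sum F at each frame number
    beta = beta_ratio * F.max()         # threshold determined by max of F
    return (F > beta).astype(int)       # 1: speech segment, 0: noise segment

# Toy input: four spectrum series, ten frames; frames 3-6 carry the speech.
y = np.full((4, 10), 0.01)
y[:, 3:7] = 1.0
d = detect_speech_segment(y)
```

On the toy input, frames whose summed amplitude exceeds β are marked 1 (speech segment) and the remainder 0 (noise segment), which is exactly the two-valued function used to extract speech-segment components.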
According to a second aspect of the present invention, the method for recovering target speech based on speech segment detection under a stationary noise comprises: the first step of receiving target speech emitted from a sound source and a noise emitted from another sound source and forming mixed signals at a first microphone and at a second microphone, which are provided at separate locations, performing the Fourier transform of the mixed signals from the time domain to the frequency domain, and extracting estimated spectra Y* and Y corresponding to the target speech and the noise by use of the Independent Component Analysis; the second step of separating the estimated spectra Y* into an estimated spectrum series group y* in which the noise is removed and an estimated spectrum series group y in which the noise remains by applying separation judgment criteria based on the kurtosis of the amplitude distribution of each of the estimated spectrum series in Y*; the third step of detecting a speech segment and a noise segment in the time domain of the total sum F of all the estimated spectrum series in y* by applying detection judgment criteria based on a predetermined threshold value β that is determined by the maximum value of F; and the fourth step of performing the inverse Fourier transform of the estimated spectra
Y* from the frequency domain to the time domain to generate a recovered signal of the target speech and extracting components falling in the speech segment from the recovered signal of the target speech to recover the target speech. At each frequency, a plurality of components form a spectrum series according to the frame number used for discretization. There is a one-to-one relationship between the frame number and the sampling time via the frame interval. By use of this relationship, the speech segment detected in the frame-number domain can be converted to the corresponding speech segment in the time domain. The other time interval can be defined as the noise segment. The target speech can thus be recovered by performing the inverse Fourier transform of the estimated spectra Y* from the frequency domain to the time domain to generate the recovered signal of the target speech and extracting components falling in the speech segment from the recovered signal in the time domain. It is preferable that the detection judgment criteria define the speech segment as a time interval where the total sum F is greater than the threshold value β and the noise segment as a time interval where the total sum F is less than or equal to the threshold value β. Accordingly, a speech segment detection function, which is a two-valued function for selecting either the speech segment or the noise segment depending on the threshold value β, can be defined. By use of this function, components falling in the speech segment can be easily extracted. It is preferable, in both the first and second aspects of the present invention, that the kurtosis of the amplitude distribution of each of the estimated spectrum series in Y* is evaluated by means of the entropy E of the amplitude distribution. The entropy E can be used for quantitatively evaluating the uncertainty of the amplitude distribution of each of the estimated spectrum series in Y*.
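The one-to-one correspondence between frame number and sampling time mentioned above can be sketched as follows; the frame interval (hop) of 128 samples and the 8 kHz sampling rate are illustrative values, not taken from the patent.

```python
# Illustrative frame/time correspondence; the hop length (frame interval)
# of 128 samples and the 8 kHz sampling rate are assumed values.
def frame_to_time(k, hop=128, fs=8000):
    """Start time in seconds of frame number k."""
    return k * hop / fs

def time_to_frame(t, hop=128, fs=8000):
    """Frame number whose interval contains time t (seconds)."""
    return int(t * fs // hop)

# A speech segment detected over frames [20, 60) maps to a time interval:
t_start, t_end = frame_to_time(20), frame_to_time(60)
```

A segment detected in the frame-number domain, here frames 20 to 60, thus converts directly to the time interval [0.32 s, 0.96 s) used to gate the recovered time-domain signal.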
In this case, the entropy E decreases as the noise is removed. Incidentally, for a quantitative measure of the kurtosis, μ4/σ4 may be used, where μ4 is the fourth moment about the mean and σ is the standard deviation. However, it is not preferable to use this measure because of its non-robustness in the presence of outliers. Statistically, a kurtosis is defined as the fourth order statistics as above. On the other hand, entropy is expressed as the weighted summation of all the moments (0th, 1st, 2nd, 3rd, ...) by the Taylor expansion. Therefore, entropy is a statistical measure that includes the kurtosis as one of its components. It is preferable, in both the first and second aspects of the present invention, that the separation judgment criteria are given as: (1) if the entropy E of an estimated spectrum series in Y* is less than a predetermined threshold value α, the estimated spectrum series in Y* is assigned to the estimated spectrum series group y*; and (2) if the entropy E of an estimated spectrum series in Y* is greater than or equal to the threshold value α, the estimated spectrum series in Y* is assigned to the estimated spectrum series group y.
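The two candidate measures discussed above can be sketched side by side: the fourth-order kurtosis μ4/σ4 and a histogram-based entropy E. The binning scheme and the surrogate distributions (a Laplacian stand-in for a peaked, speech-like amplitude distribution; a Gaussian stand-in for a stationary noise) are illustrative assumptions. Consistent with the text, the speech-like data shows higher kurtosis and lower entropy E.

```python
import numpy as np

# Illustrative measures; names, binning, and surrogate data are assumptions.
def kurtosis(a):
    """Fourth moment about the mean divided by sigma^4 (mu4 / sigma^4)."""
    a = np.asarray(a, dtype=float)
    mu4 = ((a - a.mean()) ** 4).mean()
    return mu4 / a.std() ** 4

def entropy(a, edges=np.linspace(-8.0, 8.0, 81)):
    """Shannon entropy E of the histogrammed, unit-variance amplitudes."""
    a = (a - a.mean()) / a.std()          # standardize for a fair comparison
    p, _ = np.histogram(a, bins=edges)
    p = p[p > 0] / len(a)
    return float(-(p * np.log(p)).sum())

rng = np.random.default_rng(0)
speech_like = rng.laplace(size=50000)     # peaked around 0, heavy tails
noise_like = rng.normal(size=50000)       # Gaussian stationary noise
```

A Laplacian has kurtosis 6 against the Gaussian's 3, while its standardized histogram entropy is lower, so either statistic separates the two groups; the patent prefers entropy for its robustness to outliers.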
The noise is well removed from the estimated spectrum series in Y* at some frequencies, but not from the others. Therefore, the entropy varies with ω. If the entropy E of an estimated spectrum series in Y* is less than the threshold value α, the estimated spectrum series in Y* is assigned to the estimated spectrum series group y* in which the noise is removed; and if the entropy E of an estimated spectrum series in Y* is greater than or equal to the threshold value α, the estimated spectrum series in Y* is assigned to the estimated spectrum series group y in which the noise remains. Based on the separation judgment criteria, which determine the selection of y* or y depending on α, it is easy to separate Y* into y* and y. According to the present invention as described in Claims 1, 2, 5, and 6, it is possible to extract signal components falling only in the speech segment, which is determined from the estimated spectra corresponding to the target speech, from the received signals under real-life conditions. Thus, the residual noise can be minimized to recover target speech with high quality. As a result, input operations by means of speech recognition in a noisy environment, such as voice commands or input for OA, for storage management in logistics, and for operating car navigation systems, may be able to replace the conventional input operations by use of fingers, touch sensors, or keyboards. According to the present invention as described in Claim 2, it is possible to easily define the frame-number range characterizing the target speech in each estimated spectrum series in Y*; thus, the speech segment can be quickly detected. As a result, it is possible to provide a speech recognition engine with a fast response time of speech recovery under real-life conditions, and at the same time, with high recognition ability.
According to the present invention as described in Claim 3, it is possible to extract signal components falling only in the speech segment in the time domain, which is determined from the estimated spectra corresponding to the target speech, from the received signals under real-life conditions. Thus, the residual noise can be minimized to recover target speech with high quality. As a result, input operations by means of speech recognition in a noisy environment, such as voice commands or input for OA, for storage management in logistics, and for operating car navigation systems, may be able to replace the conventional input operations by use of fingers, touch sensors, or keyboards. According to the present invention as described in Claim 4, it is possible to easily define the time interval characterizing the target speech in the recovered signal of the target speech with the minimal calculation load. As a result, it is possible to provide a speech recognition engine with a fast response time of speech recovery under real-life conditions, and at the same time, with high recognition ability. According to the present invention as described in Claim 5, it is possible to evaluate the kurtosis of the amplitude distribution of each of the estimated spectrum series in Y* even in the presence of outliers. Thus, it is possible to unambiguously separate the estimated spectrum series in Y* into y*, in which the noise is removed, and y, in which the noise remains. According to the present invention as described in Claim 6, it is possible to unambiguously separate the estimated spectrum series in Y* into y*, in which the noise is removed, and y, in which the noise remains, with the minimal calculation load. As a result, it is possible to provide a speech recognition engine with a fast response time of speech recovery under real-life conditions, and at the same time, with high recognition ability.
BRIEF DESCRIPTION OF THE DRAWINGS FIG. 1 is a block diagram showing a target speech recovering apparatus employing the method for recovering target speech based on speech segment detection under a stationary noise according to the first and second embodiments of the present invention. FIG. 2 is an explanatory view showing a signal flow in which a recovered spectrum is generated from the target speech and the noise per the method in FIG. 1. FIG. 3 is a graph showing the waveform of the recovered signal of the target speech, which is obtained after performing the inverse Fourier transform of the recovered spectrum group comprising the estimated spectra Y*. FIG. 4 is a graph showing an estimated spectrum series in y* in which the noise is removed. FIG. 5 is a graph showing an estimated spectrum series in y in which the noise remains. FIG. 6 is a graph showing the amplitude distribution of the estimated spectrum series in y* in which the noise is removed. FIG. 7 is a graph showing the amplitude distribution of the estimated spectrum series in y in which the noise remains. FIG. 8 is a graph showing the total sum of all the estimated spectrum series in y*. FIG. 9 is a graph showing the speech segment detection function. FIG. 10 is a graph showing the waveform of the recovered signal of the target speech after performing the inverse Fourier transform of the recovered spectrum group, which is obtained by extracting components falling in the speech segment from the estimated spectra Y*. FIG. 11 is a perspective view of the virtual room, where the locations of the sound sources and microphones are shown as employed in the Examples 1 and 2.
DETAILED DESCRIPTION OF THE PREFERRED EMBODIMENTS Embodiments of the present invention are described below with reference to the accompanying drawings to facilitate understanding of the present invention. As shown in FIG. 1, a target speech recovering apparatus 10, which employs a method for recovering target speech based on speech segment detection under a stationary noise according to the first and second embodiments of the present invention, comprises two sound sources 11 and 12 (one of which is a target speech source and the other is a noise source, although they are not identified), a first microphone 13 and a second microphone 14, which are provided at separate locations for receiving mixed signals transmitted from the two sound sources, a first amplifier 15 and a second amplifier 16 for amplifying the mixed signals received at the microphones 13 and 14 respectively, a recovering apparatus body 17 for separating the target speech and the noise from the mixed signals entered through the amplifiers 15 and 16 and outputting recovered signals of the target speech and the noise, a recovered signal amplifier 18 for amplifying the recovered signals outputted from the recovering apparatus body 17, and a loudspeaker 19 for outputting the amplified recovered signals. These elements are described in detail below. For the first and second microphones 13 and 14, microphones with a frequency range wide enough to receive signals over the audible range (10-20000 Hz) may be used. Here, the first microphone 13 is placed closer to the sound source 11 than the second microphone 14 is, and the second microphone 14 is placed closer to the sound source 12 than the first microphone 13 is. For the amplifiers 15 and 16, amplifiers with frequency band characteristics that allow non-distorted amplification of audible signals may be used.
The recovering apparatus body 17 comprises A/D converters 20 and 21 for digitizing the mixed signals entered through the amplifiers 15 and 16, respectively. The recovering apparatus body 17 further comprises a split spectra generating apparatus 22, equipped with a signal separating arithmetic circuit and a spectrum splitting arithmetic circuit. The signal separating arithmetic circuit performs the Fourier transform of the digitized mixed signals from the time domain to the frequency domain, and decomposes the mixed signals into two separated signals U1 and U2 by means of the Fast ICA. Based on transmission path characteristics of the four possible paths from the two sound sources 11 and 12 to the first and second microphones 13 and 14, the spectrum splitting arithmetic circuit generates from the separated signal U1 one pair of split spectra v11 and v12 which were received at the first microphone 13 and the second microphone 14 respectively, and generates from the separated signal U2 another pair of split spectra v21 and v22 which were received at the first microphone 13 and the second microphone 14 respectively. The recovering apparatus body 17 further comprises an estimated spectra extracting circuit 23 for extracting estimated spectra Y* of the target speech, wherein the split spectra v11, v12, v21, and v22 are analyzed by applying criteria based on sound transmission characteristics that depend on the four different distances between the first and second microphones 13 and 14 and the sound sources 11 and 12 to assign each split spectrum to the target speech or to the noise.
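As a rough illustration of the separation stage, the following sketch implements a minimal real-valued FastICA (tanh nonlinearity, symmetric decorrelation) and applies it to an instantaneous two-source mixture. The patent's split spectra generating apparatus applies the Fast ICA per frequency bin to complex spectra, so this real-valued stand-in demonstrates only the separation principle; the function name, surrogate sources, and mixing matrix are all illustrative assumptions.

```python
import numpy as np

# Minimal real-valued FastICA sketch (tanh nonlinearity, symmetric
# decorrelation). Per-frequency complex ICA, as in the patent, is not shown.
def fast_ica(X, n_iter=200, seed=0):
    X = X - X.mean(axis=0)                    # center the mixtures
    d, E = np.linalg.eigh(np.cov(X.T))
    K = E @ np.diag(d ** -0.5) @ E.T          # whitening matrix
    Z = X @ K                                 # whitened data (K is symmetric)
    W = np.random.default_rng(seed).normal(size=(2, 2))
    for _ in range(n_iter):
        G = np.tanh(Z @ W.T)                  # g(w_i^T z) for each component
        W = (G.T @ Z) / len(Z) - np.diag((1 - G ** 2).mean(axis=0)) @ W
        U_, _, Vt = np.linalg.svd(W)          # symmetric decorrelation:
        W = U_ @ Vt                           # W <- (W W^T)^(-1/2) W
    return Z @ W.T                            # separated signals U1, U2

rng = np.random.default_rng(1)
n = 5000
s1 = np.sign(np.sin(2 * np.pi * np.arange(n) / 50))  # "target speech" surrogate
s2 = rng.laplace(size=n)                             # stationary "noise" surrogate
A = np.array([[1.0, 0.6], [0.4, 1.0]])               # paths to microphones 13, 14
X = np.c_[s1, s2] @ A.T                              # mixed signals x1(t), x2(t)
U = fast_ica(X)                                      # recovered up to the scaling
                                                     # and permutation ambiguity
```

Each column of `U` matches one source up to sign, scale, and ordering, which is exactly the scaling and permutation ambiguity the estimated spectra extracting circuit 23 must resolve in the patent's frequency-domain setting.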
The recovering apparatus body 17 further comprises a speech segment detection circuit 24 for separating the estimated spectra Y* into an estimated spectrum series group y* in which the noise is removed and an estimated spectrum series group y in which the noise remains by applying separation judgment criteria based on the kurtosis of the amplitude distribution of each of the estimated spectrum series in Y*, and detecting a speech segment in the frame-number domain of a total sum F of all the estimated spectrum series in y* by applying detection judgment criteria based on a threshold value β that is determined by the maximum value of F. The recovering apparatus body 17 further comprises a recovered spectra extracting circuit 25 for extracting components falling in the speech segment from each of the estimated spectrum series in Y* to generate a recovered spectrum group of the target speech. The recovering apparatus body 17 further comprises a recovered signal generating circuit 26 for performing the inverse Fourier transform of the recovered spectrum group from the frequency domain to the time domain to generate the recovered signal of the target speech. The split spectra generating apparatus 22, equipped with the signal separating arithmetic circuit and the spectrum splitting arithmetic circuit, the estimated spectra extracting circuit 23, the speech segment detection circuit 24, the recovered spectra extracting circuit 25, and the recovered signal generating circuit 26 may be structured by loading programs for executing each circuit's functions on, for example, a personal computer. Also, it is possible to load the programs on a plurality of microcomputers and form a circuit for collective operation of these microcomputers. In particular, if the programs are loaded on a personal computer, the entire recovering apparatus body 17 may be structured by incorporating the A/D converters 20 and 21 into the personal computer.
For the recovered signal amplifier 18, an amplifier that allows analog conversion and non-distorted amplification of audible signals may be used. A loudspeaker that allows non-distorted output of audible signals may be used for the loudspeaker 19. The method for recovering target speech based on speech segment detection under a stationary noise according to the first embodiment of the present invention comprises: the first step of receiving a signal s1(t) from the sound source 11 and a signal s2(t) from the sound source 12 at the first and second microphones 13 and 14 and forming mixed signals x1(t) and x2(t) at the first microphone 13 and at the second microphone 14 respectively, performing the Fourier transform of the mixed signals x1(t) and x2(t) from the time domain to the frequency domain, and extracting estimated spectra Y* and Y corresponding to the target speech and the noise by use of the Fast
ICA, as shown in FIG. 2; the second step of separating the estimated spectra Y* into an estimated spectrum series group y* in which the noise is removed and an estimated spectrum series group y in which the noise remains by applying separation judgment criteria based on the kurtosis of the amplitude distribution of each of the estimated spectrum series in Y*; the third step of detecting a speech segment and a noise segment in the frame-number domain of a total sum F of all the estimated spectrum series in y* by applying detection judgment criteria based on a threshold value β that is determined by the maximum value of F; and the fourth step of extracting components falling in the speech segment from each of the estimated spectrum series in Y* to generate a recovered spectrum group of the target speech, and performing the inverse Fourier transform of the recovered spectrum group from the frequency domain to the time domain to generate the recovered signal of the target speech. The above steps are described in detail below. Here, "t" represents time throughout. 1. First Step In general, the signal s1(t) from the sound source 11 and the signal s2(t) from the sound source 12 are assumed to be statistically independent of each other. The mixed signals x1(t) and x2(t), which are obtained by receiving the signals s1(t) and s2(t) at the microphones 13 and 14 respectively, are expressed as in Equation (1):

x(t) = G(t) * s(t)   (1)
where s(t) = [s1(t), s2(t)]^T, x(t) = [x1(t), x2(t)]^T, * is a convolution operator, and G(t) represents transfer functions from the sound sources 11 and 12 to the first and second microphones 13 and 14. As in Equation (1), when the signals from the sound sources 11 and 12 are convoluted, it is difficult to separate the signals s1(t) and s2(t) from the mixed signals x1(t) and x2(t) in the time domain. Therefore, the mixed signals x1(t) and x2(t) are divided into short time intervals (frames) and are transformed from the time domain to the frequency domain for each frame as in Equation (2):
X_j(ω, k) = Σ_t e^(−iωt) x_j(t) w(t − kτ)   (2)   (j = 1, 2; k = 0, 1, …, K−1)

where ω (= 0, 2π/M, …, 2π(M−1)/M) is a normalized frequency, M is the number of samplings in a frame, w(t) is a window function, τ is a frame interval, and K is the number of frames. For example, the time interval can be several tens of milliseconds. In this way, it is also possible to treat the spectra as a group of spectrum series by laying out the components at each frequency in the order of frames. Moreover, in the frequency domain, it is possible to treat the recovery problem just like in the case of instantaneous mixing. In this case, the mixed signal spectra x(ω,k) and the corresponding spectra of the signals s1(t) and s2(t) are related to each other in the frequency domain as in Equation (3):

x(ω, k) = G(ω) s(ω, k)   (3)

where s(ω,k) is the discrete Fourier transform of a windowed s(t), and G(ω) is a complex number matrix that is the discrete Fourier transform of G(t). Since the signal spectra s1(ω,k) and s2(ω,k) are inherently independent of each other, if mutually independent separated signal spectra U1(ω,k) and U2(ω,k) are calculated from the mixed signal spectra x(ω,k) by use of the Fast ICA, these separated spectra will correspond to the signal spectra s1(ω,k) and s2(ω,k) respectively. In other words, by obtaining a separation matrix H(ω)Q(ω) with which the relationship expressed in Equation (4) is valid between the mixed signal spectra x(ω,k) and the separated signal spectra U1(ω,k) and U2(ω,k), it becomes possible to determine the
mutually independent separated signal spectra U1(ω,k) and U2(ω,k) from the mixed signal spectra x(ω,k).

u(ω, k) = H(ω) Q(ω) x(ω, k)   (4)

where u(ω,k) = [U1(ω,k), U2(ω,k)]^T. Incidentally, in the frequency domain, amplitude ambiguity and permutation occur at individual frequencies as in Equation (5):
H(ω) Q(ω) G(ω) = P D(ω)   (5)

where H(ω) is defined later in Equation (10), Q(ω) is a whitening matrix, P is a matrix representing permutation with only one element in each row and each column being 1 and all the other elements being 0, and D(ω) = diag[d1(ω), d2(ω)] is a diagonal matrix representing the amplitude ambiguity. Therefore, these problems need to be addressed in order to obtain meaningful separated signals for recovering. In the frequency domain, on the assumption that its real and imaginary parts have mean 0 and the same variance and are uncorrelated, each sound source spectrum s_i(ω,k) (i = 1, 2) is formulated as follows. First, at a frequency ω, a separation weight h_n(ω) (n = 1, 2) is obtained according to the FastICA algorithm, which is a modification of the Independent Component
Analysis algorithm, as shown in Equations (6) and (7):
h_n(ω) = (1/K) Σ_{k=0}^{K−1} { x(ω,k) ū_n(ω,k) f(|u_n(ω,k)|²) − [ f(|u_n(ω,k)|²) + |u_n(ω,k)|² f′(|u_n(ω,k)|²) ] h_n(ω) }   (6)

h_n(ω) = h_n(ω) / ||h_n(ω)||   (7)
where f(|u_n(ω,k)|²) is a nonlinear function, f′(|u_n(ω,k)|²) is the derivative of f(|u_n(ω,k)|²), the overbar denotes complex conjugation, and K is the number of frames. This algorithm is repeated until a convergence condition CC shown in Equation (8):
CC = h̄_n^T(ω) h_n(ω) ≈ 1   (8)

is satisfied (for example, CC becomes greater than or equal to 0.9999). Further, h_2(ω) is orthogonalized with h_1(ω) as in Equation (9):
h_2(ω) = h_2(ω) − h_1(ω) h̄_1^T(ω) h_2(ω)   (9)

and normalized as in Equation (7) again. The aforesaid FastICA algorithm is carried out for each frequency ω. The obtained separation weights h_n(ω) (n = 1, 2) determine H(ω) as in Equation (10):
H(ω) = [ h̄_1^T(ω) ; h̄_2^T(ω) ]   (10)
which is used in Equation (4) to calculate the separated signal spectra u(ω,k) = [U1(ω,k), U2(ω,k)]^T at each frequency. As shown in FIG. 2, the two nodes where the separated signal spectra U1(ω,k) and U2(ω,k) are outputted are referred to as nodes 1 and 2. The split spectra
[v11(ω,k), v12(ω,k)]^T and [v21(ω,k), v22(ω,k)]^T are defined as spectra generated as a pair (1 and 2) at nodes n (= 1, 2) from the separated signal spectra U1(ω,k) and U2(ω,k) respectively, as shown in Equations (11) and (12):

[v11(ω,k); v12(ω,k)] = [H(ω)Q(ω)]^(−1) [U1(ω,k); 0]   (11)

[v21(ω,k); v22(ω,k)] = [H(ω)Q(ω)]^(−1) [0; U2(ω,k)]   (12)
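A compact sketch of this construction follows, under the assumption (consistent with Equations (14) through (18) below) that the pair at node n is obtained by projecting the separated spectrum U_n back through the inverse of the overall separation matrix H(ω)Q(ω):

```python
import numpy as np

def split_spectra(H, Q, u):
    """Generate the split spectra of Equations (11) and (12) at one frequency.
    H, Q: 2x2 complex matrices; u: (2, K) separated spectrum series.
    Returns [v11; v12] and [v21; v22] as (2, K) arrays."""
    A = np.linalg.inv(H @ Q)  # estimated mixing matrix (up to permutation/scale)
    v1 = A @ np.vstack([u[0], np.zeros_like(u[0])])  # [v11; v12], Equation (11)
    v2 = A @ np.vstack([np.zeros_like(u[1]), u[1]])  # [v21; v22], Equation (12)
    return v1, v2
```

With H(ω)Q(ω)G(ω) = PD(ω), this projection reproduces the transfer-function products shown in Equations (14), (15), (17), and (18).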
If the permutation is not occurring but the amplitude ambiguity exists, the separated signal spectra Un(ω,k) are outputted as in Equation (13):
[U1(ω,k); U2(ω,k)] = [d1(ω) s1(ω,k); d2(ω) s2(ω,k)]   (13)
Then, the split spectra for the above separated signal spectra Un(ω,k) are generated as in Equations (14) and (15):
[v11(ω,k); v12(ω,k)] = [g11(ω) s1(ω,k); g21(ω) s1(ω,k)]   (14)

[v21(ω,k); v22(ω,k)] = [g12(ω) s2(ω,k); g22(ω) s2(ω,k)]   (15)

which show that the split spectra at each node are expressed as the product of the spectrum s1(ω,k) and a transfer function, or the product of the spectrum s2(ω,k) and a transfer function. Note here that g11(ω) is a transfer function from the sound source 11 to the first microphone 13, g21(ω) is a transfer function from the sound source 11 to the second microphone 14, g12(ω) is a transfer function from the sound source 12 to the first microphone 13, and g22(ω) is a transfer function from the sound source 12 to the second microphone 14. If there are both permutation and amplitude ambiguity, the separated signal spectra U_n(ω,k) are expressed as in Equation (16):
[U1(ω,k); U2(ω,k)] = [d1(ω) s2(ω,k); d2(ω) s1(ω,k)]   (16)
and the split spectra at the nodes 1 and 2 are generated as in Equations (17) and (18):
[v11(ω,k); v12(ω,k)] = [g12(ω) s2(ω,k); g22(ω) s2(ω,k)]   (17)

[v21(ω,k); v22(ω,k)] = [g11(ω) s1(ω,k); g21(ω) s1(ω,k)]   (18)
In the above, the spectrum v11(ω,k) generated at the node 1 represents the signal spectrum s2(ω,k) transmitted from the sound source 12 and observed at the first microphone 13, the spectrum v12(ω,k) generated at the node 1 represents the signal spectrum s2(ω,k) transmitted from the sound source 12 and observed at the second microphone 14, the spectrum v21(ω,k) generated at the node 2 represents the signal spectrum s1(ω,k) transmitted from the sound source 11 and observed at the first microphone 13, and the spectrum v22(ω,k) generated at the node 2 represents the signal spectrum s1(ω,k) transmitted from the sound source 11 and observed at the second microphone 14. The four spectra v11(ω,k), v12(ω,k), v21(ω,k) and v22(ω,k) shown in FIG. 2 can be separated into two groups, each consisting of two split spectra. One of the groups corresponds to one sound source, and the other corresponds to the other sound source. For example, in the absence of permutation, v11(ω,k) and v12(ω,k) correspond to one sound source; and in the presence of permutation, v21(ω,k) and v22(ω,k) correspond to the one sound source. Due to sound transmission characteristics, for example, sound intensities, that depend on the four different distances between the first and second microphones and the two sound sources, spectral intensities of the split spectra v11, v12, v21, and v22 differ from one another. Therefore, if distinctive distances are provided between the microphones and the sound sources, it is possible to determine which microphone received which sound source's signal. That is, it is possible to identify the sound source for each of the split spectra v11, v12, v21, and v22. Here, it is assumed that the sound source 11 is closer to the first microphone 13 than to the second microphone 14 and that the sound source 12 is closer to the second microphone 14 than to the first microphone 13.
In this case, comparison of transmission characteristics between the two possible paths from the sound source 11 to the microphones 13 and 14 provides a gain comparison as in Equation (19):
|g11(ω)| > |g21(ω)|   (19)
Similarly, by comparing transmission characteristics between the two possible paths from the sound source 12 to the microphones 13 and 14, a gain comparison is obtained as in Equation (20):
|g12(ω)| < |g22(ω)|   (20)
In this case, when Equations (14) and (15) or Equations (17) and (18) are used with the gain comparisons in Equations (19) and (20), if there is no permutation, calculation of the difference D1 between the spectra v11 and v12 and the difference D2 between the spectra v21 and v22 shows that D1 at the node 1 is positive and D2 at the node 2 is negative. On the other hand, if there is permutation, the similar analysis shows that D1 at the node 1 is negative and D2 at the node 2 is positive. In other words, the occurrence of permutation is recognized by examining the differences D1 and D2 between the respective split spectra: if D1 at the node 1 is positive and D2 at the node 2 is negative, the permutation is considered not occurring; and if D1 at the node 1 is negative and D2 at the node 2 is positive, the permutation is considered occurring. In case the difference D1 is calculated as a difference between absolute values of the spectra v11 and v12, and the difference D2 is calculated as a difference between absolute values of the spectra v21 and v22, the differences D1 and D2 are expressed as in Equations (21) and (22), respectively:
D1 = |v11(ω,k)| − |v12(ω,k)|   (21)
D2 = |v21(ω,k)| − |v22(ω,k)|   (22)
If there is no permutation, v11(ω,k) is selected as a spectrum y1(ω,k) of the signal from the one sound source that is closer to the first microphone 13 than to the second microphone 14. This is because the spectral intensity of v11(ω,k) observed at the first microphone 13 is greater than the spectral intensity of v12(ω,k) observed at the second microphone 14, and v11(ω,k) is less subject to the background noise than v12(ω,k). Also, if there is permutation, v21(ω,k) is selected as the spectrum y1(ω,k) for the one sound source. Therefore, the spectrum y1(ω,k) for the one sound source is expressed as in Equation (23):
y1(ω,k) = v11(ω,k) if D1 > 0, D2 < 0;  v21(ω,k) if D1 < 0, D2 > 0   (23)
Similarly for a spectrum y2(ω,k) for the other sound source, the spectrum v22(ω,k) is selected if there is no permutation, and the spectrum v12(ω,k) is selected if there is permutation, as in Equation (24):
y2(ω,k) = v12(ω,k) if D1 < 0, D2 > 0;  v22(ω,k) if D1 > 0, D2 < 0   (24)
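The selection rules of Equations (21) through (24) can be sketched as follows; this is an element-wise illustration, and the array shapes here are assumptions for the example, not prescribed by the text:

```python
import numpy as np

def select_y(v11, v12, v21, v22):
    """Per-frequency selection of y1 and y2 from the split spectra.
    Each argument is an (n_freq, K) complex array of split spectra."""
    D1 = np.abs(v11) - np.abs(v12)        # Equation (21)
    D2 = np.abs(v21) - np.abs(v22)        # Equation (22)
    no_perm = (D1 > 0) & (D2 < 0)         # permutation considered not occurring
    y1 = np.where(no_perm, v11, v21)      # Equation (23)
    y2 = np.where(no_perm, v22, v12)      # Equation (24)
    return y1, y2, no_perm
```

The boolean map `no_perm` also provides the per-frequency permutation counts used by the criteria (a) and (b) described in the text.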
The permutation occurrence is determined by using Equations (21) and (22). The FastICA method is characterized by its capability of sequentially separating signals from the mixed signals in descending order of non-Gaussianity.
Speech generally has higher non-Gaussianity than noise. Thus, if observed sounds consist of the target speech (i.e., the speaker's speech) and the noise, it is highly probable that a split spectrum corresponding to the speaker's speech is in the separated signal U1, which is the first output of this method. Thus, if the one sound source is the speaker, the permutation occurrence is highly unlikely; and if the other sound source is the speaker, the permutation occurrence is highly likely. Therefore, while the spectra y1 and y2 are generated, the number of permutation occurrences N− and the number of non-occurrences N+ over all the frequencies are counted, and the estimated spectra Y* and Y are determined by using the criteria given as: (a) if the count N+ is greater than the count N−, select the spectrum y1 as the estimated spectrum Y* and select the spectrum y2 as the estimated spectrum Y; or (b) if the count N− is greater than the count N+, select the spectrum y2 as the estimated spectrum Y* and select the spectrum y1 as the estimated spectrum Y.
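The per-frequency fixed-point iteration of the first step (Equations (6), (7), (9), and (10)) can be sketched as follows; the nonlinearity f(y) = log(a + y), the constant a, and the fixed iteration count are placeholder choices for illustration, not values prescribed by the text:

```python
import numpy as np

def fastica_weights(x, n_iter=50, a=0.1):
    """Estimate the two separation weights at one frequency from whitened
    mixed-signal spectra x, a (2, K) complex array, and return H(ω)."""
    rng = np.random.default_rng(0)
    H = []
    for n in range(2):
        h = rng.standard_normal(2) + 1j * rng.standard_normal(2)
        h /= np.linalg.norm(h)
        for _ in range(n_iter):
            u = h.conj() @ x                      # separated series u_n(ω, k)
            y = np.abs(u) ** 2
            f, fp = np.log(a + y), 1.0 / (a + y)  # f(|u|^2) and f'(|u|^2)
            # Equation (6): mean{ x u* f } - mean{ f + |u|^2 f' } h
            h = (x * u.conj() * f).mean(axis=1) - (f + y * fp).mean() * h
            if n == 1:                            # Equation (9): orthogonalize
                h = h - H[0] * (H[0].conj() @ h)
            h /= np.linalg.norm(h)                # Equation (7): normalize
        H.append(h)
    return np.vstack([H[0].conj(), H[1].conj()])  # H(ω), Equation (10)
```

A production implementation would test the convergence condition CC of Equation (8) instead of running a fixed number of iterations.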
2. Second Step
FIG. 3 shows the waveform of the target speech ("Tokyo"), which was obtained after the inverse transform of the recovered spectrum group comprising the estimated spectra as obtained above. It can be seen in this figure that the noise signal still remains in the recovered signal of the target speech. Therefore, the estimated spectrum series at each frequency was investigated. It was found that the noise had been removed from some of the estimated spectrum series in Y* (an example is shown in FIG. 4), while the noise still remained in the other estimated spectrum series in Y* (an example is shown in FIG. 5). In the estimated spectrum series in which the noise has been removed, the amplitude is large in the speech segment, and is extremely small in the non-speech segment, clearly defining the start and end points of the speech segment. Thus, it is expected that by using only the estimated spectrum series in which the noise has been removed, the speech segment can be obtained accurately. FIG. 6 shows the amplitude distribution of the estimated spectrum series in FIG. 4; and FIG. 7 shows the amplitude distribution of the estimated spectrum series in
FIG. 5. It can be seen from these figures that the amplitude distribution of the estimated spectrum series in which the noise has been removed has a high kurtosis; and the amplitude distribution of the estimated spectrum series in which the noise remains has a low kurtosis. Therefore, by applying separation judgment criteria based on the kurtosis of the amplitude distribution of each of the estimated spectrum series in Y*, it is possible to separate the estimated spectra Y* into an estimated spectrum series group y* in which the noise has been removed and an estimated spectrum series group y in which the noise remains. In order to quantitatively evaluate kurtosis values, entropy E of an amplitude distribution may be employed. The entropy E represents uncertainty of a main amplitude value. Thus, when the kurtosis is high, the entropy is low; and when the kurtosis is low, the entropy is high. Therefore, by use of a predetermined threshold value α, the separation judgment criteria are given as: (1) if the entropy E of an estimated spectrum series in Y* is less than the threshold value α, the estimated spectrum series in Y* is assigned to y*; and (2) if the entropy E of an estimated spectrum series in Y* is greater than or equal to the threshold value α, the estimated spectrum series in Y* is assigned to y.
The entropy is defined as in the following Equation (25):

E(ω) = − Σ_{n=1}^{N} p_ω(l_n) log p_ω(l_n)   (25)
where p_ω(l_n) (n = 1, 2, …, N) is a probability, which is equivalent to q_ω(l_n) (n = 1, 2, …, N) normalized as in the following Equation (26). Here, l_n indicates the n-th interval when the amplitude distribution range is divided into N equal intervals for the real part of an estimated spectrum series at each frequency in Y*, and q_ω(l_n) is a frequency of occurrence within the n-th interval.
p_ω(l_n) = q_ω(l_n) / Σ_{n=1}^{N} q_ω(l_n)   (26)
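Equations (25) and (26) can be sketched as follows; the number of intervals N is a placeholder value, not one prescribed by the text:

```python
import numpy as np

def amplitude_entropy(series, N=50):
    """Entropy E(ω) of the amplitude distribution of one estimated spectrum
    series: histogram the real part into N equal intervals, normalize to
    probabilities, and sum -p log p. High kurtosis gives low entropy."""
    q, _ = np.histogram(np.asarray(series).real, bins=N)  # q_ω(l_n)
    p = q / q.sum()                                       # Equation (26)
    p = p[p > 0]                                          # treat 0·log 0 as 0
    return -(p * np.log(p)).sum()                         # Equation (25)
```

A series with all its amplitude mass in one interval (high kurtosis) yields E = 0, while a uniform spread over all N intervals yields the maximum E = log N, matching the separation criterion with threshold α.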
3. Third Step Since the frequency components of a speech signal vary with time, the frame-number range characterizing speech varies from one estimated spectrum series to another in y*. By taking a summation of all the estimated spectrum series in y* at each frame number, the frame-number range characterizing the speech can be clearly defined. An example of the total sum F of all the estimated spectrum series in y* is shown in FIG. 8, where each amplitude value is normalized by the maximum value (which is 1 in FIG. 8). By specifying a threshold value β depending on the maximum value of F, the frame-number range where F is greater than β may be defined as the speech segment, and the frame-number range where F is less than or equal to β may be defined as the noise segment. Therefore, by applying the detection judgment criteria based on the amplitude distribution in FIG. 8 and the threshold value β, a speech segment detection function F*(k) is obtained, where F*(k) is a two-valued function which is 1 when F > β, and is 0 when F ≤ β.
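A minimal sketch of the third step, under the assumption that F is the sum of spectral amplitudes over the series in y*, normalized by its maximum as in FIG. 8:

```python
import numpy as np

def detect_segment(y_star, beta=0.08):
    """Compute the two-valued speech segment detection function F*(k).
    y_star: (n_series, K) array of noise-free estimated spectrum series."""
    F = np.abs(y_star).sum(axis=0)        # total sum over the series in y*
    F = F / F.max()                       # normalize by the maximum of F
    return (F > beta).astype(int)         # F*(k): 1 = speech, 0 = noise
```

The default β = 0.08 is the value selected in Example 1 below.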
4. Fourth Step By multiplying each estimated spectrum series in Y* by the speech segment detection function F*(k), it is possible to extract only the components falling in the
speech segment from the estimated spectrum series. Thereafter, the recovered spectrum group {Z(ω, k) | k = 0, 1, …, K−1} can be generated from all the estimated spectrum series in Y*, each having non-zero components only in the speech segment. The recovered signal of the target speech Z(t) is thus obtained by performing the inverse Fourier transform of the recovered spectrum group {Z(ω, k) | k = 0, 1, …, K−1} for each frame back to the time domain, and then taking the summation over all the frames as in Equation (27):
Z(t) = Σ_k Z_k(t) w(t − kτ)   (27)

where Z_k(t) is the inverse discrete Fourier transform of the frame spectrum Z(ω, k).
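The frame-wise inverse transform and summation of Equation (27) can be sketched as an overlap-add loop; windowing and normalization details are simplified here, and the frame interval τ is a placeholder:

```python
import numpy as np

def resynthesize(Z, tau=128):
    """Inverse-DFT each frame of the recovered spectrum group Z, an (M, K)
    array with frequency bins as rows, and overlap-add at interval tau."""
    M, K = Z.shape
    z = np.zeros((K - 1) * tau + M)
    for k in range(K):
        z[k * tau : k * tau + M] += np.fft.ifft(Z[:, k]).real
    return z
```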
FIG. 10 shows the recovered signal of the target speech after the inverse Fourier transform of the recovered spectrum group, which is obtained by multiplying each spectrum series in Y* by the speech segment detection function. It is clear upon comparing FIGs. 3 and 10 that there is no noise remaining in the recovered target speech in FIG. 10, unlike the recovered target speech in FIG. 3. The method for recovering target speech based on speech segment detection under a stationary noise according to the second embodiment of the present invention comprises: the first step of receiving a signal s1(t) from the sound source 11 and a signal s2(t) from the sound source 12 (one of which is a target speech source and the other is a noise source) at the first and second microphones 13 and 14 and forming mixed signals x1(t) and x2(t) at the first microphone 13 and at the second microphone 14 respectively, performing the Fourier transform of the mixed signals x1(t) and x2(t) from the time domain to the frequency domain, and extracting the estimated spectra Y* and Y corresponding to the target speech and the noise by use of the Fast ICA, as shown in FIG. 2; the second step of separating the estimated spectra Y* into an estimated spectrum series group y* in which the noise is removed and an estimated spectrum series group y in which the noise remains by applying separation judgment criteria based on the kurtosis of the amplitude distribution of each of the estimated spectrum series in Y*; the third step of detecting a speech segment and a noise segment
in the time domain of a total sum F of all the estimated spectrum series in y* by applying detection judgment criteria based on a threshold value β that is determined by the maximum value of F; and the fourth step of performing the inverse Fourier transform of the estimated spectra Y* from the frequency domain to the time domain to generate a recovered signal of the target speech and extracting components falling in the speech segment from the recovered signal of the target speech to recover the target speech. The differences in method between the first and second embodiments are in the third and fourth steps. In the second embodiment, the speech segment is obtained in the time domain, and the target speech is recovered by extracting the components falling in the speech segment from the recovered signal of the target speech in the time domain. Therefore, only the third and fourth steps are explained below. The relationship between the frame number k and the sampling time t is expressed as: τ(k−1) < t ≤ τk, where τ is the frame interval. Thus, k = ⌈t/τ⌉ holds, where ⌈t/τ⌉ is a ceiling symbol indicating the smallest integer greater than or equal to t/τ, and a speech segment detection function in the time domain F*(t) can be defined as: F*(t) = 1 in the range where F*(⌈t/τ⌉) = 1; and F*(t) = 0 in the range where F*(⌈t/τ⌉) = 0. Therefore, in the third step in the second embodiment, the speech segment is defined as the range in the time domain where F*(⌈t/τ⌉) = 1 holds; and the noise segment is defined as the range in the time domain where F*(⌈t/τ⌉) = 0 holds. In the fourth step of the second embodiment, the recovered signal of the target speech, which is obtained after the inverse Fourier transform of the estimated spectra Y* from the frequency domain to the time domain, is multiplied by F*(t), which is the speech segment detection function in the time domain, to extract the target speech signal.
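The frame-to-time mapping k = ⌈t/τ⌉ used by the second embodiment can be sketched in one line:

```python
import math

def frame_of(t, tau):
    """Map a sampling time t to its frame number k = ceil(t / tau), so the
    frame-domain detection function F*(k) can be read out in the time
    domain as F*(ceil(t / tau))."""
    return math.ceil(t / tau)
```

This satisfies the stated relation τ(k−1) < t ≤ τk: the boundary sample t = τk maps to frame k, and t = τk + 1 maps to frame k + 1.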
The resultant target speech signal is amplified by the recovered signal amplifier 18 and inputted to the loudspeaker 19.
(A) Example 1 Experiments were conducted in a virtual room with 10m length, 10m width, and 10m height. Microphones 1 and 2 and sound sources 1 and 2 were placed in the room as shown in FIG. 11. The mixed signals received at the microphones 1 and 2 were
analyzed by use of the FastICA, and a noise was removed to recover the target speech. The detection accuracy of the speech segment was evaluated. The distance between the microphones 1 and 2 was 0.5m; the distance between the two sound sources 1 and 2 was 0.5m; the microphones were placed 1m above the floor level; the two sound sources were placed 0.5m above the floor level; the distance between the microphone 1 and the sound source 1 was 0.5m; and the distance between the microphone 2 and the sound source 2 was 0.5m. The FastICA was carried out by employing the method described in "Permutation Correction and Speech Extraction Based on Split Spectrum through FastICA" by H. Gotanda, K. Nobu, T. Koya, K. Kaneda, and T. Ishibashi, Proc. of International Symposium on
Independent Component Analysis and Blind Signal Separation, April 1, 2003, pp. 379-384. At the sound source 1, each of two speakers (one male and one female) was placed and spoke five different words (zairyo, iyoiyo, urayamasii, omosiroi, and guai), yielding a total of ten different speech patterns. At the sound source 2, five different stationary noises (f16 noise, volvo noise, white noise, pink noise, and tank noise) selected from the Noisex-92 Database (http://spib.rice.edu/spib) were emitted. In total, 50 different mixed signals were generated. The speech segment detection function F*(k) is two-valued depending on the total sum F with respect to the threshold value β, and the total sum F is determined from the estimated spectrum series group y*, which is separated from the estimated spectra Y* according to the threshold value α; thus, the speech segment detection accuracy depends on α and β. Investigation was made to determine optimal values for α and β. The optimal values for α were found to be 1.8 - 2.3; and the optimal values for β were found to be 0.05 - 0.15. The values of α = 2.0 and β = 0.08 were selected. The start and end points of the speech segment were obtained according to the present method. Also, a visual inspection on the waveform of the target speech signal recovered from the estimated spectra Y* was carried out to visually determine the start and end points of the speech segment. The comparison between the two methods revealed that the start point of the speech segment determined according to the present method was -2.71msec (with a standard deviation of 13.49msec) with respect to the start point determined by the visual inspection; and the end point of the speech segment determined according to the present method was -4.96msec (with a standard deviation
of 26.07msec) with respect to the end point determined by the visual inspection. Therefore, the present method had a tendency of detecting the speech segment earlier than the visual inspection. Nonetheless, the difference in the speech segment between the two methods was very small, and the present method detected the speech segment with reasonable accuracy.
(B) Example 2 At the sound source 2, five different non-stationary noises (office, restaurant, classical, station, and street) selected from the NTT Noise Database (Ambient Noise Database for Telephonometry, NTT Advanced Technology Inc., 1996) were emitted.
Experiments were conducted under the same conditions as in Example 1. The results showed that the start point of the speech segment determined according to the present method was -2.36msec (with a standard deviation of 14.12msec) with respect to the start point determined by the visual inspection; and the end point of the speech segment determined according to the present method was -
13.40msec (with a standard deviation of 44.12msec) with respect to the end point determined by the visual inspection. Therefore, the present method is capable of detecting the speech segment with reasonable accuracy, functioning almost as well as the visual inspection even for the case of a non-stationary noise. While the invention has been so described, the present invention is not limited to the aforesaid embodiments and can be modified variously without departing from the spirit and scope of the invention, and may be applied to cases in which the method for recovering target speech based on speech segment detection under a stationary noise according to the present invention is structured by combining part or entirety of each of the aforesaid embodiments and/or its modifications. For example, in the present method, the FastICA is employed in order to extract the estimated spectra Y* and Y corresponding to the target speech and the noise respectively, but the extraction method does not have to be limited to this method. It is possible to extract the estimated spectra Y* and Y by using the ICA, resolving the scaling ambiguity based on the sound transmission characteristics that depend on the four different paths between the two microphones and the sound sources, and resolving the permutation problem based on the similarity of envelope curves of spectra at individual frequencies.

Claims

WE CLAIM: 1. A method for recovering target speech based on speech segment detection under a stationary noise, the method comprising: a first step of receiving target speech emitted from a sound source and a noise emitted from another sound source and forming mixed signals at a first microphone and at a second microphone, which are provided at separate locations, performing the
Fourier transform of the mixed signals from a time domain to a frequency domain, and extracting estimated spectra Y* and Y corresponding to the target speech and the noise by use of the Independent Component Analysis; a second step of separating the estimated spectra Y* into an estimated spectrum series group y* in which the noise is removed and an estimated spectrum series group y in which the noise remains by applying separation judgment criteria based on a kurtosis of an amplitude distribution of each of estimated spectrum series in Y*; a third step of detecting a speech segment and a noise segment in a frame number domain of a total sum F of all the estimated spectrum series in y* by applying detection judgment criteria based on a predetermined threshold value β that is determined by a maximum value of F; and a fourth step of extracting components falling in the speech segment from each of the estimated spectrum series in Y* to generate a recovered spectrum group of the target speech, and performing the inverse Fourier transform of the recovered spectrum group from the frequency domain to the time domain to generate a recovered signal of the target speech.
2. The method set forth in Claim 1, wherein the detection judgment criteria define the speech segment as a frame number range where the total sum F is greater than the threshold value β and the noise segment as a frame number range where the total sum F is less than or equal to the threshold value β.
3. A method for recovering target speech based on speech segment detection under a stationary noise, the method comprising:
a first step of receiving target speech emitted from a sound source and a noise emitted from another sound source and forming mixed signals at a first microphone and at a second microphone, which are provided at separate locations, performing the Fourier transform of the mixed signals from a time domain to a frequency domain, and extracting estimated spectra Y* and Y corresponding to the target speech and the noise by use of the Independent Component Analysis; a second step of separating the estimated spectra Y* into an estimated spectrum series group y* in which the noise is removed and an estimated spectrum series group y in which the noise remains by applying separation judgment criteria based on a kurtosis of an amplitude distribution of each of estimated spectrum series in
Y*; a third step of detecting a speech segment and a noise segment in the time domain of a total sum F of all the estimated spectrum series in y* by applying detection judgment criteria based on a predetermined threshold value β that is determined by a maximum value of F; and a fourth step of performing the inverse Fourier transform of the estimated spectra Y* from the frequency domain to the time domain to generate a recovered signal of the target speech and extracting components falling in the speech segment from the recovered signal of the target speech to recover the target speech.
4. The method set forth in Claim 3, wherein the detection judgment criteria define the speech segment as a time interval where the total sum F is greater than the threshold value β, and the noise segment as a time interval where the total sum F is less than or equal to the threshold value β.
5. The method set forth in Claim 1 , 2, 3, or 4, wherein the kurtosis of the amplitude distribution of each of the estimated spectrum series in Y* is evaluated by means of entropy E of the amplitude distribution.
6. The method set forth in Claim 5, wherein the separation judgment criteria are given as:
(1) if the entropy E of an estimated spectrum series in Y* is less than a predetermined threshold value α, the estimated spectrum series in Y* is assigned to the estimated spectrum series group y*; and
(2) if the entropy E of an estimated spectrum series in Y* is greater than or equal to the threshold value α, the estimated spectrum series in Y* is assigned to the estimated spectrum series group y.
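The separation and detection criteria of claims 3 through 6 can be illustrated with a minimal numpy sketch. The claims leave implementation details open, so several choices here are assumptions: the entropy E is estimated from a histogram of amplitudes (the claims only say entropy evaluates kurtosis), and the threshold β is taken as a fixed fraction of the maximum of F (the claims only say β is determined by the maximum value of F). The ICA and Fourier-transform steps are assumed to have already produced the estimated spectrum series.

```python
import numpy as np

def entropy_of_amplitude(spectrum_series, n_bins=50):
    """Entropy E of the amplitude distribution of one estimated
    spectrum series (claims 5-6). A spiky, speech-like series has a
    high-kurtosis amplitude distribution and hence low entropy; a
    noise-like series spreads over many bins and has high entropy.
    The histogram-based estimate is an assumed illustrative choice."""
    amps = np.abs(spectrum_series)
    hist, _ = np.histogram(amps, bins=n_bins)
    p = hist / hist.sum()
    p = p[p > 0]                     # drop empty bins before log
    return -np.sum(p * np.log(p))

def split_by_entropy(Y_star, alpha):
    """Claim 6: a series with E < alpha joins group y* (noise
    removed); a series with E >= alpha joins group y (noise remains)."""
    y_star, y = [], []
    for series in Y_star:
        if entropy_of_amplitude(series) < alpha:
            y_star.append(series)
        else:
            y.append(series)
    return y_star, y

def detect_speech_segments(y_star, ratio=0.1):
    """Claims 3-4: form F(t), the total sum over all retained series
    of the spectral amplitudes, and mark speech where F > beta and
    noise where F <= beta.  beta = ratio * max(F) is an assumed way
    of deriving beta from the maximum value of F."""
    F = np.sum(np.abs(np.asarray(y_star)), axis=0)
    beta = ratio * F.max()
    return F > beta, F, beta
```

A speech-like series (a few large spikes over near-zero background) then yields much lower entropy than a Gaussian noise series, so a threshold α between the two values reproduces the grouping of claim 6, and the boolean mask from `detect_speech_segments` marks the components to keep after the inverse Fourier transform of claim 3's fourth step.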
PCT/JP2004/012899 2003-09-05 2004-08-31 A method for recovering target speech based on speech segment detection under a stationary noise WO2005029463A1 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
US10/570,808 US7533017B2 (en) 2004-08-31 2004-08-31 Method for recovering target speech based on speech segment detection under a stationary noise

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
JP2003-314247 2003-09-05
JP2003314247A JP4496378B2 (en) 2003-09-05 2003-09-05 Restoration method of target speech based on speech segment detection under stationary noise

Publications (2)

Publication Number Publication Date
WO2005029463A1 WO2005029463A1 (en) 2005-03-31
WO2005029463A9 true WO2005029463A9 (en) 2005-07-07

Family

ID=34372498

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/JP2004/012899 WO2005029463A1 (en) 2003-09-05 2004-08-31 A method for recovering target speech based on speech segment detection under a stationary noise

Country Status (2)

Country Link
JP (1) JP4496378B2 (en)
WO (1) WO2005029463A1 (en)

Families Citing this family (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2006337851A (en) * 2005-06-03 2006-12-14 Sony Corp Speech signal separating device and method
DE602006019099D1 (en) * 2005-06-24 2011-02-03 Univ Monash LANGUAGE ANALYSIS SYSTEM
JP4556875B2 (en) 2006-01-18 2010-10-06 ソニー株式会社 Audio signal separation apparatus and method
JP2007324754A (en) 2006-05-30 2007-12-13 Ntt Docomo Inc Signal receiving section detector
US9159335B2 (en) 2008-10-10 2015-10-13 Samsung Electronics Co., Ltd. Apparatus and method for noise estimation, and noise reduction apparatus employing the same
JP5207479B2 (en) * 2009-05-19 2013-06-12 国立大学法人 奈良先端科学技術大学院大学 Noise suppression device and program
JP2011081293A (en) * 2009-10-09 2011-04-21 Toyota Motor Corp Signal separation device and signal separation method
JP6878776B2 (en) 2016-05-30 2021-06-02 富士通株式会社 Noise suppression device, noise suppression method and computer program for noise suppression
CN106157950A (en) * 2016-09-29 2016-11-23 合肥华凌股份有限公司 Speech control system and awakening method, Rouser and household electrical appliances, coprocessor
CN106504762B (en) * 2016-11-04 2023-04-14 中南民族大学 Bird community number estimation system and method
CN109951762B (en) * 2017-12-21 2021-09-03 音科有限公司 Method, system and device for extracting source signal of hearing device
CN112289343B (en) * 2020-10-28 2024-03-19 腾讯音乐娱乐科技(深圳)有限公司 Audio repair method and device, electronic equipment and computer readable storage medium

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20020116187A1 (en) * 2000-10-04 2002-08-22 Gamze Erten Speech detection

Also Published As

Publication number Publication date
JP2005084244A (en) 2005-03-31
JP4496378B2 (en) 2010-07-07
WO2005029463A1 (en) 2005-03-31

Similar Documents

Publication Publication Date Title
Luo et al. Speaker-independent speech separation with deep attractor network
US7562013B2 (en) Method for recovering target speech based on amplitude distributions of separated signals
US9668066B1 (en) Blind source separation systems
US9008329B1 (en) Noise reduction using multi-feature cluster tracker
US7533017B2 (en) Method for recovering target speech based on speech segment detection under a stationary noise
US7383178B2 (en) System and method for speech processing using independent component analysis under stability constraints
US9524730B2 (en) Monaural speech filter
US7315816B2 (en) Recovering method of target speech based on split spectra using sound sources&#39; locational information
JP5375400B2 (en) Audio processing apparatus, audio processing method and program
Hassan et al. A comparative study of blind source separation for bioacoustics sounds based on FastICA, PCA and NMF
CN111899756B (en) Single-channel voice separation method and device
KR101877127B1 (en) Apparatus and Method for detecting voice based on correlation between time and frequency using deep neural network
WO2005029463A9 (en) A method for recovering target speech based on speech segment detection under a stationary noise
KR20130068869A (en) Interested audio source cancellation method and voice recognition method thereof
Do et al. Speech Separation in the Frequency Domain with Autoencoder.
JP2002023776A (en) Method for identifying speaker voice and non-voice noise in blind separation, and method for specifying speaker voice channel
Girin et al. Audio source separation into the wild
Agcaer et al. Optimization of amplitude modulation features for low-resource acoustic scene classification
Subba Ramaiah et al. A novel approach for speaker diarization system using TMFCC parameterization and Lion optimization
CN116312561A (en) Method, system and device for voice print recognition, authentication, noise reduction and voice enhancement of personnel in power dispatching system
Kothapally et al. Speech Detection and Enhancement Using Single Microphone for Distant Speech Applications in Reverberant Environments.
Chowdhury et al. Speech enhancement using k-sparse autoencoder techniques
CN110675890B (en) Audio signal processing device and audio signal processing method
Pwint et al. A new speech/non-speech classification method using minimal Walsh basis functions
Versiani et al. Binary spectral masking for speech recognition systems

Legal Events

Date Code Title Description
AK Designated states

Kind code of ref document: A1

Designated state(s): AE AG AL AM AT AU AZ BA BB BG BW BY BZ CA CH CN CO CR CU CZ DK DM DZ EC EE EG ES FI GB GD GE GM HR HU ID IL IN IS KE KG KP KR LC LK LR LS LT LU LV MA MD MG MN MW MX MZ NA NI NO NZ OM PG PL PT RO RU SC SD SE SG SK SL SY TM TN TR TT TZ UA UG US UZ VC YU ZA ZM

AL Designated countries for regional patents

Kind code of ref document: A1

Designated state(s): BW GH GM KE LS MW MZ NA SD SZ TZ UG ZM ZW AM AZ BY KG MD RU TJ TM AT BE BG CH CY DE DK EE ES FI FR GB GR HU IE IT MC NL PL PT RO SE SI SK TR BF CF CG CI CM GA GN GQ GW ML MR SN TD TG

121 Ep: the epo has been informed by wipo that ep was designated in this application
COP Corrected version of pamphlet

Free format text: PAGES 3-6, 10-23, DESCRIPTION, REPLACED BY NEW PAGES 3-6, 10-23; PAGES 26-28, CLAIMS, REPLACED BY NEW PAGES 26-28; AFTER RECTIFICATION OF OBVIOUS ERRORS AUTHORIZED BY THE INTERNATIONAL SEARCH AUTHORITY

WWE Wipo information: entry into national phase

Ref document number: 2007055511

Country of ref document: US

Ref document number: 10570808

Country of ref document: US

122 Ep: pct application non-entry in european phase
WWP Wipo information: published in national office

Ref document number: 10570808

Country of ref document: US