WO2010061505A1 - Utterance voice detection device - Google Patents
Utterance voice detection device
- Publication number
- WO2010061505A1 (PCT/JP2009/004339)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- voice
- input power
- frequency
- power
- signal
- Prior art date
Classifications
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/78—Detection of presence or absence of voice signals
- G10L2025/783—Detection of presence or absence of voice signals based on threshold decision
- G10L2025/786—Adaptive threshold
Definitions
- The present invention relates to an utterance voice detection device that determines whether or not an input voice is an utterance voice.
- An utterance voice detection device that determines whether or not an input voice is an utterance voice (a voice uttered by a user) is known.
- The device described in Patent Document 1, one utterance voice detection device of this type, includes a plurality of microphones.
- This utterance voice detection device accepts an audio signal input via each microphone. It then calculates the input power of the audio signal, that is, a value representing the magnitude of the voice represented by the received audio signal. Based on the calculated input power, it determines whether or not the voice represented by the audio signal input via each microphone is an utterance voice.
- The input power of the audio signal received through each microphone may differ from microphone to microphone because of individual differences between microphones, differing degrees of deterioration over time, or differences in transmission systems (wiring and the like).
- The apparatus described in Patent Document 2, one signal correction apparatus of this type, receives an audio signal input via a certain microphone and calculates the input power of the received audio signal for each frequency. Next, the signal correction apparatus calculates, for each frequency, the ratio between a reference power (for example, the average value of the input power of the audio signals input through the microphones) and the calculated input power, and sets a correction coefficient according to the calculated ratio.
- The signal correction apparatus then corrects the input power of the received audio signal based on the set correction coefficient. The input power of the received audio signal can thereby be brought close to the reference power for each frequency. Therefore, applying this signal correction apparatus to the utterance voice detection device makes it possible to determine with high accuracy whether or not each of the voices input via the microphones is an utterance voice.
- In the signal correction apparatus, however, for some reason (for example, noise superimposed on the input audio signal, or an excessive delay time associated with propagation of the input audio signal), an audio signal whose input power at a certain frequency is excessively larger (or smaller) than at other frequencies may be input.
- In such a case, the correction coefficient set for this frequency becomes too small (or too large). That is, the input power of the received audio signal cannot be brought sufficiently close to the reference power at this frequency.
- An object of the present invention is to provide an utterance voice detection device capable of solving the above problem, namely that it may not be possible to determine with high accuracy whether or not an input voice is an utterance voice.
- An utterance voice detection device includes:
- voice receiving means for receiving an input audio signal;
- input power calculation means for performing input power calculation processing that calculates, for each frequency, input power representing the magnitude of the voice represented by the audio signal, based on the audio signal received by the voice receiving means;
- correction function estimation means for performing correction function estimation processing that estimates a correction function, the correction function being a continuous function that defines the relationship between a frequency and a correction coefficient for bringing the input power calculated for that frequency close to a reference power determined for that frequency;
- input power correction means for performing input power correction processing that corrects the input power by multiplying, for each frequency, the calculated input power by the correction coefficient acquired according to the relationship defined by the estimated correction function; and
- utterance voice detection means for performing utterance voice detection processing that determines, based on the corrected input power, whether or not the voice represented by the received audio signal is an utterance voice.
- An utterance voice detection method is a method that: performs input power calculation processing that calculates, for each frequency, input power representing the magnitude of the voice represented by an audio signal, based on the audio signal received by voice receiving means that receives an input audio signal; performs correction function estimation processing that estimates a correction function, the correction function being a continuous function that defines the relationship between a frequency and a correction coefficient for bringing the input power calculated for that frequency close to a reference power determined for that frequency; performs input power correction processing that corrects the input power by multiplying, for each frequency, the calculated input power by the correction coefficient acquired according to the relationship defined by the estimated correction function; and performs utterance voice detection processing that determines, based on the corrected input power, whether or not the voice represented by the received audio signal is an utterance voice.
- Input power calculation means for performing input power calculation processing that calculates, for each frequency, input power representing the magnitude of the voice represented by an audio signal, based on the audio signal received by voice receiving means that receives an input audio signal;
- Correction function estimation means for performing correction function estimation processing that estimates a correction function, the correction function being a continuous function that defines the relationship between a frequency and a correction coefficient for bringing the input power calculated for that frequency close to a reference power determined for that frequency;
- Input power correction means for performing input power correction processing that corrects the input power by multiplying, for each frequency, the calculated input power by the correction coefficient acquired according to the relationship defined by the estimated correction function.
- Because the present invention is configured as described above, it can determine with high accuracy whether or not an input voice is an utterance voice.
- The utterance voice detection device 1 is an information processing device.
- The utterance voice detection device 1 includes a central processing unit (CPU), a storage device (memory and a hard disk drive (HDD)), and an input device (not shown).
- The input device is connected to a plurality of microphones MC1, ..., MCk, ..., MCL (L microphones in this example, where L is an integer and k is an integer from 1 to L).
- Each microphone collects surrounding sounds and outputs an audio signal representing the collected sounds to the input device.
- The input device receives the audio signals output from the microphones. The input device and the microphones MC1 to MCL constitute the voice receiving means.
- The functions of the utterance voice detection device 1 configured as described above are realized by the CPU of the utterance voice detection device 1 executing a program or the like represented by the flowchart shown in FIG. These functions may alternatively be realized by hardware such as a logic circuit.
- The utterance voice detection device 1 operates in the same manner for each of the microphones MC1 to MCL. The functions of the utterance voice detection device 1 are therefore described below for a microphone MCk, which is an arbitrary one of the microphones MC1 to MCL.
- The functions of the utterance voice detection device 1 include an input power calculation unit (input power calculation means) 11, an input power correction unit (input power correction means) 12, a time average power calculation unit (time average power calculation means) 13, a correction function estimation unit (correction function estimation means) 14, a correction function storage unit 15, and an utterance voice detection unit (utterance voice detection means) 16.
- The input power calculation unit 11 converts the audio signal input from the microphone MCk from an analog signal to a digital signal by performing A/D (analog/digital) conversion processing on it.
- The input power calculation unit 11 divides the converted audio signal at predetermined (in this example, constant) frame intervals.
- The input power calculation unit 11 performs the following processing on each portion (frame signal) of the divided audio signal.
- First, the input power calculation unit 11 performs predetermined preprocessing (for example, pre-emphasis processing and windowing processing that applies a window function) on the frame signal. Next, it performs fast Fourier transform (FFT) processing on the frame signal to obtain a frame signal in the frequency domain (complex numbers, each composed of a real part and an imaginary part).
- The input power calculation unit 11 then calculates, for each frequency, the sum of the square of the real part and the square of the imaginary part of the acquired frame signal as the input power x_i(t).
- In this example, the sampling frequency is 44.1 kHz, the digital signal is quantized with 16 bits, the frame interval is 10 ms, and the FFT processing is performed with 1024 points.
- The input power x_i(t) is therefore calculated at intervals of about 43 Hz (44,100 Hz / 1024 ≈ 43.07 Hz).
- Here, i is a number corresponding to the frequency (in this example, an increase of 1 in i corresponds to an increase of about 43 Hz in frequency), and t is a number representing the position of the frame signal on the time axis (for example, a frame number identifying the frame).
- In this way, the input power calculation unit 11 divides the audio signal received via the microphone MCk at predetermined frame intervals and calculates, for each frequency, the input power x_i(t) for each portion (frame signal) of the divided audio signal.
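The input power calculation described above can be sketched as follows. This is a minimal illustration only: the function names, the Hann window, and the zero-padding of a short frame up to the transform size are assumptions not stated in the text, pre-emphasis is omitted, and a naive discrete Fourier transform stands in for the FFT so that the sketch needs only the standard library.

```python
import cmath
import math
from typing import List, Sequence


def bin_spacing_hz(sample_rate_hz: float, n_fft: int) -> float:
    """Frequency step between consecutive numbers i."""
    return sample_rate_hz / n_fft


def frame_input_power(frame: Sequence[float], n_fft: int) -> List[float]:
    """Return x_i(t) = Re^2 + Im^2 for bins i = 0 .. n_fft // 2 of one frame."""
    n = len(frame)
    if not 2 <= n <= n_fft:
        raise ValueError("frame length must be between 2 and n_fft")
    # Windowing (assumed Hann window), then zero-padding up to n_fft samples.
    windowed = [s * (0.5 - 0.5 * math.cos(2.0 * math.pi * k / (n - 1)))
                for k, s in enumerate(frame)]
    padded = windowed + [0.0] * (n_fft - n)
    powers = []
    for i in range(n_fft // 2 + 1):
        # Naive DFT for bin i; an FFT yields the same value far more quickly.
        spectrum = sum(padded[k] * cmath.exp(-2j * math.pi * i * k / n_fft)
                       for k in range(n_fft))
        # Input power: squared real part plus squared imaginary part.
        powers.append(spectrum.real ** 2 + spectrum.imag ** 2)
    return powers
```

With the example values in the text, `bin_spacing_hz(44100.0, 1024)` gives the spacing of about 43 Hz. A 10 ms frame at 44.1 kHz holds 441 samples, so padding it to 1024 points is one plausible reading of the example rather than something the text specifies.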
- The input power correction unit 12 corrects the input power x_i(t) calculated by the input power calculation unit 11 by multiplying it, for each frequency, by the correction coefficient f_i stored by the correction function storage unit 15. The input power correction unit 12 then outputs the corrected input power x'_i(t).
- The correction coefficient f_i is a value acquired according to the relationship defined by the correction function.
- The correction function is a continuous function that defines the relationship between the number i corresponding to a frequency (that is, the frequency) and the correction coefficient f_i for bringing the input power x_i(t) calculated for that frequency close to the reference power determined for that frequency.
- The correction function is a polynomial function whose variable is the frequency. As described later, the correction function is estimated by the time average power calculation unit 13 and the correction function estimation unit 14.
- The time average power calculation unit 13 calculates, for each frequency, the time average power x_i by averaging the input powers x_i(t) that the input power calculation unit 11 calculated for the frame signals corresponding to a preset averaging time T (that is, x_i is the average of a plurality of x_i(t) for different t).
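As a small sketch of this averaging step, the function below takes the per-frame input powers of one microphone and averages them per frequency. The name `time_average_power` and the list-of-lists layout (one inner list per frame) are assumptions made for illustration.

```python
from typing import List, Sequence


def time_average_power(frames: Sequence[Sequence[float]]) -> List[float]:
    """Average x_i(t) over all frames t, separately for each number i.

    frames[t][i] holds the input power x_i(t); the frames passed in are
    assumed to be exactly those that fall within the averaging time T.
    """
    if not frames:
        raise ValueError("at least one frame is required")
    num_frames = len(frames)
    num_bins = len(frames[0])
    return [sum(frame[i] for frame in frames) / num_frames
            for i in range(num_bins)]
```

For two frames `[[1.0, 2.0], [3.0, 6.0]]` the result is `[2.0, 4.0]`.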
- The correction function estimation unit 14 estimates a correction function that defines the relationship between a frequency and the correction coefficient f_i for bringing the time average power x_i, calculated by the time average power calculation unit 13 for that frequency, close to the reference power y_i determined for that frequency.
- As the reference power y_i, the correction function estimation unit 14 uses the time average power x_i calculated by the time average power calculation unit 13 for one microphone MCr (r is an integer from 1 to L) that is predetermined as the reference microphone among the microphones MC1 to MCL.
- The correction function estimation unit 14 calculates the matrix A based on formula (1).
- As the variable x_i in each element of the matrix A in formula (1), the correction function estimation unit 14 uses the time average power x_i calculated by the time average power calculation unit 13 for the microphone MCk.
- M is the order of the correction function. It is a preset value, preferably from 0 to 20.
- The correction function estimation unit 14 calculates the vector b based on equation (2).
- As the variable y_i in each element of the vector b in equation (2), the correction function estimation unit 14 uses the time average power (reference power) calculated by the time average power calculation unit 13 for the reference microphone MCr.
- The correction function estimation unit 14 calculates the vector a based on the calculated matrix A, the calculated vector b, and equation (3).
- The vector a is a = (a_1, a_2, ..., a_M)^T.
- The correction function estimation unit 14 calculates the correction coefficient f_i for each frequency based on the calculated vector a and equation (4).
- Equation (4) represents a correction function that is a polynomial function whose variable is the number i corresponding to the frequency (that is, the frequency). Calculating the vector a therefore corresponds to estimating the correction function.
- The correction function storage unit 15 stores, in the storage device, the correction coefficient f_i calculated by the correction function estimation unit 14 and the number i corresponding to the frequency in association with each other.
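The estimation just described can be illustrated with a least-squares polynomial fit. The sketch below is not the patent's own formulation: it assumes the correction function has the form f_i = a[0] + a[1]·i + a[2]·i² + ..., builds the corresponding normal equations from the time average powers, and solves them with plain Gaussian elimination. All function names and the 0-based coefficient indexing are hypothetical.

```python
from typing import List, Sequence


def solve_linear(A: List[List[float]], b: List[float]) -> List[float]:
    """Solve A a = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    aug = [row[:] + [rhs] for row, rhs in zip(A, b)]  # augmented copy
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("normal equations are singular")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    a = [0.0] * n
    for r in range(n - 1, -1, -1):  # back substitution
        tail = sum(aug[r][c] * a[c] for c in range(r + 1, n))
        a[r] = (aug[r][n] - tail) / aug[r][r]
    return a


def estimate_correction(x_avg: Sequence[float], y_ref: Sequence[float],
                        num_coeffs: int) -> List[float]:
    """Fit a so that sum_i (f_i * x_i - y_i)^2 is minimal (assumed form)."""
    bins = range(len(x_avg))
    A = [[sum(x_avg[i] ** 2 * i ** (m + n) for i in bins)
          for n in range(num_coeffs)] for m in range(num_coeffs)]
    b = [sum(x_avg[i] * y_ref[i] * i ** m for i in bins)
         for m in range(num_coeffs)]
    return solve_linear(A, b)


def correction_coefficients(a: Sequence[float], num_bins: int) -> List[float]:
    """Evaluate the fitted polynomial f_i for i = 0 .. num_bins - 1."""
    return [sum(a_m * i ** m for m, a_m in enumerate(a))
            for i in range(num_bins)]
```

Because the fit uses a small number of polynomial coefficients across all frequencies, a single frequency with an outlying power ratio shifts the whole curve only slightly instead of producing an extreme coefficient at that frequency. For a real spectrum with hundreds of bins and a high order, the powers of i grow very large, so scaling i before fitting would likely be needed; that refinement is left out here.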
- The input power correction unit 12 corrects the input power x_i(t) calculated by the input power calculation unit 11 based on equation (5). That is, for each frequency, it multiplies the input power x_i(t) by the correction coefficient f_i stored by the correction function storage unit 15, and outputs the corrected input power x'_i(t).
- Equations (1) to (3) are obtained by deriving the vector a that minimizes the sum, over a predetermined frequency range (in this example, the range corresponding to all the numbers i corresponding to the frequencies), of the squared difference between the corrected input power x'_i and the time average power (reference power) y_i calculated by the time average power calculation unit 13 for the reference microphone MCr.
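The formulas themselves do not appear in this text. As a hedged reconstruction from the description alone (a polynomial correction function in i with coefficient vector a, and a least-squares criterion on the corrected power), one self-consistent form is shown below. The exponent convention m - 1 and the exact layout of A and b are assumptions and may differ from the patent's own formulas (1) to (5).

```latex
\begin{align}
f_i &= \sum_{m=1}^{M} a_m\, i^{\,m-1}, \qquad x'_i = f_i\, x_i, \\
E(\mathbf{a}) &= \sum_i \bigl( x'_i - y_i \bigr)^2
  = \sum_i \Bigl( x_i \sum_{m=1}^{M} a_m\, i^{\,m-1} - y_i \Bigr)^2, \\
\frac{\partial E}{\partial a_n} = 0
  &\;\Longrightarrow\;
  \sum_{m=1}^{M} \Bigl( \sum_i x_i^2\, i^{\,m+n-2} \Bigr) a_m
  = \sum_i x_i\, y_i\, i^{\,n-1}, \qquad n = 1, \dots, M, \\
A_{nm} &= \sum_i x_i^2\, i^{\,m+n-2}, \qquad
b_n = \sum_i x_i\, y_i\, i^{\,n-1}, \qquad
\mathbf{a} = A^{-1}\mathbf{b}.
\end{align}
```

Under this reading, A and b depend only on the time average powers x_i and y_i, which matches the statement that the variables in the matrix A and the vector b are the time average powers for the microphone MCk and the reference microphone MCr.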
- The utterance voice detection unit 16 performs utterance voice detection processing that determines whether or not the voice represented by the audio signal received via the microphone MCk is an utterance voice.
- The utterance voice detection unit 16 includes a noise power acquisition unit (noise power acquisition means) 16a and a signal-to-noise ratio acquisition unit (signal-to-noise ratio acquisition means) 16b.
- The noise power acquisition unit 16a acquires, for each frequency, the noise power N_i(t) representing the magnitude of noise in the voice represented by the audio signal received via the microphone MCk.
- Specifically, for each frequency, when the input power x'_i(t) output by the input power correction unit 12 for the microphone MCk is the maximum among the input powers x'_i(t) output by the input power correction unit 12 for the microphones MC1 to MCL, the noise power acquisition unit 16a acquires, as the noise power N_i(t) for the microphone MCk, the minimum of the input powers x'_i(t) output by the input power correction unit 12 for the microphones MC1 to MCL.
- On the other hand, when the input power x'_i(t) output by the input power correction unit 12 for the microphone MCk is not the maximum among the input powers x'_i(t) output for the microphones MC1 to MCL, the noise power acquisition unit 16a acquires, as the noise power N_i(t) for the microphone MCk, the input power x'_i(t) output by the input power correction unit 12 for the microphone MCk.
- In other words, for each frequency, the noise power acquisition unit 16a acquires, as the noise power N_i(t) for the microphone that received the audio signal from which the maximum input power x'_i(t) was calculated (the maximum power microphone), the minimum of the input powers x'_i(t) output by the input power correction unit 12 for the microphones MC1 to MCL.
- Likewise, for each frequency, the noise power acquisition unit 16a acquires, as the noise power N_i(t) for each microphone other than the maximum power microphone, the input power x'_i(t) output by the input power correction unit 12 for that microphone.
- The utterance voice detection device 1 is thus configured so that the signal-to-noise ratio SNR(t) for the maximum power microphone is much larger than the signal-to-noise ratio SNR(t) for the other microphones.
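The noise power selection described above can be sketched per frequency as follows. The function name, the `corrected[k][i]` layout (microphone index k, frequency number i), and the handling of ties (the first microphone holding the maximum is treated as the maximum power microphone) are assumptions for illustration.

```python
from typing import List, Sequence


def acquire_noise_power(corrected: Sequence[Sequence[float]]) -> List[List[float]]:
    """Return N_i(t) for every microphone and frequency of one frame.

    corrected[k][i] is the corrected input power x'_i(t) for microphone k.
    """
    num_mics = len(corrected)
    num_bins = len(corrected[0])
    # Default: each microphone's noise power is its own corrected power.
    noise = [list(row) for row in corrected]
    for i in range(num_bins):
        column = [corrected[k][i] for k in range(num_mics)]
        k_max = column.index(max(column))  # assumed tie rule: first maximum
        # The maximum power microphone gets the minimum power at this bin.
        noise[k_max][i] = min(column)
    return noise
```

With three microphones and two frequencies, `[[9.0, 1.0], [2.0, 4.0], [3.0, 2.0]]` yields `[[2.0, 1.0], [2.0, 1.0], [3.0, 2.0]]`: the per-frequency ratio x'/N exceeds 1 only at the microphone holding the maximum and is exactly 1 elsewhere.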
- The signal-to-noise ratio acquisition unit 16b calculates the signal-to-noise ratio SNR_i(t) for each frequency by dividing, for each frequency, the input power x'_i(t) output by the input power correction unit 12 by the noise power N_i(t) acquired by the noise power acquisition unit 16a. Furthermore, as the signal-to-noise ratio SNR(t) that is representative of the calculated per-frequency signal-to-noise ratios SNR_i(t), the signal-to-noise ratio acquisition unit 16b acquires the sum of the calculated SNR_i(t) over a predetermined frequency range (in this example, the range corresponding to all the numbers i corresponding to the frequencies).
- The signal-to-noise ratio acquisition unit 16b may instead be configured to acquire the maximum value of the calculated per-frequency signal-to-noise ratios SNR_i(t) as the signal-to-noise ratio SNR(t).
- When the signal-to-noise ratio SNR(t) acquired by the signal-to-noise ratio acquisition unit 16b is larger than a preset threshold, the utterance voice detection unit 16 determines that the voice represented by the audio signal received via the microphone MCk is an utterance voice. On the other hand, when the signal-to-noise ratio SNR(t) acquired by the signal-to-noise ratio acquisition unit 16b is smaller than the threshold, the utterance voice detection unit 16 determines that the voice represented by the audio signal received via the microphone MCk is not an utterance voice.
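A minimal sketch of the signal-to-noise ratio and threshold decision for one microphone follows. The function names, the small `eps` guard against division by zero, and treating a ratio exactly equal to the threshold as "not an utterance voice" are assumptions; the text only covers the larger-than and smaller-than cases.

```python
from typing import Sequence


def total_snr(corrected: Sequence[float], noise: Sequence[float],
              eps: float = 1e-12) -> float:
    """Sum SNR_i(t) = x'_i(t) / N_i(t) over all frequency numbers i."""
    # eps is an assumed guard so a zero noise power cannot divide by zero.
    return sum(x / max(n, eps) for x, n in zip(corrected, noise))


def is_utterance(corrected: Sequence[float], noise: Sequence[float],
                 threshold: float) -> bool:
    """Decide utterance voice when SNR(t) is larger than the threshold."""
    return total_snr(corrected, noise) > threshold
```

For a microphone with corrected powers `[9.0, 1.0]` and noise powers `[2.0, 1.0]`, the summed ratio is 5.5, so a threshold of 3.0 gives an utterance decision. A microphone whose noise power equals its own corrected power at both frequencies sums to 2.0 and is rejected with the same threshold.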
- The CPU of the utterance voice detection device 1 executes the utterance voice detection program shown by the flowchart in FIG. 2 every time a predetermined calculation cycle elapses.
- Specifically, when the CPU starts processing the utterance voice detection program, it accepts the audio signals input via the microphones MC1 to MCL in step 205. The CPU then performs, for each of the microphones MC1 to MCL, input power calculation processing that divides the received audio signal at the frame interval and calculates the input power x_i(t) for each portion (frame signal) of the divided audio signal (input power calculation step).
- Next, in step 210, the CPU determines whether or not the received audio signal is an audio signal representing white noise. The description first continues on the assumption that the received audio signal is an audio signal representing white noise.
- In this case, the utterance voice detection device 1 performs correction function estimation processing that estimates a correction function for each of the microphones MC1 to MCL (that is, processing that updates the correction coefficients f_i stored in the storage device).
- Specifically, the CPU determines "Yes" and proceeds to step 215. There, the CPU performs, for each of the microphones MC1 to MCL, time average power calculation processing that calculates, for each frequency, the time average power x_i by averaging the input powers x_i(t) calculated in step 205 (that is, the input powers calculated for each portion of the audio signal divided at the frame interval) over the frame signals corresponding to the averaging time T (time average power calculation step).
- Next, in step 220, the CPU performs, for each of the microphones MC1 to MCL, correction function estimation processing that estimates a correction function based on the time average power x_i calculated for the microphone MCk and the time average power y_i calculated for the reference microphone MCr. Specifically, the CPU performs processing that calculates the vector a based on equations (1) to (3) for each of the microphones MC1 to MCL (correction function estimation step).
- Then, in step 225, the CPU performs, for each of the microphones MC1 to MCL, processing that calculates the correction coefficients f_i based on the calculated vector a. If correction coefficients f_i are already stored in the storage device, the CPU updates the stored correction coefficients f_i to the calculated ones. If no correction coefficients f_i are stored in the storage device (that is, the correction coefficients f_i have been calculated for the first time), the CPU newly stores the calculated correction coefficients f_i in the storage device.
- When the received audio signal is not an audio signal representing white noise, the utterance voice detection device 1 performs input power correction processing that corrects the input power of the audio signal received via the microphone MCk, for each of the microphones MC1 to MCL.
- Specifically, the CPU determines "No" in step 210 and proceeds to step 230. There, the CPU performs, for each of the microphones MC1 to MCL, input power correction processing that corrects the input power x_i(t) calculated in step 205 by multiplying it, for each frequency (that is, for each number i corresponding to the frequency), by the correction coefficient f_i stored in the storage device (input power correction step). The CPU then outputs the corrected input power x'_i(t).
- Next, in step 235, the CPU performs noise power acquisition processing that acquires the noise power N_i(t) based on the input powers x'_i(t) output for the microphones MC1 to MCL (noise power acquisition step).
- Specifically, for each frequency, the CPU acquires, as the noise power N_i(t) for the microphone that received the audio signal from which the maximum of the input powers x'_i(t) output for the microphones MC1 to MCL was calculated (the maximum power microphone), the minimum of the input powers x'_i(t) output for the microphones MC1 to MCL.
- In addition, for each frequency, the CPU acquires, as the noise power N_i(t) for each microphone other than the maximum power microphone, the input power x'_i(t) output for that microphone.
- For example, when the microphone MC2 is the maximum power microphone and the input power x'_i(t) output for the microphone MC1 is the minimum, the CPU acquires the input power x'_i(t) output for the microphone MC1 as the noise power N_i(t) for the microphone MC1. The CPU also acquires the input power x'_i(t) output for the microphone MC1 as the noise power N_i(t) for the microphone MC2. Further, the CPU acquires the input power x'_i(t) output for the microphone MCk as the noise power N_i(t) for the microphone MCk. In this way, the CPU acquires the noise power N_i(t) for each of the microphones MC1 to MCL for each frequency.
- Next, in step 240, the CPU performs, for each of the microphones MC1 to MCL, processing that calculates the signal-to-noise ratio SNR_i(t) for each frequency by dividing, for each frequency, the output input power x'_i(t) by the acquired noise power N_i(t).
- The CPU further performs, for each of the microphones MC1 to MCL, processing that acquires, as the signal-to-noise ratio SNR(t), the sum of the calculated signal-to-noise ratios SNR_i(t) over a predetermined frequency range (in this example, the range corresponding to all the numbers i corresponding to the frequencies) (signal-to-noise ratio acquisition step).
- Then, in step 245, the CPU performs, for each of the microphones MC1 to MCL, utterance voice detection processing that determines whether or not the voice represented by the audio signal received via the microphone MCk is an utterance voice by determining whether or not the acquired signal-to-noise ratio SNR(t) is larger than a preset threshold (utterance voice detection step).
- Here, the CPU determining that the signal-to-noise ratio SNR(t) is larger than the threshold corresponds to the CPU determining that the voice represented by the audio signal received via the microphone MCk corresponding to that signal-to-noise ratio SNR(t) is an utterance voice.
- The utterance voice detection device 1 estimates a correction function that defines the relationship between the frequency and the correction coefficient f_i, and corrects the input power of the received audio signal (the input power representing the magnitude of the voice represented by the audio signal) by multiplying it by the correction coefficient f_i set based on the estimated correction function.
- The input power of the audio signal can thereby be brought close to the reference power with high accuracy. As a result, it is possible to determine with high accuracy whether or not the input voice is an utterance voice (a voice uttered by the user).
- The correction function is a polynomial function with frequency as its variable. Accordingly, adjusting the order M of the polynomial function adjusts how smoothly the correction coefficient f_i changes with respect to changes in frequency.
- The utterance voice detection device 1 uses, as the reference power y_i(t), the input power x_i(t) calculated for the reference microphone MCr, which is one of the microphones MC1 to MCL.
- Accordingly, the input power x_i(t) of the audio signal received by each of the microphones MC1 to MCL can be brought close to the input power (reference power) y_i(t) of the audio signal received by the reference microphone MCr.
- The utterance voice detection device 1 is configured to estimate the correction function based on the time average power x_i obtained by averaging the input powers x_i(t) calculated for a plurality of frame signals.
- This increases the degree of agreement between the sounds underlying the audio signals from which the time average power for each microphone MCk and the time average power for the reference microphone MCr are calculated.
- As a result, the input power of the audio signal received by each microphone MCk can be brought sufficiently close to the reference power (the time average power calculated for the reference microphone MCr).
- That is, the input power x_i(t) of the audio signal received by each microphone MCk can be brought close to the reference power y_i(t) with higher accuracy.
- An utterance voice detection device according to the second embodiment of the present invention will be described with reference to FIG.
- The functions of the utterance voice detection device 1 according to the second embodiment include a voice reception unit (voice receiving means) 18, an input power calculation unit (input power calculation means) 11, an input power correction unit (input power correction means) 12, a correction function estimation unit (correction function estimation means) 14, and an utterance voice detection unit (utterance voice detection means) 16.
- The voice reception unit 18 receives an input audio signal. Based on the audio signal received by the voice reception unit 18, the input power calculation unit 11 performs input power calculation processing that calculates, for each frequency, input power representing the magnitude of the voice represented by the audio signal.
- The correction function estimation unit 14 performs correction function estimation processing that estimates a correction function, the correction function being a continuous function that defines the relationship between a frequency and a correction coefficient for bringing the input power calculated by the input power calculation unit 11 for that frequency close to the reference power determined for that frequency.
- The input power correction unit 12 performs input power correction processing that corrects the input power by multiplying, for each frequency, the input power calculated by the input power calculation unit 11 by the correction coefficient acquired according to the relationship defined by the correction function estimated by the correction function estimation unit 14.
- The utterance voice detection unit 16 performs utterance voice detection processing that determines, based on the input power corrected by the input power correction unit 12, whether or not the voice represented by the audio signal received by the voice reception unit 18 is an utterance voice.
- the utterance voice detection device 1 estimates a correction function that defines the relationship between the frequency and the correction coefficient, and the voice represented by the received voice signal represents the correction coefficient set based on the estimated correction function.
- the input power is corrected by multiplying the input power (the input power of the audio signal) representing the magnitude of the input signal.
- the input power of the audio signal can be brought close to the reference power with high accuracy. As a result, it is possible to determine with high accuracy whether or not the input voice is an uttered voice (voice uttered by the user).
- The correction function is preferably a polynomial function with frequency as its variable.
- Accordingly, by adjusting the degree of the polynomial function, the smoothness of the change in the correction coefficient with respect to a change in frequency can be adjusted.
- The correction function estimation means is preferably configured to estimate the correction function that minimizes the sum, over a predetermined frequency range, of the squared difference between the corrected input power and the reference power.
- The utterance voice detection means preferably includes noise power acquisition means for acquiring, for each frequency, noise power representing the magnitude of the noise in the voice represented by the voice signal received by the voice reception means, and signal-to-noise ratio acquisition means for calculating a per-frequency signal-to-noise ratio by dividing, for each frequency, the corrected input power by the acquired noise power and acquiring a signal-to-noise ratio that is a value representative of the calculated per-frequency signal-to-noise ratios. It is preferably configured to determine that the voice represented by the received voice signal is an uttered voice when the acquired signal-to-noise ratio is larger than a preset threshold.
- The signal-to-noise ratio acquisition means is preferably configured to acquire, as the signal-to-noise ratio, the sum of the calculated per-frequency signal-to-noise ratios over a predetermined frequency range.
- The signal-to-noise ratio acquisition means is preferably configured to acquire, as the signal-to-noise ratio, the maximum of the calculated per-frequency signal-to-noise ratios.
- The utterance voice detection device preferably includes a plurality of the voice reception means, wherein the input power calculation means is configured to perform the input power calculation processing for each of the plurality of voice reception means, the correction function estimation means is configured to perform the correction function estimation processing for each of the plurality of voice reception means, and the input power correction means is configured to perform the input power correction processing for each of the plurality of voice reception means.
- The utterance voice detection means is preferably configured to perform the utterance voice detection processing for each of the plurality of voice reception means and, for each frequency, to use the minimum of the input powers corrected by the input power correction means for the respective voice reception means as the noise power for the voice reception means that received the voice signal from which the maximum of those corrected input powers was calculated.
- The utterance voice detection means is preferably configured so that, for each frequency, the noise power for every voice reception means other than the one that received the voice signal from which the maximum of the corrected input powers was calculated is the input power corrected by the input power correction means for that voice reception means.
- A voice uttered toward a first voice reception means, which is one of the plurality of voice reception means, may also be input to a second voice reception means, which is another of the plurality of voice reception means.
- In this case, the signal-to-noise ratio of the voice input via the second voice reception means is smaller than that of the voice input via the first voice reception means, so a determination of whether or not the voice is an uttered voice that is based on the voice input via the second voice reception means cannot be made with high accuracy.
- In contrast, the utterance voice detection device configured as described above makes the signal-to-noise ratio for the voice reception means that received the voice signal from which the maximum input power was calculated much larger than the signal-to-noise ratios for the other voice reception means.
- That is, the determination can rely on the voice reception means that received the voice signal from which the maximum input power was calculated. Therefore, whether or not the input voice is an uttered voice can be determined with high accuracy.
- The correction function estimation means is preferably configured to use, as the reference power, the input power calculated by the input power calculation means for one of the plurality of voice reception means.
- Accordingly, the input power of the voice signal received by each of the plurality of voice reception means can be brought close to the input power (reference power) of the voice signal received by one of the plurality of voice reception means (the reference voice reception means).
- In this case, the input power calculation means is preferably configured to divide the voice signal received by the voice reception means at predetermined frame intervals and to calculate the input power for each frequency for each of the divided parts.
- The utterance voice detection device preferably includes time-average power calculation means for performing, for each of the plurality of voice reception means, time-average power calculation processing that calculates time-average power by averaging the input powers calculated by the input power calculation means for the respective parts of the voice signal.
- The correction function estimation means is preferably configured to perform, for each of the plurality of voice reception means, the correction function estimation processing that estimates the correction function defining the relationship between a frequency and a correction coefficient for bringing the time-average power calculated for that frequency closer to the time-average power calculated for that frequency by the time-average power calculation means for one of the plurality of voice reception means.
- When the distance from the sound source emitting the sound underlying the voice signal to each of the plurality of voice reception means (for example, microphones) differs relatively greatly, the delay time associated with sound propagation from the sound source to each voice reception means also differs relatively greatly.
- In that case, the sound underlying the first voice signal received at a given time by a first voice reception means, which is one of the plurality of voice reception means, differs from the sound underlying the second voice signal received at that time by a second voice reception means, which is another of the plurality of voice reception means.
- Likewise, when the time required to transmit the voice signal from the first voice reception means to the signal correction device and the time required to transmit the voice signal from the second voice reception means to the signal correction device differ relatively greatly, the sound underlying the first voice signal that the signal correction device receives via the first voice reception means differs from the sound underlying the second voice signal that it receives via the second voice reception means.
- Therefore, if the utterance voice detection device were configured to estimate the correction function based only on the voice signal at a single point in time, the input power of the voice signal received by the first voice reception means could not be brought sufficiently close to the input power (reference power) of the voice signal received by the second voice reception means.
- In contrast, the above configuration increases the degree of matching between the sounds underlying the voice signals from which the time-average power for the first voice reception means and the time-average power for the second voice reception means are calculated. As a result, correcting the input power of the voice signal received by the first voice reception means brings that input power sufficiently close to the reference power (the time-average power calculated for the second voice reception means).
- That is, the input power of the voice signal received by the first voice reception means can be brought closer to the reference power with higher accuracy.
- The correction function estimation means is preferably configured to use, as the reference power, the average power obtained by averaging the input powers calculated by the input power calculation means for the respective voice reception means.
- In this case, the input power calculation means is preferably configured to divide the voice signal received by the voice reception means at predetermined frame intervals and to calculate the input power for each frequency for each of the divided parts.
- The utterance voice detection device preferably includes time-average power calculation means for performing, for each of the plurality of voice reception means, time-average power calculation processing that calculates time-average power by averaging the input powers calculated by the input power calculation means for the respective parts of the voice signal.
- The correction function estimation means is preferably configured to perform, for each of the plurality of voice reception means, the correction function estimation processing that estimates the correction function defining the relationship between a frequency and a correction coefficient for bringing the time-average power calculated for that frequency closer to the average time-average power, which is obtained by averaging the time-average powers calculated for that frequency by the time-average power calculation means for the respective voice reception means.
- Accordingly, the degree of matching can be increased between the sounds underlying the voice signals from which the time-average power for the first voice reception means, which is one of the plurality of voice reception means, and the average time-average power obtained by averaging the time-average powers for the respective voice reception means are calculated.
- As a result, correcting the input power of the voice signal received by the first voice reception means brings that input power sufficiently close to the reference power (the average time-average power obtained by averaging the time-average powers calculated for the respective voice reception means).
- That is, the input power of the voice signal received by the first voice reception means can be brought closer to the reference power with higher accuracy.
- The correction function estimation means is preferably configured to use a value stored in advance as the reference power.
- The correction function estimation means is preferably configured to estimate the correction function when the voice represented by the voice signal received by the voice reception means is white noise.
- The utterance voice detection method is a method that performs input power calculation processing that calculates, for each frequency, input power representing the magnitude of the voice represented by a voice signal, based on the voice signal received by voice reception means that receives an input voice signal; performs correction function estimation processing that estimates a correction function, which is a continuous function defining the relationship between a frequency and a correction coefficient for bringing the input power calculated for that frequency closer to a reference power determined for that frequency; performs input power correction processing that corrects the input power by multiplying, for each frequency, the calculated input power by the correction coefficient obtained according to the relationship defined by the estimated correction function; and performs utterance voice detection processing that determines, based on the corrected input power, whether or not the voice represented by the received voice signal is an uttered voice.
- In this case, the correction function is preferably a polynomial function with frequency as its variable.
- The utterance voice detection method is preferably configured to estimate the correction function that minimizes the sum, over a predetermined frequency range, of the squared difference between the corrected input power and the reference power.
- The utterance voice detection method is preferably configured to acquire, for each frequency, noise power representing the magnitude of the noise in the voice represented by the voice signal received by the voice reception means; to calculate a per-frequency signal-to-noise ratio by dividing, for each frequency, the corrected input power by the acquired noise power, and acquire a signal-to-noise ratio that is a value representative of the calculated per-frequency signal-to-noise ratios; and to determine that the voice represented by the received voice signal is an uttered voice when the acquired signal-to-noise ratio is larger than a preset threshold.
- The utterance voice detection program causes an information processing device to implement: input power calculation means for performing input power calculation processing that calculates, for each frequency, input power representing the magnitude of the voice represented by a voice signal, based on the voice signal received by voice reception means that receives an input voice signal; correction function estimation means for performing correction function estimation processing that estimates a correction function, which is a continuous function defining the relationship between a frequency and a correction coefficient for bringing the input power calculated for that frequency closer to a reference power determined for that frequency; input power correction means for performing input power correction processing that corrects the input power by multiplying, for each frequency, the calculated input power by the correction coefficient obtained according to the relationship defined by the estimated correction function; and utterance voice detection means for performing utterance voice detection processing that determines, based on the corrected input power, whether or not the voice represented by the received voice signal is an uttered voice.
- In this case, the correction function is preferably a polynomial function with frequency as its variable.
- The correction function estimation means is preferably configured to estimate the correction function that minimizes the sum, over a predetermined frequency range, of the squared difference between the corrected input power and the reference power.
- The utterance voice detection means preferably includes noise power acquisition means for acquiring, for each frequency, noise power representing the magnitude of the noise in the voice represented by the voice signal received by the voice reception means, and signal-to-noise ratio acquisition means for calculating a per-frequency signal-to-noise ratio by dividing, for each frequency, the corrected input power by the acquired noise power and acquiring a signal-to-noise ratio that is a value representative of the calculated per-frequency signal-to-noise ratios. It is preferably configured to determine that the voice represented by the received voice signal is an uttered voice when the acquired signal-to-noise ratio is larger than a preset threshold.
- The inventions of the utterance voice detection method and the utterance voice detection program configured as described above operate in the same way as the utterance voice detection device described above, and can therefore achieve the above-described object of the present invention.
- The correction function estimation unit 14 may be configured to use, as the reference power y_i, the average time-average power obtained by averaging the time-average powers x_i calculated by the time-average power calculation unit 13 for each of the plurality of microphones MC1 to MCL.
- The correction function estimation unit 14 may be configured to use a value stored in advance in the storage device as the reference power y_i.
- Although the correction function estimation unit 14 is configured to estimate the correction function when the received voice signal represents white noise, the correction function may be estimated under other conditions.
- The program is stored in the storage device, but it may instead be stored on a computer-readable recording medium.
- The recording medium is, for example, a portable medium such as a flexible disk, an optical disk, a magneto-optical disk, or a semiconductor memory.
- The present invention is applicable to, for example, an utterance voice detection system that includes a plurality of microphones and determines whether or not the voice input via each microphone is an uttered voice.
Landscapes
- Engineering & Computer Science (AREA)
- Computational Linguistics (AREA)
- Signal Processing (AREA)
- Health & Medical Sciences (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Human Computer Interaction (AREA)
- Physics & Mathematics (AREA)
- Acoustics & Sound (AREA)
- Multimedia (AREA)
- Circuit For Audible Band Transducer (AREA)
Abstract
Description
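The following are provided:
voice reception means for receiving an input voice signal;
input power calculation means for performing input power calculation processing that calculates, for each frequency, input power representing the magnitude of the voice represented by the voice signal, based on the voice signal received by the voice reception means;
correction function estimation means for performing correction function estimation processing that estimates a correction function, which is a continuous function defining the relationship between a frequency and a correction coefficient for bringing the input power calculated for that frequency closer to a reference power determined for that frequency;
input power correction means for performing input power correction processing that corrects the input power by multiplying, for each frequency, the calculated input power by the correction coefficient obtained according to the relationship defined by the estimated correction function; and
utterance voice detection means for performing utterance voice detection processing that determines, based on the corrected input power, whether or not the voice represented by the received voice signal is an uttered voice.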
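The method performs input power calculation processing that calculates, for each frequency, input power representing the magnitude of the voice represented by a voice signal, based on the voice signal received by voice reception means that receives an input voice signal,
performs correction function estimation processing that estimates a correction function, which is a continuous function defining the relationship between a frequency and a correction coefficient for bringing the input power calculated for that frequency closer to a reference power determined for that frequency,
performs input power correction processing that corrects the input power by multiplying, for each frequency, the calculated input power by the correction coefficient obtained according to the relationship defined by the estimated correction function,
and performs utterance voice detection processing that determines, based on the corrected input power, whether or not the voice represented by the received voice signal is an uttered voice.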
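The program causes an information processing device to implement:
input power calculation means for performing input power calculation processing that calculates, for each frequency, input power representing the magnitude of the voice represented by a voice signal, based on the voice signal received by voice reception means that receives an input voice signal;
correction function estimation means for performing correction function estimation processing that estimates a correction function, which is a continuous function defining the relationship between a frequency and a correction coefficient for bringing the input power calculated for that frequency closer to a reference power determined for that frequency;
input power correction means for performing input power correction processing that corrects the input power by multiplying, for each frequency, the calculated input power by the correction coefficient obtained according to the relationship defined by the estimated correction function; and
utterance voice detection means for performing utterance voice detection processing that determines, based on the corrected input power, whether or not the voice represented by the received voice signal is an uttered voice.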
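As shown in FIG. 1, the utterance voice detection device 1 according to the first embodiment is an information processing device. The utterance voice detection device 1 includes a central processing unit (CPU), a storage device (memory and a hard disk drive (HDD)), and an input device, none of which are shown.
The CPU of the utterance voice detection device 1 executes the utterance voice detection program, shown as a flowchart in FIG. 2, each time a predetermined calculation cycle elapses.
The description now continues on the assumption that the received voice signal represents white noise. In this case, the utterance voice detection device 1 performs correction function estimation processing, which estimates a correction function for each of the plurality of microphones MC1 to MCL (processing that updates the correction coefficients f_i stored in the storage device).
In this way, the CPU acquires, for each frequency, the noise power N_i(t) for each of the plurality of microphones MC1 to MCL.
Accordingly, by adjusting the degree M of the polynomial function, the smoothness of the change in the correction coefficient f_i with respect to a change in frequency can be adjusted.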
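To illustrate how a polynomial correction function of degree M yields one correction coefficient per frequency, the following minimal Python sketch evaluates the polynomial and multiplies the input power by the result. The function names, the coefficient list `coeffs` (lowest order first), and the use of plain lists are assumptions made for this example and do not come from the specification.

```python
def correction_coefficients(coeffs, freqs):
    """Evaluate f(w) = c0 + c1*w + ... + cM*w**M at each frequency.

    coeffs: hypothetical polynomial coefficients [c0, ..., cM]; M = len(coeffs) - 1.
    freqs:  frequencies at which a correction coefficient is needed.
    """
    values = []
    for w in freqs:
        acc = 0.0
        for c in reversed(coeffs):  # Horner's rule
            acc = acc * w + c
        values.append(acc)
    return values


def correct_input_power(power, coeffs, freqs):
    """Multiply each per-frequency input power by its correction coefficient."""
    factors = correction_coefficients(coeffs, freqs)
    return [f * x for f, x in zip(factors, power)]
```

A low degree M gives a coefficient that varies slowly with frequency, while a higher degree lets it follow finer differences between frequencies.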
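Next, an utterance voice detection device according to a second embodiment of the present invention will be described with reference to FIG. 4.
The functions of the utterance voice detection device 1 according to the second embodiment include a voice reception unit (voice reception means) 18, an input power calculation unit (input power calculation means) 11, an input power correction unit (input power correction means) 12, a correction function estimation unit (correction function estimation means) 14, and an utterance voice detection unit (utterance voice detection means) 16.
Based on the voice signal received by the voice reception unit 18, the input power calculation unit 11 performs input power calculation processing that calculates, for each frequency, input power representing the magnitude of the voice represented by the voice signal.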
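The text here does not say how the per-frequency input power is obtained. One common choice, assumed in the sketch below, is the squared magnitude of the discrete Fourier transform of a single frame. The function name `input_power` and the direct (non-FFT) summation are illustrative only.

```python
import cmath
import math


def input_power(frame):
    """Return |X_k|**2 for k = 0 .. N/2, where X is the DFT of one frame.

    Assumption: 'input power for each frequency' is taken to be the squared
    DFT magnitude; the specification excerpt does not fix this choice.
    """
    n = len(frame)
    powers = []
    for k in range(n // 2 + 1):
        # Direct DFT sum; fine for a short illustrative frame.
        x_k = sum(x * cmath.exp(-2j * math.pi * k * t / n)
                  for t, x in enumerate(frame))
        powers.append(abs(x_k) ** 2)
    return powers
```

For a frame containing a single cosine that fits exactly one period, almost all of the power appears in bin 1.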
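are provided.
The correction function estimation means is preferably configured to estimate the correction function that minimizes the sum, over a predetermined frequency range, of the squared difference between the corrected input power and the reference power.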
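The least-squares criterion above can be sketched as follows, assuming the correction function is a polynomial in frequency. Because the corrected power f(w_i) * x_i is linear in the polynomial coefficients, the minimum can be found from the normal equations. The function name, the argument layout, and the small Gaussian-elimination solver are choices made for this illustration, not the procedure given in the specification.

```python
def fit_correction_function(freqs, power, reference, degree):
    """Return [c0, ..., cM] minimising sum_i (f(w_i) * x_i - y_i)**2,
    with f(w) = c0 + c1*w + ... + cM*w**M and M = degree.

    freqs, power, reference: w_i, x_i, y_i over the chosen frequency range.
    Raises ZeroDivisionError if the normal equations are singular.
    """
    m = degree + 1
    # Row i of the design matrix: x_i * w_i**k for k = 0..M.
    rows = [[x * w ** k for k in range(m)] for w, x in zip(freqs, power)]
    # Augmented normal equations [A^T A | A^T y].
    aug = [[sum(r[a] * r[b] for r in rows) for b in range(m)]
           + [sum(r[a] * y for r, y in zip(rows, reference))]
           for a in range(m)]
    # Gaussian elimination with partial pivoting.
    for col in range(m):
        pivot = max(range(col, m), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, m):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, m + 1):
                aug[r][c] -= factor * aug[col][c]
    # Back substitution.
    coeffs = [0.0] * m
    for r in reversed(range(m)):
        tail = sum(aug[r][c] * coeffs[c] for c in range(r + 1, m))
        coeffs[r] = (aug[r][m] - tail) / aug[r][r]
    return coeffs
```

For larger degrees or wide frequency ranges, a numerically more robust solver (for example a QR-based least-squares routine) would usually be preferable to the normal equations.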
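Preferably, noise power acquisition means for acquiring, for each frequency, noise power representing the magnitude of the noise in the voice represented by the voice signal received by the voice reception means, and
signal-to-noise ratio acquisition means for calculating a per-frequency signal-to-noise ratio by dividing, for each frequency, the corrected input power by the acquired noise power and acquiring a signal-to-noise ratio that is a value representative of the calculated per-frequency signal-to-noise ratios, are included, and
the configuration is such that the voice represented by the received voice signal is determined to be an uttered voice when the acquired signal-to-noise ratio is larger than a preset threshold.
The signal-to-noise ratio acquisition means is preferably configured to acquire, as the signal-to-noise ratio, the maximum of the calculated per-frequency signal-to-noise ratios.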
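The decision rule described above can be sketched in a few lines. The function name `is_uttered_voice` and the `mode` parameter are hypothetical; `mode` selects between the two representative values mentioned in the document (sum over the frequency range, or maximum).

```python
def is_uttered_voice(corrected_power, noise_power, threshold, mode="max"):
    """Return True when the representative signal-to-noise ratio exceeds threshold.

    corrected_power, noise_power: per-frequency values over the chosen range.
    mode: "max" uses the largest per-frequency ratio, "sum" uses their sum.
    """
    per_frequency = [p / n for p, n in zip(corrected_power, noise_power)]
    if mode == "max":
        snr = max(per_frequency)
    elif mode == "sum":
        snr = sum(per_frequency)
    else:
        raise ValueError("mode must be 'max' or 'sum'")
    return snr > threshold
```

The threshold has to be chosen to match the selected mode, since a sum over many frequencies is generally much larger than a single maximum.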
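A plurality of the voice reception means are provided;
the input power calculation means is configured to perform the input power calculation processing for each of the plurality of voice reception means;
the correction function estimation means is configured to perform the correction function estimation processing for each of the plurality of voice reception means;
the input power correction means is configured to perform the input power correction processing for each of the plurality of voice reception means; and
the utterance voice detection means
is preferably configured to perform the utterance voice detection processing for each of the plurality of voice reception means and, for each frequency, to use the minimum of the input powers corrected by the input power correction means for the respective voice reception means as the noise power for the voice reception means that received the voice signal from which the maximum of those corrected input powers was calculated.
Preferably, for each frequency, the noise power for every voice reception means other than the one that received the voice signal from which the maximum of the corrected input powers was calculated is the input power corrected by the input power correction means for that voice reception means.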
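The two noise-power rules above can be combined in one small sketch. For each frequency, the microphone with the largest corrected power is assigned the smallest corrected power as its noise power, and every other microphone is assigned its own corrected power. The function name and the nested-list layout `corrected[k][i]` are assumptions made for this example.

```python
def noise_powers(corrected):
    """corrected[k][i]: corrected input power of microphone k at frequency i.

    Returns noise[k][i] following the rules in the text:
      - loudest microphone at frequency i -> minimum corrected power at i
      - every other microphone            -> its own corrected power at i
    """
    n_mics = len(corrected)
    n_freqs = len(corrected[0])
    noise = [[0.0] * n_freqs for _ in range(n_mics)]
    for i in range(n_freqs):
        column = [corrected[k][i] for k in range(n_mics)]
        loudest = max(range(n_mics), key=lambda k: column[k])
        for k in range(n_mics):
            noise[k][i] = min(column) if k == loudest else column[k]
    return noise
```

With this assignment the per-frequency ratio is 1 for every microphone except the loudest one, so only the microphone closest to the talker tends to exceed a detection threshold.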
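The input power calculation means is configured to divide the voice signal received by the voice reception means at predetermined frame intervals and to calculate the input power for each frequency for each of the divided parts;
the utterance voice detection device
includes time-average power calculation means for performing, for each of the plurality of voice reception means, time-average power calculation processing that calculates time-average power by averaging the input powers calculated by the input power calculation means for the respective parts of the voice signal; and
the correction function estimation means is preferably configured to perform, for each of the plurality of voice reception means, the correction function estimation processing that estimates the correction function defining the relationship between a frequency and a correction coefficient for bringing the time-average power calculated for that frequency closer to the time-average power calculated for that frequency by the time-average power calculation means for one of the plurality of voice reception means.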
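A minimal sketch of the time averaging described above, assuming the per-frame input powers are already available as lists. The names `time_average_power` and `reference_from_microphone`, and the index `ref_index` that selects the reference microphone, are hypothetical.

```python
def time_average_power(frames):
    """frames[t][i]: input power of frame t at frequency i.

    Returns the per-frequency average over all frames.
    """
    n_frames = len(frames)
    return [sum(column) / n_frames for column in zip(*frames)]


def reference_from_microphone(per_mic_frames, ref_index):
    """Average each microphone over time and pick one microphone as reference.

    per_mic_frames[k][t][i]: input power of microphone k, frame t, frequency i.
    Returns (time averages for all microphones, reference power list).
    """
    averages = [time_average_power(frames) for frames in per_mic_frames]
    return averages, averages[ref_index]
```

Averaging over many frames reduces the effect of propagation and transmission delays that differ between microphones, which is the motivation the document gives for using time-average power.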
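The correction function estimation means is preferably configured to use, as the reference power, the average power obtained by averaging the input powers calculated by the input power calculation means for the respective voice reception means.
The input power calculation means is configured to divide the voice signal received by the voice reception means at predetermined frame intervals and to calculate the input power for each frequency for each of the divided parts;
the utterance voice detection device
includes time-average power calculation means for performing, for each of the plurality of voice reception means, time-average power calculation processing that calculates time-average power by averaging the input powers calculated by the input power calculation means for the respective parts of the voice signal; and
the correction function estimation means is preferably configured to perform, for each of the plurality of voice reception means, the correction function estimation processing that estimates the correction function defining the relationship between a frequency and a correction coefficient for bringing the time-average power calculated for that frequency closer to the average time-average power, which is obtained by averaging the time-average powers calculated for that frequency by the time-average power calculation means for the respective voice reception means.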
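Written out as formulas, this variant can be summarized as below. The notation is an assumption made for illustration: the superscript (k) for the microphone index, the bar for the time average, and the symbol for the frequency of index i do not appear in this excerpt.

```latex
% \bar{x}_i^{(k)} : time-average power of microphone MCk at frequency index i (assumed notation)
% L               : number of microphones MC1 ... MCL
% y_i             : reference power (average time-average power) at frequency index i
% f^{(k)}         : correction function estimated for microphone MCk
y_i = \frac{1}{L} \sum_{k=1}^{L} \bar{x}_i^{(k)},
\qquad
f^{(k)} = \arg\min_{f} \sum_{i} \left( f(\omega_i)\, \bar{x}_i^{(k)} - y_i \right)^{2}
```

The first expression averages the time-average powers over all microphones, and the second restates the least-squares criterion with that average as the target.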
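The method performs input power calculation processing that calculates, for each frequency, input power representing the magnitude of the voice represented by a voice signal, based on the voice signal received by voice reception means that receives an input voice signal,
performs correction function estimation processing that estimates a correction function, which is a continuous function defining the relationship between a frequency and a correction coefficient for bringing the input power calculated for that frequency closer to a reference power determined for that frequency,
performs input power correction processing that corrects the input power by multiplying, for each frequency, the calculated input power by the correction coefficient obtained according to the relationship defined by the estimated correction function,
and performs utterance voice detection processing that determines, based on the corrected input power, whether or not the voice represented by the received voice signal is an uttered voice.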
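The method is preferably configured to estimate the correction function that minimizes the sum, over a predetermined frequency range, of the squared difference between the corrected input power and the reference power.
Preferably, noise power representing the magnitude of the noise in the voice represented by the voice signal received by the voice reception means is acquired for each frequency,
a per-frequency signal-to-noise ratio is calculated by dividing, for each frequency, the corrected input power by the acquired noise power, and a signal-to-noise ratio that is a value representative of the calculated per-frequency signal-to-noise ratios is acquired, and
the voice represented by the received voice signal is determined to be an uttered voice when the acquired signal-to-noise ratio is larger than a preset threshold.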
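The program causes an information processing device to implement:
input power calculation means for performing input power calculation processing that calculates, for each frequency, input power representing the magnitude of the voice represented by a voice signal, based on the voice signal received by voice reception means that receives an input voice signal;
correction function estimation means for performing correction function estimation processing that estimates a correction function, which is a continuous function defining the relationship between a frequency and a correction coefficient for bringing the input power calculated for that frequency closer to a reference power determined for that frequency;
input power correction means for performing input power correction processing that corrects the input power by multiplying, for each frequency, the calculated input power by the correction coefficient obtained according to the relationship defined by the estimated correction function; and
utterance voice detection means for performing utterance voice detection processing that determines, based on the corrected input power, whether or not the voice represented by the received voice signal is an uttered voice.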
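Preferably, noise power acquisition means for acquiring, for each frequency, noise power representing the magnitude of the noise in the voice represented by the voice signal received by the voice reception means, and
signal-to-noise ratio acquisition means for calculating a per-frequency signal-to-noise ratio by dividing, for each frequency, the corrected input power by the acquired noise power and acquiring a signal-to-noise ratio that is a value representative of the calculated per-frequency signal-to-noise ratios, are included, and
the configuration is such that the voice represented by the received voice signal is determined to be an uttered voice when the acquired signal-to-noise ratio is larger than a preset threshold.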
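11 Input power calculation unit
12 Input power correction unit
13 Time-average power calculation unit
14 Correction function estimation unit
15 Correction function storage unit
16 Utterance voice detection unit
16a Noise power acquisition unit
16b Signal-to-noise ratio acquisition unit
18 Voice reception unit
MC1 to MCL Microphones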
Claims (22)
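- An utterance voice detection device comprising:
voice reception means for receiving an input voice signal;
input power calculation means for performing input power calculation processing that calculates, for each frequency, input power representing the magnitude of the voice represented by the voice signal, based on the voice signal received by the voice reception means;
correction function estimation means for performing correction function estimation processing that estimates a correction function, which is a continuous function defining the relationship between a frequency and a correction coefficient for bringing the input power calculated for that frequency closer to a reference power determined for that frequency;
input power correction means for performing input power correction processing that corrects the input power by multiplying, for each frequency, the calculated input power by the correction coefficient obtained according to the relationship defined by the estimated correction function; and
utterance voice detection means for performing utterance voice detection processing that determines, based on the corrected input power, whether or not the voice represented by the received voice signal is an uttered voice.
- The utterance voice detection device according to claim 1, wherein
the correction function is a polynomial function with frequency as its variable.
- The utterance voice detection device according to claim 1 or claim 2, wherein
the correction function estimation means is configured to estimate the correction function that minimizes the sum, over a predetermined frequency range, of the squared difference between the corrected input power and the reference power.
- The utterance voice detection device according to any one of claims 1 to 3, wherein
the utterance voice detection means
includes noise power acquisition means for acquiring, for each frequency, noise power representing the magnitude of the noise in the voice represented by the voice signal received by the voice reception means, and
signal-to-noise ratio acquisition means for calculating a per-frequency signal-to-noise ratio by dividing, for each frequency, the corrected input power by the acquired noise power and acquiring a signal-to-noise ratio that is a value representative of the calculated per-frequency signal-to-noise ratios, and
is configured to determine that the voice represented by the received voice signal is an uttered voice when the acquired signal-to-noise ratio is larger than a preset threshold.
- The utterance voice detection device according to claim 4, wherein
the signal-to-noise ratio acquisition means is configured to acquire, as the signal-to-noise ratio, the sum of the calculated per-frequency signal-to-noise ratios over a predetermined frequency range.
- The utterance voice detection device according to claim 4, wherein
the signal-to-noise ratio acquisition means is configured to acquire, as the signal-to-noise ratio, the maximum of the calculated per-frequency signal-to-noise ratios.
- The utterance voice detection device according to any one of claims 4 to 6,
comprising a plurality of the voice reception means, wherein
the input power calculation means is configured to perform the input power calculation processing for each of the plurality of voice reception means,
the correction function estimation means is configured to perform the correction function estimation processing for each of the plurality of voice reception means,
the input power correction means is configured to perform the input power correction processing for each of the plurality of voice reception means, and
the utterance voice detection means
is configured to perform the utterance voice detection processing for each of the plurality of voice reception means and, for each frequency, to use the minimum of the input powers corrected by the input power correction means for the respective voice reception means as the noise power for the voice reception means that received the voice signal from which the maximum of those corrected input powers was calculated.
- The utterance voice detection device according to claim 7, wherein
the utterance voice detection means
is configured so that, for each frequency, the noise power for every voice reception means other than the one that received the voice signal from which the maximum of the corrected input powers was calculated is the input power corrected by the input power correction means for that voice reception means.
- The utterance voice detection device according to claim 7 or claim 8, wherein
the correction function estimation means is configured to use, as the reference power, the input power calculated by the input power calculation means for one of the plurality of voice reception means.
- The utterance voice detection device according to claim 9, wherein
the input power calculation means is configured to divide the voice signal received by the voice reception means at predetermined frame intervals and to calculate the input power for each frequency for each of the divided parts,
the utterance voice detection device
comprises time-average power calculation means for performing, for each of the plurality of voice reception means, time-average power calculation processing that calculates time-average power by averaging the input powers calculated by the input power calculation means for the respective parts of the voice signal, and
the correction function estimation means is configured to perform, for each of the plurality of voice reception means, the correction function estimation processing that estimates the correction function defining the relationship between a frequency and a correction coefficient for bringing the time-average power calculated for that frequency closer to the time-average power calculated for that frequency by the time-average power calculation means for one of the plurality of voice reception means.
- The utterance voice detection device according to claim 7 or claim 8, wherein
the correction function estimation means is configured to use, as the reference power, the average power obtained by averaging the input powers calculated by the input power calculation means for the respective voice reception means.
- The utterance voice detection device according to claim 11, wherein
the input power calculation means is configured to divide the voice signal received by the voice reception means at predetermined frame intervals and to calculate the input power for each frequency for each of the divided parts,
the utterance voice detection device
comprises time-average power calculation means for performing, for each of the plurality of voice reception means, time-average power calculation processing that calculates time-average power by averaging the input powers calculated by the input power calculation means for the respective parts of the voice signal, and
the correction function estimation means is configured to perform, for each of the plurality of voice reception means, the correction function estimation processing that estimates the correction function defining the relationship between a frequency and a correction coefficient for bringing the time-average power calculated for that frequency closer to the average time-average power, which is obtained by averaging the time-average powers calculated for that frequency by the time-average power calculation means for the respective voice reception means.
- The utterance voice detection device according to any one of claims 1 to 12, wherein
the correction function estimation means is configured to use a value stored in advance as the reference power.
- The utterance voice detection device according to any one of claims 1 to 13, wherein
the correction function estimation means is configured to estimate the correction function when the voice represented by the voice signal received by the voice reception means is white noise.
- An utterance voice detection method comprising:
performing input power calculation processing that calculates, for each frequency, input power representing the magnitude of the voice represented by a voice signal, based on the voice signal received by voice reception means that receives an input voice signal;
performing correction function estimation processing that estimates a correction function, which is a continuous function defining the relationship between a frequency and a correction coefficient for bringing the input power calculated for that frequency closer to a reference power determined for that frequency;
performing input power correction processing that corrects the input power by multiplying, for each frequency, the calculated input power by the correction coefficient obtained according to the relationship defined by the estimated correction function; and
performing utterance voice detection processing that determines, based on the corrected input power, whether or not the voice represented by the received voice signal is an uttered voice.
- The utterance voice detection method according to claim 15, wherein
the correction function is a polynomial function with frequency as its variable.
- The utterance voice detection method according to claim 15 or claim 16,
configured to estimate the correction function that minimizes the sum, over a predetermined frequency range, of the squared difference between the corrected input power and the reference power.
- The utterance voice detection method according to any one of claims 15 to 17, configured to:
acquire, for each frequency, noise power representing the magnitude of the noise in the voice represented by the voice signal received by the voice reception means;
calculate a per-frequency signal-to-noise ratio by dividing, for each frequency, the corrected input power by the acquired noise power, and acquire a signal-to-noise ratio that is a value representative of the calculated per-frequency signal-to-noise ratios; and
determine that the voice represented by the received voice signal is an uttered voice when the acquired signal-to-noise ratio is larger than a preset threshold.
- An utterance voice detection program for causing an information processing device to implement:
input power calculation means for performing input power calculation processing that calculates, for each frequency, input power representing the magnitude of the voice represented by a voice signal, based on the voice signal received by voice reception means that receives an input voice signal;
correction function estimation means for performing correction function estimation processing that estimates a correction function, which is a continuous function defining the relationship between a frequency and a correction coefficient for bringing the input power calculated for that frequency closer to a reference power determined for that frequency;
input power correction means for performing input power correction processing that corrects the input power by multiplying, for each frequency, the calculated input power by the correction coefficient obtained according to the relationship defined by the estimated correction function; and
utterance voice detection means for performing utterance voice detection processing that determines, based on the corrected input power, whether or not the voice represented by the received voice signal is an uttered voice.
- The utterance voice detection program according to claim 19, wherein
the correction function is a polynomial function with frequency as its variable.
- The utterance voice detection program according to claim 19 or claim 20, wherein
the correction function estimation means is configured to estimate the correction function that minimizes the sum, over a predetermined frequency range, of the squared difference between the corrected input power and the reference power.
- The utterance voice detection program according to any one of claims 19 to 21, wherein
the utterance voice detection means
includes noise power acquisition means for acquiring, for each frequency, noise power representing the magnitude of the noise in the voice represented by the voice signal received by the voice reception means, and
signal-to-noise ratio acquisition means for calculating a per-frequency signal-to-noise ratio by dividing, for each frequency, the corrected input power by the acquired noise power and acquiring a signal-to-noise ratio that is a value representative of the calculated per-frequency signal-to-noise ratios, and
is configured to determine that the voice represented by the received voice signal is an uttered voice when the acquired signal-to-noise ratio is larger than a preset threshold.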
Priority Applications (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
US13/125,493 US8856001B2 (en) | 2008-11-27 | 2009-09-03 | Speech sound detection apparatus |
JP2010540300A JP5459220B2 (ja) | 2008-11-27 | 2009-09-03 | Speech sound detection apparatus |
Applications Claiming Priority (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
JP2008-302242 | 2008-11-27 | ||
JP2008302242 | 2008-11-27 |
Publications (1)
Publication Number | Publication Date |
---|---|
WO2010061505A1 (ja) | 2010-06-03 |
Family
ID=42225397
Family Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
PCT/JP2009/004339 WO2010061505A1 (ja) | 2008-11-27 | 2009-09-03 | Speech sound detection apparatus |
Country Status (3)
Country | Link |
---|---|
US (1) | US8856001B2 (ja) |
JP (1) | JP5459220B2 (ja) |
WO (1) | WO2010061505A1 (ja) |
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
JP2014228753A (ja) * | 2013-05-23 | 2014-12-08 | Fujitsu Limited | Speech processing device, speech processing method, and speech processing program |
Families Citing this family (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
WO2010061506A1 (ja) * | 2008-11-27 | 2010-06-03 | NEC Corporation | Signal correction device |
US10431243B2 (en) * | 2013-04-11 | 2019-10-01 | Nec Corporation | Signal processing apparatus, signal processing method, signal processing program |
US9685156B2 (en) * | 2015-03-12 | 2017-06-20 | Sony Mobile Communications Inc. | Low-power voice command detector |
CN106887241A (zh) | 2016-10-12 | 2017-06-23 | Alibaba Group Holding Limited | Voice signal detection method and device |
Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
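JPH075895A (ja) * | 1993-04-20 | 1995-01-10 | Clarion Co Ltd | Speech recognition device and speech recognition method in noisy environments |
JP2007068125A (ja) * | 2005-09-02 | 2007-03-15 | Nec Corp | Signal processing method and device, and computer program |
JP2007328228A (ja) * | 2006-06-09 | 2007-12-20 | Sony Corp | Signal processing device, signal processing method, and program |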
Family Cites Families (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US7567900B2 (en) * | 2003-06-11 | 2009-07-28 | Panasonic Corporation | Harmonic structure based acoustic speech interval detection method and device |
US8311819B2 (en) * | 2005-06-15 | 2012-11-13 | Qnx Software Systems Limited | System for detecting speech with background voice estimates and noise estimates |
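JP4746533B2 (ja) | 2006-12-21 | 2011-08-10 | Nippon Telegraph and Telephone Corporation | Multi-sound-source voiced-section determination device, method, program, and recording medium therefor |
JP5134477B2 (ja) * | 2008-09-17 | 2013-01-30 | Nippon Telegraph and Telephone Corporation | Target signal section estimation device, target signal section estimation method, target signal section estimation program, and recording medium |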
-
2009
- 2009-09-03 JP JP2010540300A patent/JP5459220B2/ja active Active
- 2009-09-03 WO PCT/JP2009/004339 patent/WO2010061505A1/ja active Application Filing
- 2009-09-03 US US13/125,493 patent/US8856001B2/en active Active
Also Published As
Publication number | Publication date |
---|---|
US8856001B2 (en) | 2014-10-07 |
JP5459220B2 (ja) | 2014-04-02 |
JPWO2010061505A1 (ja) | 2012-04-19 |
US20110202339A1 (en) | 2011-08-18 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
 | 121 | EP: the EPO has been informed by WIPO that EP was designated in this application | Ref document number: 09828755; Country of ref document: EP; Kind code of ref document: A1 |
 | WWE | WIPO information: entry into national phase | Ref document number: 13125493; Country of ref document: US |
 | ENP | Entry into the national phase | Ref document number: 2010540300; Country of ref document: JP; Kind code of ref document: A |
 | NENP | Non-entry into the national phase | Ref country code: DE |
 | 122 | EP: PCT application non-entry in European phase | Ref document number: 09828755; Country of ref document: EP; Kind code of ref document: A1 |