WO2014168022A1 - Signal processing device, signal processing method, and signal processing program - Google Patents
- Publication number
- WO2014168022A1 (PCT application PCT/JP2014/058962)
- Authority
- WO
- WIPO (PCT)
Classifications
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/78—Detection of presence or absence of voice signals
- G10L25/84—Detection of presence or absence of voice signals for discriminating voice from noise
- G10L25/03—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
- G10L25/18—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters the extracted parameters being spectral information of each sub-band
- G10L17/00—Speaker identification or verification techniques
- G10L17/26—Recognition of special voice characteristics, e.g. for use in lie detectors; Recognition of animal voices
Definitions
- the present invention relates to a technique for detecting a human voice in an input signal.
- Patent Document 1 discloses a technique for detecting voice by determining the background noise level of an input voice frame and comparing the volume of the frame with a threshold corresponding to that noise level.
- An object of the present invention is to provide a technique for solving the above-described problems.
- a signal processing apparatus comprises: conversion means for converting an input signal into an amplitude component signal in the frequency domain; calculation means for calculating a norm of a change of the amplitude component signal in the frequency direction; integration means for integrating the norm of the change calculated by the calculation means; and analysis means for analyzing the voice in the input signal according to the integrated value calculated by the integration means.
- a signal processing method includes: a conversion step of converting an input signal into an amplitude component signal in the frequency domain; a calculation step of calculating a norm of a change of the amplitude component signal in the frequency direction; and an integration step of integrating the norm of the change calculated in the calculation step.
- a signal processing program causes a computer to execute: a conversion step of converting an input signal into an amplitude component signal in the frequency domain; a calculation step of calculating a norm of a change of the amplitude component signal in the frequency direction; and an integration step of integrating the norm of the change calculated in the calculation step.
- the accuracy of sound determination can be improved.
- note that a "voice signal" is an electrical signal that directly represents voice and other sounds and is used to transmit them; it is not limited to voice.
- the signal processing device 100 is a device that determines the presence of sound in an input signal.
- the signal processing apparatus 100 includes a conversion unit 101, a frequency direction difference calculation unit 102, an integration unit 103, and an analysis unit 104.
- the conversion unit 101 converts the input signal 110 into an amplitude component signal 130 in the frequency domain.
- the frequency direction difference calculation unit 102 calculates the norm of the change of the amplitude component signal 130 in the frequency direction.
- the integration unit 103 integrates the norm of the change calculated by the frequency direction difference calculation unit 102.
- the analysis unit 104 analyzes the voice in the input signal 110 according to the integration value 150 calculated by the integration unit 103.
- the noise is smooth in the frequency direction, while the voice has a large change in the frequency direction.
- a hard decision (0/1) may be made by comparing the integrated value with a threshold, or a soft decision may be made by scaling the integrated value itself into a fixed range (for example, 0-256).
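The flow above (conversion to an amplitude component signal, frequency-direction change, norm integration, hard or soft decision) can be sketched as follows. This is a minimal sketch, not the patented implementation: the FFT front end, the L1 norm, and all names are assumptions.

```python
import numpy as np

def voice_score(frame, threshold=None):
    # Convert the frame to an amplitude component signal in the frequency
    # domain, take the change in the frequency direction, and integrate its
    # norm. The L1 norm choice is an assumption.
    amp = np.abs(np.fft.rfft(frame))       # amplitude component signal
    change = np.diff(amp)                  # change in the frequency direction
    integrated = float(np.sum(np.abs(change)))
    if threshold is None:
        return integrated                  # soft value
    return int(integrated > threshold)     # hard 0/1 decision

rng = np.random.default_rng(0)
t = np.arange(512) / 8000.0
tone = np.sin(2 * np.pi * 440.0 * t)       # voice-like: sharp spectral peaks
noise = 0.05 * rng.standard_normal(512)    # noise: smooth, low-level spectrum
```

On the tone, the amplitude spectrum has sharp peaks, so adjacent bins differ greatly and the integrated norm is large; on the noise, the spectrum is smooth and the norm stays small.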
- FIG. 2 is a diagram for explaining a functional configuration of the signal processing apparatus according to the present embodiment.
- the signal processing device 200 includes a conversion unit 201, a frequency direction difference calculation unit 202, an integration unit 203, an analysis unit 204, and a frequency direction smoothing unit 205.
- the conversion unit 201 converts the input signal 210 into an amplitude component signal 230 in the frequency domain.
- the frequency direction smoothing unit 205 smoothes the amplitude component signal 230 in the frequency direction.
- the frequency direction difference calculation unit 202 calculates the norm of the change in the frequency direction of the smoothed amplitude component signal 230.
- the integrating unit 203 integrates the norm of the change calculated by the frequency direction difference calculating unit 202.
- the analysis unit 204 determines the presence of a female voice or a child voice based on the integrated value 250 calculated by the integrating unit 203. Note that the analysis unit 204 may determine the presence of a scream.
- This embodiment focuses on the fact that the spectra of female and child voices change more gradually in the frequency direction than those of male voices. Because a male voice fluctuates densely in the frequency direction, smoothing in the frequency direction turns it into a smooth curve, a waveform similar to noise. As a result, female and child voices can be extracted accurately.
- a scream has a higher pitch than usual speech and thus has characteristics similar to a female or child voice.
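The flattening effect of frequency-direction smoothing on a dense harmonic comb can be illustrated with toy spectra. The bin spacings, the moving-average width, and the helper names are assumptions chosen for the demonstration.

```python
import numpy as np

def freq_smooth(amp, width=9):
    # moving average along the frequency axis; width is an assumed parameter
    return np.convolve(amp, np.ones(width) / width, mode="same")

def change_norm(amp):
    # integrated L1 norm of the frequency-direction change
    return float(np.sum(np.abs(np.diff(amp))))

# toy spectra of roughly equal energy: a dense harmonic comb (male-like)
# and a sparse comb (female- or child-like)
dense = np.zeros(200)
dense[::3] = 1.0        # peaks every 3 bins
sparse = np.zeros(200)
sparse[::18] = 2.36     # peaks every 18 bins
```

Without smoothing, the dense comb has the larger integrated change; after smoothing wider than the dense peak spacing, the dense comb flattens into a noise-like curve while the sparse comb keeps its peaks, so the ranking flips.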
- FIG. 3 is a diagram for explaining a functional configuration of the signal processing apparatus according to the present embodiment.
- the signal processing device 300 includes a conversion unit 301, a frequency direction difference calculation unit 302, an integration unit 303, an analysis unit 304, and a time direction smoothing unit 305.
- the converter 301 converts the input signal 310 into an amplitude component signal 330 in the frequency domain.
- the time direction smoothing unit 305 smoothes the amplitude component signal 330 in the time direction.
- the frequency direction difference calculation unit 302 calculates the norm of the change in the frequency direction of the smoothed amplitude component signal.
- the integrating unit 303 integrates the norm of the change calculated by the frequency direction difference calculating unit 302.
- the analysis unit 304 determines the presence of a male voice based on the integrated value 350 calculated by the integration unit 303.
- in this embodiment, whether or not a male voice is mixed in the input signal 310 is determined. This is useful when it is desired to know whether a man is present in a target space. For example, in a place where men are not permitted, such as a girls' dormitory, an alert can be raised when a male voice is detected; false alarms are reduced and the alert is issued more accurately.
- FIG. 4 is a diagram for explaining a functional configuration of the signal processing apparatus according to the present embodiment.
- the signal processing device 400 includes a conversion unit 401, frequency direction difference calculation units 402 and 412, integration units 403 and 413, an analysis unit 404, a frequency direction smoothing unit 405, and a time direction smoothing unit 415.
- the conversion unit 401 converts the input signal 410 into an amplitude component signal 430 in the frequency domain.
- the frequency direction smoothing unit 405 smoothes the amplitude component signal 430 in the frequency direction.
- the time direction smoothing unit 415 smoothes the amplitude component signal 430 in the time direction.
- the frequency direction difference calculation units 402 and 412 calculate the norm of the change in the frequency direction of the smoothed amplitude component signal.
- Integration units 403 and 413 integrate the norm of the change calculated by frequency direction difference calculation units 402 and 412.
- the analysis unit 404 determines the presence of a male voice and the presence of a female voice and a child voice based on the integrated values calculated by the integrating units 403 and 413.
- recognition accuracy can be improved by combining this determination with speech recognition, for example by switching between a recognition dictionary for male voices and one for female and child voices.
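The two-branch structure above (one branch smoothed in frequency, one in time, each followed by its own norm integration) might be sketched as follows. The smoothing parameters and function names are assumptions.

```python
import numpy as np

def branch_scores(amp_frames, fwidth=9, alpha=0.8):
    # Integrated change norms from a frequency-smoothed branch (female/child
    # indicator) and a time-smoothed branch (male indicator) of |X(k, n)|.
    kern = np.ones(fwidth) / fwidth
    fsm = np.apply_along_axis(lambda a: np.convolve(a, kern, "same"),
                              1, amp_frames)
    tsm = np.empty_like(amp_frames)
    tsm[0] = amp_frames[0]
    for n in range(1, len(amp_frames)):
        tsm[n] = alpha * tsm[n - 1] + (1 - alpha) * amp_frames[n]
    female_child = float(np.sum(np.abs(np.diff(fsm, axis=1))))
    male = float(np.sum(np.abs(np.diff(tsm, axis=1))))
    return female_child, male

rng = np.random.default_rng(3)
amp_frames = np.abs(rng.standard_normal((6, 64)))
female_child, male = branch_scores(amp_frames)
```

Each score would then be compared with its own threshold, as described for the analysis unit 404.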
- the signal processing apparatus appropriately suppresses non-stationary noise such as wind noise, for example.
- the input sound is not limited to voice.
- voice will be described as a representative example of input sound.
- FIG. 5 is a block diagram showing the overall configuration of the signal processing device 200.
- a degradation signal (a signal in which a desired signal and noise are mixed) is supplied to the input terminal 506 as a sample value series.
- the degradation signal supplied to the input terminal 506 is subjected to transformation such as Fourier transformation in the transformation unit 501 and is divided into a plurality of frequency components.
- a plurality of frequency components are processed independently for each frequency.
- the description will be continued focusing on a specific frequency component.
- the conversion unit 501 supplies the degraded signal amplitude spectrum |X(k, n)| to the stationary component estimation unit 502, the replacement unit 503, and the voice detection unit 505, and supplies the phase spectrum (phase component) 520 to the inverse conversion unit 504.
- the stationary component estimation unit 502 estimates the stationary component contained in the degraded signal amplitude spectrum |X(k, n)|.
- the voice detection unit 505 determines, for each frequency, whether or not voice is contained in the degraded signal amplitude spectrum |X(k, n)|.
- the function for obtaining the amplitude spectrum used for replacement is not limited to a linear mapping of N(k, n) such as β(k, n)N(k, n). For example, a linear function β(k, n)N(k, n) + C(k, n) may be used; when C(k, n) > 0, the level of the replacement amplitude spectrum is raised as a whole, which improves the perceived stationarity.
- the inverse conversion unit 504 combines the degraded signal phase spectrum 520 supplied from the conversion unit 501 with the enhanced signal amplitude spectrum Y(k, n) supplied from the replacement unit 503, performs the inverse transform to obtain an enhanced signal, and supplies it to the output terminal 507.
- the voice distortion due to the suppression can be avoided.
- FIG. 6 is a diagram for explaining another example of the signal processing apparatus according to the present embodiment.
- the signal processing device 600 according to the present embodiment differs in that the voice detection unit 605 outputs, for each frequency, the probability p(k, n) that voice is contained in the degraded signal amplitude spectrum |X(k, n)|.
- p (k, n) is a real number from 0 to 1.
- the replacement unit 603 performs a replacement process according to the speech existence probability p (k, n). Since other configurations and operations are the same as those in FIG. 5, the same configurations and operations are denoted by the same reference numerals, and detailed description thereof is omitted.
- the replacement unit 603 replaces the degraded signal amplitude spectrum |X(k, n)| with an output signal Y(k, n) computed as γ(p(k, n))N(k, n), using a function γ(p(k, n)) of p(k, n) whose range is 0 to 1.
- FIG. 7 is a block diagram illustrating a configuration of the conversion unit 501.
- the converting unit 501 includes a frame dividing unit 711, a windowing unit 712, and a Fourier transform unit 713.
- the degraded signal samples are supplied to the frame dividing unit 711 and divided into frames every K/2 samples, where K is an even number.
- the degraded signal samples divided into frames are supplied to the windowing unit 712 and multiplied by the window function w(t).
- a symmetric window function is used for real signals.
- the windowed output is supplied to the Fourier transform unit 713 and converted into the degraded signal spectrum X(k, n), which is separated into phase and amplitude: the degraded signal phase spectrum arg X(k, n) and the degraded signal amplitude spectrum |X(k, n)|.
- a power spectrum can be used instead of an amplitude spectrum.
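A minimal sketch of the conversion unit's processing chain (frame division every K/2 samples, windowing, Fourier transform, separation into amplitude and phase) follows; the Hann window is an assumed choice of symmetric window for a real signal, and the function name is illustrative.

```python
import numpy as np

def to_spectra(x, K=256):
    # Frame division every K/2 samples (K even), windowing, Fourier
    # transform, and separation into |X(k, n)| and arg X(k, n).
    w = np.hanning(K)
    out = []
    for start in range(0, len(x) - K + 1, K // 2):
        X = np.fft.rfft(w * x[start:start + K])
        out.append((np.abs(X), np.angle(X)))   # amplitude, phase spectra
    return out

x = np.random.default_rng(1).standard_normal(1024)
spectra = to_spectra(x)
```

As noted above, the power spectrum (the square of the amplitude) could be used in place of the amplitude with no structural change.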
- FIG. 8 is a block diagram illustrating a configuration of the inverse transform unit 504.
- the inverse transform unit 504 includes an inverse Fourier transform unit 811, a windowing processing unit 812, and a frame composition unit 813.
- the inverse Fourier transform unit 811 combines the enhanced signal amplitude spectrum |Y(k, n)| with the degraded signal phase spectrum arg X(k, n) and performs an inverse Fourier transform; the result is supplied to the windowing processing unit 812 and multiplied by the window function w(t).
- the windowed output signal is passed from the frame composition unit 813 to the output terminal 507. In FIGS. 7 and 8, the transform in the conversion unit 501 and the inverse conversion unit 504 has been described as a Fourier transform, but other transforms such as the Hadamard transform, Haar transform, and wavelet transform can be used instead.
- the Haar transform requires no multiplications and can reduce circuit area in an LSI implementation. Since the wavelet transform can change the time resolution depending on frequency, an improved noise suppression effect can be expected.
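The analysis/synthesis round trip through the transform and inverse transform, with overlap-add frame composition, can be sketched as follows. The square-root Hann window applied at both analysis and synthesis is an assumed pairing (not stated in the text) chosen so that the interior of the signal reconstructs exactly at 50% overlap.

```python
import numpy as np

def round_trip(x, K=256):
    # Transform, amplitude/phase separation, inverse transform, and
    # overlap-add frame composition. amp would be modified by the
    # replacement unit in the full system; here it passes through.
    w = np.sqrt(0.5 - 0.5 * np.cos(2 * np.pi * np.arange(K) / K))
    y = np.zeros(len(x))
    for s in range(0, len(x) - K + 1, K // 2):
        X = np.fft.rfft(w * x[s:s + K])
        amp, phase = np.abs(X), np.angle(X)
        y[s:s + K] += w * np.fft.irfft(amp * np.exp(1j * phase), K)
    return y

x = np.random.default_rng(2).standard_normal(1024)
y = round_trip(x)
```

With this window pair the squared windows of adjacent frames sum to one, so every interior sample is reconstructed; only the first and last half-frames lack full overlap.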
- the number of frequency components after integration is smaller than the number of frequency components before integration.
- a common stationary component spectrum may be obtained for the integrated frequency components obtained by integrating the frequency components and used in common for individual frequency components belonging to the same integrated frequency component.
- the stationary component spectrum is a stationary component included in the input signal amplitude spectrum.
- the stationary component has a feature that the time change of the power is smaller than that of the input signal.
- the time change is generally calculated as a difference or a ratio.
- the time change is calculated as a difference, when the input signal amplitude spectrum and the stationary component spectrum are compared in a certain frame n, there is at least one frequency k that satisfies the relationship of the following equation.
- conversely, if the left side of the above equation were always larger than the right side for all frames n and frequencies k, N(k, n) would not be a stationary component spectrum.
- the function can be defined in the same way even if the function is an exponent of X and N, logarithm, or power.
- Non-Patent Document 1 discloses a method in which an estimated noise spectrum is an average value of a deteriorated signal amplitude spectrum of a frame in which a target sound is not generated. In this method, it is necessary to detect the generation of the target sound. The section in which the target sound is generated can be determined by the power of the enhancement signal.
- the enhanced signal consists mainly of the target sound, with noise suppressed. Also, the levels of the target sound and the noise do not change greatly between adjacent frames. For these reasons, the enhanced signal level one frame in the past is used as an index for determining a noise section: when the enhanced signal power of the previous frame is below a certain value, the current frame is determined to be a noise section.
- the noise spectrum can be estimated by averaging the deteriorated signal amplitude spectrum of the frame determined as the noise interval.
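The noise-section averaging just described might be sketched as a recursive update. The smoothing constant beta, the power threshold, and the function name are assumptions.

```python
import numpy as np

def update_noise(noise_est, amp, prev_enhanced_power,
                 power_thresh=1.0, beta=0.9):
    # Recursive averaging of |X(k, n)| over frames judged to be noise
    # sections: the current frame counts as noise when the previous
    # frame's enhanced-signal power is below the threshold.
    if prev_enhanced_power < power_thresh:
        return beta * noise_est + (1.0 - beta) * amp
    return noise_est

est = np.zeros(4)
est = update_noise(est, np.ones(4), prev_enhanced_power=0.1)        # noise frame
kept = update_noise(est, 10 * np.ones(4), prev_enhanced_power=5.0)  # speech frame
```

Frames whose previous enhanced power is high leave the estimate untouched, so the target sound does not leak into the noise spectrum.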
- Non-Patent Document 1 also discloses a method in which the estimated noise spectrum is an average value at the initial stage of estimation when the deteriorated signal amplitude spectrum is supplied. In this case, it is necessary to satisfy the condition that the target sound is not included immediately after the estimation is started. When the condition is satisfied, the degradation signal amplitude spectrum at the initial stage of estimation can be set as the estimated noise spectrum.
- Non-Patent Document 2 discloses a method for obtaining an estimated noise spectrum from a minimum value (minimum statistic) of a deteriorated signal amplitude spectrum.
- the minimum value of the degradation signal amplitude spectrum in a fixed time is held, and the noise spectrum is estimated from the minimum value. Since the minimum value of the degraded signal amplitude spectrum is similar to the spectrum shape of the noise spectrum, it can be used as an estimated value of the noise spectrum shape. However, the minimum value is smaller than the original noise level. Therefore, an estimated noise spectrum is obtained by appropriately amplifying the minimum value.
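A minimal sketch of the minimum-statistics idea (hold the minimum of the amplitude spectrum over a fixed number of frames, then amplify it to compensate for the downward bias); the window length and amplification factor are assumed values.

```python
import numpy as np

def min_stat_noise(amp_frames, window=8, overestimate=1.5):
    # Per-frame noise estimate: the minimum of |X(k, n)| over the last
    # `window` frames, amplified because the minimum underestimates the
    # true noise level.
    out = np.empty_like(amp_frames)
    for n in range(len(amp_frames)):
        lo = max(0, n - window + 1)
        out[n] = overestimate * amp_frames[lo:n + 1].min(axis=0)
    return out

frames = np.ones((20, 4))    # stationary noise floor at amplitude 1
frames[10:12] = 10.0         # a short speech burst
est = min_stat_noise(frames)
```

Because the burst is shorter than the minimum-tracking window, the estimate stays on the noise floor throughout and never follows the speech.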
- an estimated noise spectrum may be obtained using a median filter.
- the estimated noise spectrum may be obtained using WiNE (Weighted Noise Estimation), which is a noise estimation method that follows the changing noise, utilizing the property that the noise fluctuates slowly.
- FIG. 9 is a diagram illustrating an exemplary configuration of the voice detection units 505 and 605.
- the sound detection units 505 and 605 include a frequency direction difference calculation unit 902, an integration unit 903, and an analysis unit 904.
- the frequency direction difference calculation unit 902 calculates the norm of the change in the amplitude component signal in the frequency direction.
- the change in the frequency direction mainly means a difference or ratio between adjacent frequency components. For example, when the change is defined as a difference of the amplitude component signal |X(k, n)|, the difference D(k, n) = |X(k, n)| - |X(k-1, n)| is calculated and its Lm norm is used.
- the range of k may be limited to reduce the amount of calculation. Since voice components are concentrated in the low-frequency range, it is better to use small values of k, that is, values belonging to the low-frequency range. Further, when the number of frequency bins is large, the differences between bins k-1 and k+1, or between k-2 and k, may be calculated instead of between k-1 and k.
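The norm calculation with a restricted k range and a wider difference spacing might look like this; all parameter names and defaults are assumptions.

```python
import numpy as np

def change_norm(amp, m=2, k_max=None, spacing=1):
    # Lm norm of the frequency-direction difference D(k, n). k_max limits
    # the computation to the low-frequency range where voice concentrates;
    # spacing=2 takes the difference between bins k-1 and k+1.
    if k_max is not None:
        amp = amp[:k_max]
    d = amp[spacing:] - amp[:-spacing]
    return float(np.sum(np.abs(d) ** m) ** (1.0 / m))
```

For an alternating spectrum the adjacent-bin differences are all nonzero, while a spacing of 2 compares bins of equal value and yields zero, showing how the spacing choice changes sensitivity.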
- the integration unit 903 integrates the norm of the change calculated by the frequency direction difference calculation unit 902.
- the analysis unit 904 in the voice detection unit 505 compares the integrated value 950 calculated by the integration unit 903 with a threshold stored in advance, thereby determining the presence or absence (0/1) of voice in the degraded amplitude component signal |X(k, n)|.
- the analysis unit 904 in the voice detection unit 605 scales the integrated value 950 calculated by the integration unit 903 to obtain the voice existence probability p(k, n) in the degraded amplitude component signal |X(k, n)|.
- the analysis unit 904 may determine the presence of a specific person's voice by comparing the integrated value with an integrated value relating to the voice of the specific person stored in advance.
- (Spectral shape in an example of a voice detector) FIGS. 10 and 11 are diagrams showing the degraded signal amplitude spectrum (input signal spectrum) |X(k, n)|.
- as shown in FIG. 10, when the total sum of the norms of the amplitude differences between adjacent frequencies is small, the signal is determined to be noise.
- as shown in FIG. 11, when the sum of the norms of the amplitude differences between adjacent frequencies is large, the signal is determined to be voice (desired voice, target sound).
- FIG. 12 is a diagram illustrating a configuration of another example of the sound detection units 505 and 605.
- the voice detection units 505 and 605 in this example include frequency direction difference calculation units 1202 and 1212, integration units 1203 and 1213, an analysis unit 1204, a frequency direction smoothing unit 1205, and a time direction smoothing unit 1215.
- the frequency direction smoothing unit 1205 smoothes the degraded amplitude component signal |X(k, n)| in the frequency direction.
- smoothing examples include moving average and first-order leak integration.
- when a moving average is adopted as the smoothing means, the smoothed amplitude component signal is obtained by averaging |X(k, n)| over neighboring frequency bins.
- the time direction smoothing unit 1215 smoothes the degraded amplitude component signal |X(k, n)| in the time direction.
- Frequency direction difference calculation sections 1202 and 1212 calculate the norm of the change in the frequency direction of the smoothed amplitude component signal.
- Integration units 1203 and 1213 integrate the norms of changes calculated by frequency direction difference calculation units 1202 and 1212.
- the analysis unit 1204 determines the presence of a male voice and the presence of a female voice and a child voice based on the integrated values calculated by the integrating units 1203 and 1213.
- the analysis unit 1204 in the voice detection unit 505 compares the integrated value calculated by the integration unit 1203 with a threshold stored in advance, compares the integrated value calculated by the integration unit 1213 with another threshold stored in advance, and thereby determines the presence or absence (0/1) of voice in the degraded amplitude component signal |X(k, n)|.
- the analysis unit 1204 in the voice detection unit 605 adds the integrated values calculated by the integration units 1203 and 1213 and scales the sum to obtain the voice existence probability p(k, n) in the degraded amplitude component signal |X(k, n)|.
- the analysis unit 1204 may determine the presence of a voice of a specific person by comparing the integrated value with an integrated value related to a specific male or female voice stored in advance.
- FIG. 13 is a diagram for explaining a difference in spectrum shape depending on gender.
- as shown in FIG. 13, the spectra of female and child voices change more gradually in the frequency direction than those of male voices. Because a male voice fluctuates densely in the frequency direction, smoothing in the frequency direction turns it into a smooth curve, a waveform similar to noise. That is, by using the frequency direction smoothing unit 1205, female and child voices can be accurately extracted.
- conversely, since female and child voices change rapidly in the time direction, smoothing in the time direction turns them into a smooth curve, a waveform similar to noise. That is, a male voice can be accurately extracted by using the time direction smoothing unit 1215.
- FIG. 14 is a diagram illustrating a change in the spectrum shape of the output signal Y (k, n) according to the value of p (k, n).
- the larger the value of p(k, n), the closer the spectral shape of the output signal Y(k, n) is to the degraded signal amplitude spectrum |X(k, n)|; the smaller it is, the closer the shape is to the stationary component spectrum N(k, n).
- γ(k, n) may be adjusted according to the S/N ratio: when the S/N ratio is low, noise dominates, so γ(k, n) may be made small to apply strong suppression; conversely, when the S/N ratio is high, the noise is small, so γ(k, n) may be left at 1.
- alternatively, γ(k, n) may be a function that is sufficiently small when k exceeds a certain threshold, or a function that decreases monotonically as k increases.
- in this way, the noise can be made stationary according to the likelihood that voice is present, and non-stationary noise such as wind noise can be suppressed while effectively avoiding voice distortion.
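The probability-weighted replacement can be sketched as a linear mix; the linear form is an assumed instance of the mapping from p(k, n) to the output, and beta is an assumed coefficient.

```python
import numpy as np

def soft_replace(amp, noise, p, beta=1.0):
    # Replacement weighted by the voice existence probability p(k, n):
    # the output approaches |X(k, n)| as p -> 1 and the stationary
    # component beta * N(k, n) as p -> 0.
    return p * amp + (1.0 - p) * beta * noise

amp = np.array([5.0, 0.2, 3.0])     # degraded amplitude spectrum bins
noise = np.array([0.5, 0.5, 0.5])   # stationary component spectrum bins
```

With p = 1 the input passes through unchanged (no voice distortion); with p = 0 the bin is fully replaced by the stationary component, which is what makes the residual noise sound steady.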
- the replacement unit 503 may replace the amplitude component for each subband instead of for each frequency.
- FIG. 15 is a diagram for explaining the configuration of the replacement unit 503 of the signal processing device according to the present embodiment.
- the replacement unit 503 according to the present embodiment is different from the fifth embodiment in that it includes a comparison unit 1531 and an upper replacement unit 1532. Since other configurations and operations are the same as those of the fifth embodiment, the same configurations and operations are denoted by the same reference numerals, and detailed description thereof is omitted.
- the comparison unit 1531 compares the degraded signal amplitude spectrum |X(k, n)| with β1(k, n)N(k, n), which is obtained by applying a first function, here a linear mapping, to the stationary component spectrum N(k, n).
- the upper replacement unit 1532 receives the voice presence/absence signal (0/1) from the voice detection unit 505; in a non-voice section, where |X(k, n)| exceeds β1(k, n)N(k, n), it replaces the amplitude with β2(k, n)N(k, n), and otherwise keeps |X(k, n)| as the output.
- the replacement value is not limited to a linear mapping of the stationary component spectrum N(k, n); for example, a linear function such as β1(k, n)N(k, n) + C(k, n) can be adopted.
- β2(k, n) can be obtained for each time by the following procedure (1)-(2).
- (1) calculate in advance the short-time moving average X_bar(k, n) of the input signal (k and n are indices corresponding to frequency and time, respectively).
- (2) calculate the difference between the short-time moving average X_bar(k, n) and the value after replacement, β2(k, n)N(k, n), and correct β2(k, n) by multiplying it by a constant according to the difference (for example, β2_hat(k, n) = 0.5 × β2(k, n) or β2_hat(k, n) = 0.8 × β2(k, n)).
- the method of obtaining β2(k, n) is not limited to the above.
- β2(k, n) may be set in advance as a constant value independent of time.
- the value of β2(k, n) may also be determined by actually listening to the processed speech; that is, it may be determined according to the characteristics of the microphone and of the device to which the microphone is attached.
- alternatively, calculation formulas 1 to 3 may be used to calculate the coefficient β2(k, n) from the short-time moving averages before and after time n; if the condition is not satisfied, β2(k, n) = β1(k, n) may be set.
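The upper replacement rule described in this embodiment can be sketched as follows; the coefficient values β1 and β2 are illustrative assumptions, not values from the text.

```python
import numpy as np

def upper_replace(amp, noise, voice_present, beta1=2.0, beta2=1.0):
    # Upper replacement: in a non-voice section, bins where |X(k, n)|
    # exceeds beta1 * N(k, n) are replaced with beta2 * N(k, n);
    # elsewhere |X(k, n)| is kept unchanged.
    amp = np.asarray(amp, dtype=float)
    if voice_present:
        return amp.copy()
    noise = np.asarray(noise, dtype=float)
    return np.where(amp > beta1 * noise, beta2 * noise, amp)

amp = np.array([5.0, 0.8, 3.0])   # bins 0 and 2 jump above 2 * N(k, n)
noise = np.ones(3)
```

Only the bins that jump above the threshold are pulled down to the stationary level; voice sections pass through untouched.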
- FIG. 16 is a diagram for explaining the configuration of the replacement unit 503 of the signal processing device according to the present embodiment.
- the replacement unit 503 according to the present embodiment is different from the fifth embodiment in that it includes a comparison unit 1631 and a lower replacement unit 1632. Since other configurations and operations are the same as those of the fifth embodiment, the same configurations and operations are denoted by the same reference numerals, and detailed description thereof is omitted.
- the comparison unit 1631 compares the degraded signal amplitude spectrum |X(k, n)| with β1(k, n)N(k, n), a coefficient multiple of the stationary component spectrum.
- the lower replacement unit 1632 receives the voice presence/absence signal (0/1) from the voice detection unit 505; in a non-voice section, only where the amplitude (power) component |X(k, n)| is smaller than β1(k, n)N(k, n), it replaces the amplitude with β2(k, n)N(k, n), and otherwise keeps |X(k, n)| as the output.
- β2(k, n) can be obtained for each time by the following procedure (1)-(2): the difference between the short-time moving average X_bar(k, n) and the value after replacement, β2(k, n)N(k, n), is calculated, and β2(k, n) is corrected accordingly.
- the method of obtaining β2(k, n) is not limited to the above.
- β2(k, n) may be set in advance as a constant value independent of time.
- the value of β2(k, n) may also be determined by actually listening to the processed speech; that is, it may be determined according to the characteristics of the microphone and of the device to which the microphone is attached.
- alternatively, calculation formulas 1 to 3 may be used to calculate the coefficient β2(k, n) from the short-time moving averages before and after time n; if the condition is not satisfied, β2(k, n) = β1(k, n) may be set.
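The lower replacement rule, symmetric to the upper one, might be sketched like this; the coefficient values are illustrative assumptions.

```python
import numpy as np

def lower_replace(amp, noise, voice_present, beta1=0.5, beta2=0.4):
    # Lower replacement: in a non-voice section, bins where |X(k, n)|
    # falls below beta1 * N(k, n) are replaced with beta2 * N(k, n);
    # elsewhere |X(k, n)| is kept unchanged.
    amp = np.asarray(amp, dtype=float)
    if voice_present:
        return amp.copy()
    noise = np.asarray(noise, dtype=float)
    return np.where(amp < beta1 * noise, beta2 * noise, amp)

amp = np.array([0.1, 0.8, 0.05])   # bins 0 and 2 drop below 0.5 * N(k, n)
noise = np.ones(3)
```

Raising abnormally small bins toward the stationary level suppresses the fluttering dips that make non-stationary noise sound unsteady.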
- FIG. 17 is a diagram for explaining the configuration of the replacement unit 503 of the signal processing device according to the present embodiment.
- the replacement unit 503 according to the present embodiment is different from the sixth embodiment in that it includes a second comparison unit 1733 and a lower replacement unit 1734. Since other configurations and operations are the same as those of the fifth embodiment, the same configurations and operations are denoted by the same reference numerals, and detailed description thereof is omitted.
- the upper replacement unit 1532 receives the voice presence/absence signal (0/1) from the voice detection unit 505; in a non-voice section, where |X(k, n)| exceeds β1(k, n)N(k, n), it replaces the amplitude with β2(k, n)N(k, n) to obtain the output signal Y1(k, n).
- the lower replacement unit 1734 receives the voice presence/absence signal (0/1) from the voice detection unit 505; in a non-voice section, only frequencies where the output signal Y1(k, n) from the upper replacement unit 1532 is smaller than β1(k, n) times the stationary component signal N(k, n) are replaced with β2(k, n) times the stationary component signal N(k, n), and the larger spectrum is taken as the output signal Y(k, n).
- FIG. 18 is a diagram for explaining the configuration of the replacement unit 503 of the signal processing device according to the present embodiment.
- The upper replacement unit 1832 performs replacement using α(k, n) times the degraded amplitude signal |X(k, n)|.
- The upper replacement unit 1832 receives the input amplitude component signal |X(k, n)|, and the value is replaced with α2(k, n) times |X(k, n)|.
- This is effective when the fluctuation of the input signal is large and it is desired to preserve the features of the spectral shape of the output signal as much as possible. For example, it is effective when speech recognition is desired while suppressing wind noise.
- Since the threshold α1(k, n)N(k, n), which is a predetermined coefficient multiple of the stationary component signal, can be maintained, the sound quality is improved.
- FIG. 19 is a diagram for explaining the configuration of the replacement unit 503 of the signal processing device according to the present embodiment.
- The replacement unit 503 according to the present embodiment differs from the eighth embodiment in that the upper replacement unit 1932 performs the replacement using α2(k, n) times the degraded amplitude signal |X(k, n)|. Since the other configurations and operations are the same as those of the eighth embodiment, the same reference numerals are given to the same configurations and operations, and detailed description thereof is omitted.
- In a non-voice section, the upper replacement unit 1932 replaces only the portions where the amplitude (power) component |X(k, n)| exceeds the threshold with α2(k, n) times that value, and uses the smaller spectrum as the output signal Y(k, n) as it is.
- The application fields of the voice detection explained in the first embodiment include the following, as described in Section 2.2 of Non-Patent Document 1.
- If the signal in non-voice sections is removed from the input signal and only the voice sections are encoded and transmitted, a reduction in transfer charges can be realized.
- If the bit rate is changed between voice sections and non-voice sections at the time of encoding, more effective and higher-quality information communication can be performed.
- Signal processing can be performed with high performance by separating noise suppression, dereverberation, sound source separation, and echo-canceller processing between the non-voice sections and the voice sections.
- Non-Patent Document 2 second paragraph of Section 4.1.3 “Experimental Method”
- Non-Patent Document 3 FIG. 1
- Non-Patent Document 4 p.26
- The present invention may be applied to a system composed of a plurality of devices, or to a single device. Furthermore, the present invention is also applicable to the case where a signal processing program that realizes the functions of the embodiments is supplied directly or remotely to a system or apparatus. Therefore, in order to realize the functions of the present invention on a computer, a program installed on the computer, a medium storing the program, and a WWW (World Wide Web) server from which the program is downloaded are also included in the scope of the present invention. In particular, at least a non-transitory computer-readable medium storing a signal processing program that causes a computer to execute the processing steps included in the above-described embodiments is included in the scope of the present invention.
- the input signal is converted into an amplitude component signal in the frequency domain (S2001).
- a norm of change of the amplitude component signal in the frequency direction is calculated (S2003).
- the calculated norm of change is integrated (S2005).
- the voice in the input signal is analyzed according to the integrated value (S2007).
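Read together, steps S2001 to S2007 amount to a small pipeline: take the amplitude components of one frame, form the norm of the change between adjacent frequency bins, sum it over frequency, and decide from the sum. The sketch below is a minimal illustration of that flow, assuming the L1 norm and a fixed decision threshold; the function names and the threshold are illustrative choices, not taken from the patent.

```python
import numpy as np

def frequency_change_norm(amp, m=1):
    """Step S2003: norm D(k, n) of the change between adjacent frequency bins.

    amp: 1-D array of amplitude components |X(k, n)| for one frame n.
    m = 1 gives the L1 norm, i.e. the absolute difference.
    """
    return np.abs(np.diff(amp)) ** m

def integrated_norm(amp, m=1):
    """Step S2005: integrate (sum) the change norms over frequency."""
    return float(np.sum(frequency_change_norm(amp, m)))

def is_voice(amp, threshold):
    """Step S2007: declare voice when the integrated value exceeds a threshold."""
    return integrated_norm(amp) > threshold

# A flat (noise-like) spectrum integrates to zero; a spiky harmonic
# (voice-like) spectrum integrates to a large value.
flat = np.ones(64)
harmonic = np.ones(64)
harmonic[::8] = 5.0  # peaks at every 8th bin
```

With this toy input, the flat spectrum yields an integrated value of 0 while the peaky one yields a large value, matching the noise/voice distinction of FIGS. 10 and 11.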
- The program modules for performing each of these processes are stored in the memory 2004, and the same effects as those of the first embodiment can be obtained by the CPU 2002 sequentially executing the program modules stored in the memory 2004.
- the CPU 2002 executes the program module corresponding to each functional configuration described in the block diagram from the memory 2004, so that the effects of the respective embodiments can be obtained.
- (Appendix 3) Frequency-direction smoothing means for smoothing the amplitude component signal in the frequency direction is further included;
- the calculation means calculates a norm of the change in the frequency direction of the amplitude component signal smoothed by the frequency-direction smoothing means;
- the integration means integrates the norm of the change calculated by the calculation means;
- the signal processing apparatus according to appendix 1 or 2, wherein the analysis means determines the presence of a female voice or a child voice based on the integrated value.
- (Appendix 4) Time-direction smoothing means for smoothing the amplitude component signal in the time direction is further included;
- the calculation means calculates a norm of the change in the frequency direction of the amplitude component signal smoothed by the time-direction smoothing means;
- the integration means integrates the norm of the change calculated by the calculation means;
- the signal processing apparatus according to any one of appendices 1 to 3, wherein the analysis means determines the presence of a male voice based on the integrated value.
- (Appendix 5) The signal processing apparatus according to appendix 1 or 2, wherein the analysis means determines the presence of the voice of a specific person by comparing the integrated value with a prestored integrated value relating to the voice of the specific person.
- (Appendix 6) A signal processing method comprising: a conversion step of converting an input signal into an amplitude component signal in the frequency domain; a calculation step of calculating a norm of a change of the amplitude component signal in the frequency direction; and an integration step of integrating the norm of the change calculated in the calculation step.
- (Appendix 7) A signal processing program for causing a computer to execute: a conversion step of converting an input signal into an amplitude component signal in the frequency domain; a calculation step of calculating a norm of a change of the amplitude component signal in the frequency direction; and an integration step of integrating the norm of the change calculated in the calculation step.
Landscapes
- Engineering & Computer Science (AREA)
- Physics & Mathematics (AREA)
- Health & Medical Sciences (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Human Computer Interaction (AREA)
- Acoustics & Sound (AREA)
- Multimedia (AREA)
- Computational Linguistics (AREA)
- Signal Processing (AREA)
- Spectroscopy & Molecular Physics (AREA)
- Circuit For Audible Band Transducer (AREA)
- Telephone Function (AREA)
Abstract
Description
Conversion means for converting an input signal into an amplitude component signal in the frequency domain;
calculation means for calculating a norm of the change of the amplitude component signal in the frequency direction;
integration means for integrating the norm of the change calculated by the calculation means; and
analysis means for analyzing the sound in the input signal according to the integrated value calculated by the integration means
are provided.
A conversion step of converting an input signal into an amplitude component signal in the frequency domain;
a calculation step of calculating a norm of the change of the amplitude component signal in the frequency direction; and
an integration step of integrating the norm of the change calculated in the calculation step
are provided.
A computer is caused to execute:
a conversion step of converting an input signal into an amplitude component signal in the frequency domain;
a calculation step of calculating a norm of the change of the amplitude component signal in the frequency direction; and
an integration step of integrating the norm of the change calculated in the calculation step.
A signal processing apparatus 100 according to a first embodiment of the present invention will be described with reference to FIG. 1. The signal processing apparatus 100 is an apparatus that determines the presence of voice in an input signal.
Next, a signal processing apparatus according to a second embodiment of the present invention will be described with reference to FIG. 2. FIG. 2 is a diagram for explaining the functional configuration of the signal processing apparatus according to this embodiment.
Next, a signal processing apparatus according to a third embodiment of the present invention will be described with reference to FIG. 3. FIG. 3 is a diagram for explaining the functional configuration of the signal processing apparatus according to this embodiment.
Next, a signal processing apparatus according to a fourth embodiment of the present invention will be described with reference to FIG. 4. FIG. 4 is a diagram for explaining the functional configuration of the signal processing apparatus according to this embodiment.
Next, a signal processing apparatus according to a fifth embodiment of the present invention will be described. The signal processing apparatus according to this embodiment appropriately suppresses non-stationary noise such as wind noise. Briefly, in the frequency domain, the stationary component of the input sound is estimated, and part or all of the input sound is replaced with the estimated stationary component. The input sound here is not limited to voice. For example, besides voice, environmental sounds (street crowds, running trains and cars, alarms and warning sounds, applause, and so on), human and animal voices (birdsong, barking dogs and mewing cats, laughter, crying, cheers, and so on), and music may be used as the input sound. In this embodiment, voice is described as a representative example of the input sound.
FIG. 7 is a block diagram showing the configuration of the conversion unit 501. As shown in FIG. 7, the conversion unit 501 includes a frame division unit 711, a windowing unit 712, and a Fourier transform unit 713. The degraded signal samples are supplied to the frame division unit 711 and divided into frames of K/2 samples each, where K is an even number. The degraded signal samples divided into frames are supplied to the windowing unit 712 and multiplied by a window function w(t). The signal obtained by windowing the input signal x(t, n) (t = 0, 1, ..., K/2-1) of the n-th frame with w(t) is given by the following equation.
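As a concrete illustration of the frame division unit 711 and the windowing unit 712, the sketch below splits a sample stream into K/2-sample frames and applies a window function. The Hann window and the non-overlapping framing are simplifying assumptions for the sketch; the text only requires that K be even and that some window w(t) be applied.

```python
import numpy as np

def frames_windowed(x, K):
    """Split degraded-signal samples into frames of K/2 samples (unit 711)
    and multiply each frame by a window function w(t) (unit 712)."""
    assert K % 2 == 0, "K must be even"
    L = K // 2
    n_frames = len(x) // L
    w = np.hanning(L)  # illustrative choice of w(t)
    return np.stack([w * x[i * L:(i + 1) * L] for i in range(n_frames)])
```

For example, 8 samples with K = 8 yield two 4-sample windowed frames, each tapering to zero at its edges.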
FIG. 8 is a block diagram showing the configuration of the inverse conversion unit 504. As shown in FIG. 8, the inverse conversion unit 504 includes an inverse Fourier transform unit 811, a windowing unit 812, and a frame synthesis unit 813. The inverse Fourier transform unit 811 multiplies the enhanced signal amplitude spectrum (|Y(k, n)|) (Y in the figure) supplied from the replacement unit 503 by the degraded signal phase spectrum 520 (arg X(k, n)) supplied from the conversion unit 501 to obtain the enhanced signal spectrum (the left-hand side of the following equation).
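The amplitude/phase recombination performed by the inverse Fourier transform unit 811 can be written in a couple of lines. Here `X_complex` stands for the degraded-signal spectrum whose phase arg X(k, n) is reused; the names are illustrative.

```python
import numpy as np

def enhanced_spectrum(Y_amp, X_complex):
    """Combine the enhanced amplitude |Y(k, n)| with the degraded-signal phase
    arg X(k, n) to form the complex enhanced signal spectrum (unit 811)."""
    return Y_amp * np.exp(1j * np.angle(X_complex))
```

The result has the new amplitude but the original phase, which is what makes the subsequent inverse Fourier transform and overlap-add synthesis possible.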
The stationary component spectrum is the stationary component contained in the input signal amplitude spectrum. The stationary component is characterized in that the temporal change of its power is smaller than that of the input signal. The temporal change is generally calculated as a difference or a ratio. When the temporal change is calculated as a difference, comparing the input signal amplitude spectrum with the stationary component spectrum at a certain frame n, there exists at least one frequency k that satisfies the relation of the following equation.
In other words, if the left-hand side of the above equation is always larger than the right-hand side for every frame n and frequency k, it can be defined that N(k, n) is not a stationary component spectrum. The same definition holds when the function is an exponential, logarithm, or power of X and N.
For estimating the stationary component spectrum N(k, n) in the stationary component estimation unit 502, various estimation methods can be used, such as those described in Non-Patent Document 1 and Non-Patent Document 2.
FIG. 9 is a diagram showing an example configuration of the voice detection units 505 and 605. The voice detection units 505 and 605 include a frequency-direction difference calculation unit 902, an integration unit 903, and an analysis unit 904. The frequency-direction difference calculation unit 902 calculates the norm of the change of the amplitude component signal in the frequency direction. The change in the frequency direction mainly means the difference or ratio between adjacent frequency components. For example, when the change is defined as the difference and the amplitude component signal is |X(k, n)| (where k is the frequency index and n is the frame index), the norm D(k, n) of the change in the frequency direction is calculated as D(k, n) = Lm(|X(k-1, n)| - |X(k, n)|), where Lm(·) denotes the Lm norm. Besides 1 and 2, m may be infinity. For the L1 norm, D(k, n) is the absolute value of the difference, i.e., D(k, n) = ||X(k-1, n)| - |X(k, n)||.
FIGS. 10 and 11 show the degraded signal amplitude spectrum (input signal spectrum) |X(k, n)| at a certain time n. As in FIG. 10, when the sum of the norms of the amplitude differences between adjacent frequencies is small, the signal is judged to be noise. On the other hand, as in FIG. 11, when the sum of the norms of the amplitude differences between adjacent frequencies is large, the signal is judged to be voice (desired voice, target sound).
FIG. 12 is a diagram showing another example configuration of the voice detection units 505 and 605. In this example, the voice detection units 505 and 605 include frequency-direction difference calculation units 1202 and 1212, integration units 1203 and 1213, an analysis unit 1204, a frequency-direction smoothing unit 1205, and a time-direction smoothing unit 1215. The frequency-direction smoothing unit 1205 smooths the degraded amplitude component signal |X(k, n)| in the frequency direction.
FIG. 13 is a diagram for explaining the difference in spectral shape by gender. As can be seen by comparing graphs 1301 and 1302, female and child voices fluctuate more gently than male voices. Since the fluctuations of a male voice are densely packed, smoothing in the frequency direction turns it into a smooth curve, resulting in a waveform similar to noise. In other words, by using the frequency-direction smoothing unit 1205, female and child voices can be accurately extracted. On the other hand, since female and child voices fluctuate gently, smoothing in the time direction turns them into smooth curves, resulting in waveforms similar to noise. In other words, by using the time-direction smoothing unit 1215, male voices can be accurately extracted.
FIG. 14 is a diagram showing the change in the spectral shape of the output signal Y(k, n) according to the value of p(k, n). The upper graph of FIG. 14 represents the case where p(k, n) is close to 1 (= voice), and the processing result Y(k, n) has a spectral shape closer to the input signal |X(k, n)|. On the other hand, the lower graph of FIG. 14 represents the case where p(k, n) is close to 0 (= non-voice), and the processing result Y(k, n) has a spectral shape closer to the stationary component signal N(k, n).
For the coefficient α(k, n) multiplied with the stationary component signal N(k, n) in the replacement unit 503 shown in FIG. 5, an appropriate value is determined empirically. For example, if α(k, n) = 1, then Y(k, n) = N(k, n), and the stationary component signal N(k, n) becomes the output signal to the inverse conversion unit 504 as it is. In this case, if the stationary component signal N(k, n) is large, loud noise remains. Therefore, α(k, n) may be set so that the maximum value of the amplitude component signal output to the inverse conversion unit 504 is at most a predetermined value. For example, α(k, n) = 0.5 means replacement with a stationary component signal at half the power. With α(k, n) = 0.1, the sound becomes quieter while keeping the same spectral shape as the stationary component signal N(k, n).
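One way to realize the "cap the maximum output amplitude" idea above is to derive α from the ceiling directly. The `max_amp` parameter and the capping rule below are illustrative; the text only says α(k, n) is chosen empirically.

```python
import numpy as np

def replace_with_stationary(N_spec, max_amp):
    """Output Y(k, n) = alpha * N(k, n), with alpha capped so that the largest
    amplitude handed to the inverse conversion unit does not exceed max_amp."""
    alpha = min(1.0, max_amp / float(np.max(N_spec)))
    return alpha * N_spec
```

Because a single scalar α scales the whole spectrum, the output keeps the spectral shape of N(k, n) while its loudness is bounded.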
Next, a signal processing apparatus according to a sixth embodiment of the present invention will be described with reference to FIG. 15. FIG. 15 is a diagram for explaining the configuration of the replacement unit 503 of the signal processing apparatus according to this embodiment. The replacement unit 503 according to this embodiment differs from the fifth embodiment in that it includes a comparison unit 1531 and an upper replacement unit 1532. Since the other configurations and operations are the same as those of the fifth embodiment, the same reference numerals are given to the same configurations and operations, and detailed description thereof is omitted.
The method of calculating the spectrum used for comparison with the degraded signal amplitude spectrum |X(k, n)| is not limited to a linear mapping of the stationary component spectrum N(k, n). For example, a linear function such as α1(k, n)N(k, n) + C(k, n) can also be adopted. In that case, setting C(k, n) < 0 increases the band replaced with the stationary component signal, so more of the annoying non-stationary noise can be suppressed. It is also possible to use functions of the stationary component spectrum N(k, n) expressed in other forms, such as higher-order polynomial functions or nonlinear functions.
(1) The short-time moving average X_bar(k, n) of the input signal (k and n are the indices corresponding to frequency and time, respectively) is computed in advance, for example as follows: |X_bar(k, n)| = (|X(k, n-2)| + |X(k, n-1)| + |X(k, n)| + |X(k, n+1)| + |X(k, n+2)|)/5. (2) The difference between the short-time moving average (|X_bar(k, n)|) and the value after replacement (α2(k, n)·N(k, n)) is calculated, and if the difference is large, the value of α2(k, n) is changed so that the difference becomes smaller. Denoting the changed value by α2_hat(k, n), the following changing methods are conceivable: (a) uniformly set α2_hat(k, n) = 0.5·α2(k, n) (multiply by a predetermined constant); (b) set α2_hat(k, n) = |X_bar(k, n)|/|N(k, n)| (computed using |X_bar(k, n)| and |N(k, n)|); (c) set α2_hat(k, n) = 0.8·|X_bar(k, n)|/|N(k, n)| + 0.2 (same as above).
Formula 1: α2(k, n-1) = |X_bar(k, n)|/N(k, n)
Formula 2: α2(k, n) = |X_bar(k, n)|/N(k, n)
Formula 3: α2(k, n+1) = |X_bar(k, n)|/N(k, n)
Thus, when the stationary component signal N(k, n) cannot fully suppress short-time "jumps" of the amplitude component signal, sound quality can also be improved by performing the replacement using the short-time moving average.
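Procedures (1) and (2), with their variants (a) through (c), can be sketched as follows. The function names are illustrative, and the amplitude signal is held as a 2-D array indexed (k, n) for simplicity.

```python
import numpy as np

def short_time_moving_average(X_amp, k, n):
    """Step (1): 5-frame moving average |X_bar(k, n)| of the amplitude signal.
    X_amp has shape (K, N); n must leave two frames of margin on each side."""
    return float(np.mean(X_amp[k, n - 2:n + 3]))

def adjust_alpha2(alpha2, x_bar, N_kn, method="b"):
    """Step (2): when alpha2 * N(k, n) strays from |X_bar(k, n)|, shrink the gap
    using variant (a), (b) or (c) from the text."""
    if method == "a":
        return 0.5 * alpha2              # (a) multiply by a predetermined constant
    if method == "b":
        return x_bar / N_kn              # (b) exact ratio |X_bar| / |N|
    return 0.8 * x_bar / N_kn + 0.2      # (c) damped version of (b)
```

Variant (b) makes the replaced value coincide with the moving average exactly, while (c) pulls it most of the way there, which is gentler when N(k, n) is itself noisy.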
Next, a signal processing apparatus according to a seventh embodiment of the present invention will be described with reference to FIG. 16. FIG. 16 is a diagram for explaining the configuration of the replacement unit 503 of the signal processing apparatus according to this embodiment. The replacement unit 503 according to this embodiment differs from the fifth embodiment in that it includes a comparison unit 1631 and a lower replacement unit 1632. Since the other configurations and operations are the same as those of the fifth embodiment, the same reference numerals are given to the same configurations and operations, and detailed description thereof is omitted.
Only where the component |X(k, n)| is smaller than β1(k, n) times the stationary component signal N(k, n) is it replaced with β2(k, n) times the stationary component signal N(k, n); where it is larger, the spectral shape is kept as it is as the output signal Y(k, n) of the replacement unit 603. That is, the lower replacement unit 1632 receives the voice presence/absence signal (0/1) from the voice detection unit 505; if the section is non-voice and |X(k, n)| < β1(k, n)N(k, n), then |Y(k, n)| = β2(k, n)N(k, n), otherwise |Y(k, n)| = |X(k, n)|.
(1) The short-time moving average X_bar(k, n) of the input signal (k and n are the indices corresponding to frequency and time, respectively) is computed in advance, for example as follows: X_bar(k, n) = (X(k, n-2) + X(k, n-1) + X(k, n) + X(k, n+1) + X(k, n+2))/5. (2) The difference between the short-time moving average (X_bar(k, n)) and the value after replacement (β2(k, n)·N(k, n)) is calculated, and if the difference is large, the value of β2(k, n) is changed so that the difference becomes smaller. Denoting the changed value by β2_hat(k, n), the following changing methods are conceivable: (a) uniformly set β2_hat(k, n) = 0.5·β2(k, n) (multiply by a predetermined constant); (b) set β2_hat(k, n) = X_bar(k, n)/N(k, n) (computed using X_bar(k, n) and N(k, n)); (c) set β2_hat(k, n) = 0.8·X_bar(k, n)/N(k, n) + 0.2 (same as above).
Formula 1: β2(k, n-1) = X_bar(k, n)/N(k, n)
Formula 2: β2(k, n) = X_bar(k, n)/N(k, n)
Formula 3: β2(k, n+1) = X_bar(k, n)/N(k, n)
Thus, when the stationary component signal N(k, n) cannot fully suppress short-time "jumps" of the amplitude component, sound quality can also be improved by performing the replacement using the short-time moving average.
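The lower replacement rule of the seventh embodiment — replace |X(k, n)| with β2(k, n)N(k, n) only in non-voice sections where |X(k, n)| < β1(k, n)N(k, n) — is a one-liner with NumPy. Scalar β1 and β2 and the function name are illustrative simplifications.

```python
import numpy as np

def lower_replace(X_amp, N_spec, beta1, beta2, voiced):
    """Lower replacement unit 1632: in non-voice sections (voiced == False),
    bins below beta1 * N(k, n) are replaced with beta2 * N(k, n);
    larger bins and all voice sections pass through unchanged."""
    if voiced:
        return X_amp.copy()
    return np.where(X_amp < beta1 * N_spec, beta2 * N_spec, X_amp)
```

This fills in spectral "valleys" with a floor proportional to the stationary component while leaving prominent components intact.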
Next, a signal processing apparatus according to an eighth embodiment of the present invention will be described with reference to FIG. 17. FIG. 17 is a diagram for explaining the configuration of the replacement unit 503 of the signal processing apparatus according to this embodiment. The replacement unit 503 according to this embodiment differs from the sixth embodiment in that it includes a second comparison unit 1733 and a lower replacement unit 1734. Since the other configurations and operations are the same as those of the fifth embodiment, the same reference numerals are given to the same configurations and operations, and detailed description thereof is omitted.
Next, a signal processing apparatus according to a ninth embodiment of the present invention will be described with reference to FIG. 18. FIG. 18 is a diagram for explaining the configuration of the replacement unit 503 of the signal processing apparatus according to this embodiment. The replacement unit 503 according to this embodiment differs from the sixth embodiment in that the upper replacement unit 1832 performs the replacement using α(k, n) times the degraded amplitude signal |X(k, n)|. Since the other configurations and operations are the same as those of the third embodiment, the same reference numerals are given to the same configurations and operations, and detailed description thereof is omitted.
Next, a signal processing apparatus according to a tenth embodiment of the present invention will be described with reference to FIG. 19. FIG. 19 is a diagram for explaining the configuration of the replacement unit 503 of the signal processing apparatus according to this embodiment. The replacement unit 503 according to this embodiment differs from the eighth embodiment in that the upper replacement unit 1932, like the upper replacement unit 1832 of the ninth embodiment, performs the replacement using α2(k, n) times the degraded amplitude signal |X(k, n)|. Since the other configurations and operations are the same as those of the eighth embodiment, the same reference numerals are given to the same configurations and operations, and detailed description thereof is omitted.
The application fields of the voice detection described in the first embodiment include the following, as described in Section 2.2 of Non-Patent Document 1.
(1) If the signal in non-voice sections is removed from the input signal and only the voice sections are encoded and transmitted, a reduction in transfer charges can be realized. Alternatively, if the bit rate is changed between voice sections and non-voice sections at the time of encoding, more effective and higher-quality information communication can be performed.
(2) By separating noise suppression, dereverberation, sound source separation, and echo-canceller processing between non-voice sections and voice sections, signal processing can be performed with high performance.
(3) When applying speech recognition technology, recognition errors can be reduced by separating voice sections from non-voice sections and making only the voice sections subject to recognition.
(2) When analyzing the audio data of a meeting attended by multiple people, determining who spoke when.
(3) When automatically creating subtitles for television broadcasts, movies, and the like, determining who spoke when.
Although the present invention has been described above with reference to the embodiments, the present invention is not limited to the above embodiments. Various changes that can be understood by those skilled in the art can be made to the configuration and details of the present invention within the scope of the present invention. Systems or apparatuses that combine the separate features included in the respective embodiments in any way are also included in the scope of the present invention.
Part or all of the above embodiments may also be described as in the following supplementary notes, but are not limited to the following.
(Appendix 1)
A signal processing apparatus comprising:
conversion means for converting an input signal into an amplitude component signal in the frequency domain;
calculation means for calculating a norm of a change of the amplitude component signal in the frequency direction;
integration means for integrating the norm of the change calculated by the calculation means; and
analysis means for analyzing sound in the input signal according to the integrated value calculated by the integration means.
(Appendix 2)
The signal processing apparatus according to appendix 1, wherein the analysis means determines the presence of sound in the input signal according to the integrated value.
(Appendix 3)
The signal processing apparatus according to appendix 1 or 2, further comprising frequency-direction smoothing means for smoothing the amplitude component signal in the frequency direction, wherein
the calculation means calculates a norm of the change in the frequency direction of the amplitude component signal smoothed by the frequency-direction smoothing means,
the integration means integrates the norm of the change calculated by the calculation means, and
the analysis means determines the presence of a female voice or a child voice based on the integrated value.
(Appendix 4)
The signal processing apparatus according to any one of appendices 1 to 3, further comprising time-direction smoothing means for smoothing the amplitude component signal in the time direction, wherein
the calculation means calculates a norm of the change in the frequency direction of the amplitude component signal smoothed by the time-direction smoothing means,
the integration means integrates the norm of the change calculated by the calculation means, and
the analysis means determines the presence of a male voice based on the integrated value.
(Appendix 5)
The signal processing apparatus according to appendix 1 or 2, wherein the analysis means determines the presence of the voice of a specific person by comparing the integrated value with a prestored integrated value relating to the voice of the specific person.
(Appendix 6)
A signal processing method comprising:
a conversion step of converting an input signal into an amplitude component signal in the frequency domain;
a calculation step of calculating a norm of a change of the amplitude component signal in the frequency direction; and
an integration step of integrating the norm of the change calculated in the calculation step.
(Appendix 7)
A signal processing program for causing a computer to execute:
a conversion step of converting an input signal into an amplitude component signal in the frequency domain;
a calculation step of calculating a norm of a change of the amplitude component signal in the frequency direction; and
an integration step of integrating the norm of the change calculated in the calculation step.
Claims (7)
- A signal processing apparatus comprising:
conversion means for converting an input signal into an amplitude component signal in the frequency domain;
calculation means for calculating a norm of a change of the amplitude component signal in the frequency direction;
integration means for integrating the norm of the change calculated by the calculation means; and
analysis means for analyzing sound in the input signal according to the integrated value calculated by the integration means.
- The signal processing apparatus according to claim 1, wherein the analysis means determines the presence of sound in the input signal according to the integrated value.
- The signal processing apparatus according to claim 1 or 2, further comprising frequency-direction smoothing means for smoothing the amplitude component signal in the frequency direction, wherein the calculation means calculates a norm of the change in the frequency direction of the amplitude component signal smoothed by the frequency-direction smoothing means, the integration means integrates the norm of the change calculated by the calculation means, and the analysis means determines the presence of a female voice or a child voice based on the integrated value.
- The signal processing apparatus according to any one of claims 1 to 3, further comprising time-direction smoothing means for smoothing the amplitude component signal in the time direction, wherein the calculation means calculates a norm of the change in the frequency direction of the amplitude component signal smoothed by the time-direction smoothing means, the integration means integrates the norm of the change calculated by the calculation means, and the analysis means determines the presence of a male voice based on the integrated value.
- The signal processing apparatus according to claim 1 or 2, wherein the analysis means determines the presence of the voice of a specific person by comparing the integrated value with a prestored integrated value relating to the voice of the specific person.
- A signal processing method comprising: a conversion step of converting an input signal into an amplitude component signal in the frequency domain; a calculation step of calculating a norm of a change of the amplitude component signal in the frequency direction; and an integration step of integrating the norm of the change calculated in the calculation step.
- A signal processing program for causing a computer to execute: a conversion step of converting an input signal into an amplitude component signal in the frequency domain; a calculation step of calculating a norm of a change of the amplitude component signal in the frequency direction; and an integration step of integrating the norm of the change calculated in the calculation step.
Priority Applications (4)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201480020787.6A CN105103230B (zh) | 2013-04-11 | 2014-03-27 | 信号处理装置、信号处理方法、信号处理程序 |
EP14782146.6A EP2985762A4 (en) | 2013-04-11 | 2014-03-27 | SIGNAL PROCESSING DEVICE, SIGNAL PROCESSING METHOD, AND SIGNAL PROCESSING PROGRAM |
US14/782,928 US10431243B2 (en) | 2013-04-11 | 2014-03-27 | Signal processing apparatus, signal processing method, signal processing program |
JP2015511205A JP6439682B2 (ja) | 2013-04-11 | 2014-03-27 | 信号処理装置、信号処理方法および信号処理プログラム |
Applications Claiming Priority (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
JP2013-083412 | 2013-04-11 | ||
JP2013083412 | 2013-04-11 |
Publications (1)
Publication Number | Publication Date |
---|---|
WO2014168022A1 true WO2014168022A1 (ja) | 2014-10-16 |
Family
ID=51689433
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
PCT/JP2014/058962 WO2014168022A1 (ja) | 2013-04-11 | 2014-03-27 | 信号処理装置、信号処理方法および信号処理プログラム |
Country Status (5)
Country | Link |
---|---|
US (1) | US10431243B2 (ja) |
EP (1) | EP2985762A4 (ja) |
JP (1) | JP6439682B2 (ja) |
CN (1) | CN105103230B (ja) |
WO (1) | WO2014168022A1 (ja) |
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN114242098A (zh) * | 2021-12-13 | 2022-03-25 | 北京百度网讯科技有限公司 | 一种语音增强方法、装置、设备以及存储介质 |
Families Citing this family (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US9830925B2 (en) * | 2014-10-22 | 2017-11-28 | GM Global Technology Operations LLC | Selective noise suppression during automatic speech recognition |
EP3223279B1 (en) * | 2016-03-21 | 2019-01-09 | Nxp B.V. | A speech signal processing circuit |
US10535360B1 (en) * | 2017-05-25 | 2020-01-14 | Tp Lab, Inc. | Phone stand using a plurality of directional speakers |
CN113986187B (zh) * | 2018-12-28 | 2024-05-17 | 阿波罗智联(北京)科技有限公司 | 音区幅值获取方法、装置、电子设备及存储介质 |
CN112152731B (zh) * | 2020-09-08 | 2023-01-20 | 重庆邮电大学 | 一种基于分形维数的无人机探测与识别方法 |
CN112528853B (zh) * | 2020-12-09 | 2021-11-02 | 云南电网有限责任公司昭通供电局 | 改进型双树复小波变换去噪方法 |
Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
JP2002236494A (ja) * | 2001-02-09 | 2002-08-23 | Denso Corp | 音声区間判別装置、音声認識装置、プログラム及び記録媒体 |
JP2004272052A (ja) * | 2003-03-11 | 2004-09-30 | Fujitsu Ltd | 音声区間検出装置 |
JP2013005418A (ja) | 2011-06-22 | 2013-01-07 | Canon Inc | 撮像装置及び再生装置 |
Family Cites Families (27)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US5189701A (en) * | 1991-10-25 | 1993-02-23 | Micom Communications Corp. | Voice coder/decoder and methods of coding/decoding |
US6978236B1 (en) * | 1999-10-01 | 2005-12-20 | Coding Technologies Ab | Efficient spectral envelope coding using variable time/frequency resolution and time/frequency switching |
JP3454206B2 (ja) * | 1999-11-10 | 2003-10-06 | 三菱電機株式会社 | 雑音抑圧装置及び雑音抑圧方法 |
DE60108104T2 (de) | 2001-07-24 | 2005-12-15 | Sony International (Europe) Gmbh | Verfahren zur Sprecheridentifikation |
US7240007B2 (en) * | 2001-12-13 | 2007-07-03 | Matsushita Electric Industrial Co., Ltd. | Speaker authentication by fusion of voiceprint match attempt results with additional information |
US8321427B2 (en) * | 2002-10-31 | 2012-11-27 | Promptu Systems Corporation | Method and apparatus for generation and augmentation of search terms from external and internal sources |
WO2004111996A1 (ja) * | 2003-06-11 | 2004-12-23 | Matsushita Electric Industrial Co., Ltd. | 音響区間検出方法および装置 |
SG120121A1 (en) * | 2003-09-26 | 2006-03-28 | St Microelectronics Asia | Pitch detection of speech signals |
EP1806739B1 (en) | 2004-10-28 | 2012-08-15 | Fujitsu Ltd. | Noise suppressor |
JP4753821B2 (ja) * | 2006-09-25 | 2011-08-24 | 富士通株式会社 | 音信号補正方法、音信号補正装置及びコンピュータプログラム |
JP4264841B2 (ja) * | 2006-12-01 | 2009-05-20 | ソニー株式会社 | 音声認識装置および音声認識方法、並びに、プログラム |
WO2009027980A1 (en) * | 2007-08-28 | 2009-03-05 | Yissum Research Development Company Of The Hebrew University Of Jerusalem | Method, device and system for speech recognition |
JPWO2009084221A1 (ja) * | 2007-12-27 | 2011-05-12 | パナソニック株式会社 | 符号化装置、復号装置およびこれらの方法 |
US8306817B2 (en) * | 2008-01-08 | 2012-11-06 | Microsoft Corporation | Speech recognition with non-linear noise reduction on Mel-frequency cepstra |
AU2009290150B2 (en) * | 2008-09-05 | 2011-11-03 | Auraya Pty Ltd | Voice authentication system and methods |
US8332223B2 (en) * | 2008-10-24 | 2012-12-11 | Nuance Communications, Inc. | Speaker verification methods and apparatus |
JP5459220B2 (ja) * | 2008-11-27 | 2014-04-02 | 日本電気株式会社 | 発話音声検出装置 |
JP5293329B2 (ja) * | 2009-03-26 | 2013-09-18 | 富士通株式会社 | 音声信号評価プログラム、音声信号評価装置、音声信号評価方法 |
JP5223786B2 (ja) * | 2009-06-10 | 2013-06-26 | 富士通株式会社 | 音声帯域拡張装置、音声帯域拡張方法及び音声帯域拡張用コンピュータプログラムならびに電話機 |
JP5267362B2 (ja) * | 2009-07-03 | 2013-08-21 | 富士通株式会社 | オーディオ符号化装置、オーディオ符号化方法及びオーディオ符号化用コンピュータプログラムならびに映像伝送装置 |
US20110125494A1 (en) * | 2009-11-23 | 2011-05-26 | Cambridge Silicon Radio Limited | Speech Intelligibility |
GB2476043B (en) * | 2009-12-08 | 2016-10-26 | Skype | Decoding speech signals |
US8831942B1 (en) * | 2010-03-19 | 2014-09-09 | Narus, Inc. | System and method for pitch based gender identification with suspicious speaker detection |
JP5834449B2 (ja) * | 2010-04-22 | 2015-12-24 | 富士通株式会社 | 発話状態検出装置、発話状態検出プログラムおよび発話状態検出方法 |
CN102737480B (zh) * | 2012-07-09 | 2014-03-05 | 广州市浩云安防科技股份有限公司 | 一种基于智能视频的异常语音监控系统及方法 |
US8924209B2 (en) * | 2012-09-12 | 2014-12-30 | Zanavox | Identifying spoken commands by templates of ordered voiced and unvoiced sound intervals |
WO2014094242A1 (en) * | 2012-12-18 | 2014-06-26 | Motorola Solutions, Inc. | Method and apparatus for mitigating feedback in a digital radio receiver |
- 2014
- 2014-03-27 CN CN201480020787.6A patent/CN105103230B/zh active Active
- 2014-03-27 WO PCT/JP2014/058962 patent/WO2014168022A1/ja active Application Filing
- 2014-03-27 US US14/782,928 patent/US10431243B2/en active Active
- 2014-03-27 EP EP14782146.6A patent/EP2985762A4/en not_active Withdrawn
- 2014-03-27 JP JP2015511205A patent/JP6439682B2/ja active Active
Patent Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
JP2002236494A (ja) * | 2001-02-09 | 2002-08-23 | Denso Corp | 音声区間判別装置、音声認識装置、プログラム及び記録媒体 |
JP2004272052A (ja) * | 2003-03-11 | 2004-09-30 | Fujitsu Ltd | 音声区間検出装置 |
JP2013005418A (ja) | 2011-06-22 | 2013-01-07 | Canon Inc | 撮像装置及び再生装置 |
Non-Patent Citations (5)
Title |
---|
DOUGLAS A. REYNOLDS; THOMAS F. QUATIERI; ROBERT B. DUNN: "Speaker Verification Using Adapted Gaussian Mixture Models", DIGITAL SIGNAL PROCESSING, vol. 10, 2000, pages 19 - 41, XP055282688, DOI: doi:10.1006/dspr.1999.0361 |
KEN HANAZAWA; RYOSUKE ISOTANI: "Gender-Independent Speech Recognition by Look-Ahead Model Selection", PROCEEDINGS OF THE ACOUSTICAL SOCIETY OF JAPAN, September 2004 (2004-09-01), pages 197 - 198 |
MASAKIYO FUJIMOTO: "The Fundamentals and Recent Progress of Voice Activity Detection", IEICE TECHNICAL REPORT SP2010-23, June 2010 (2010-06-01) |
See also references of EP2985762A4 * |
TSUNEO KATO; SHINGO KUROIWA; TOHRU SHIMIZU; NORIO HIGUCHI: "Tree-Based Clustering for Gaussian Mixture HMMs", IEICE TRANSACTIONS ON FUNDAMENTALS OF ELECTRONICS, COMMUNICATIONS AND COMPUTER SCIENCES D-II, vol. J83-D-II, no. 11, November 2000 (2000-11-01), pages 2128 - 2136 |
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN114242098A (zh) * | 2021-12-13 | 2022-03-25 | 北京百度网讯科技有限公司 | 一种语音增强方法、装置、设备以及存储介质 |
CN114242098B (zh) * | 2021-12-13 | 2023-08-29 | 北京百度网讯科技有限公司 | 一种语音增强方法、装置、设备以及存储介质 |
Also Published As
Publication number | Publication date |
---|---|
EP2985762A1 (en) | 2016-02-17 |
EP2985762A4 (en) | 2016-11-23 |
CN105103230A (zh) | 2015-11-25 |
CN105103230B (zh) | 2020-01-03 |
US20160071529A1 (en) | 2016-03-10 |
JP6439682B2 (ja) | 2018-12-19 |
US10431243B2 (en) | 2019-10-01 |
JPWO2014168022A1 (ja) | 2017-02-16 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
JP6439682B2 (ja) | 信号処理装置、信号処理方法および信号処理プログラム | |
Li et al. | Glance and gaze: A collaborative learning framework for single-channel speech enhancement | |
US8655656B2 (en) | Method and system for assessing intelligibility of speech represented by a speech signal | |
JP5127754B2 (ja) | 信号処理装置 | |
WO2021114733A1 (zh) | 一种分频段进行处理的噪声抑制方法及其系统 | |
Nikzad et al. | Deep residual-dense lattice network for speech enhancement | |
JP6544234B2 (ja) | 信号処理装置、信号処理方法および信号処理プログラム | |
Zheng et al. | Sixty years of frequency-domain monaural speech enhancement: From traditional to deep learning methods | |
Bach et al. | Robust speech detection in real acoustic backgrounds with perceptually motivated features | |
CN103258537A (zh) | 利用特征结合对语音情感进行识别的方法及其装置 | |
Min et al. | Mask estimate through Itakura-Saito nonnegative RPCA for speech enhancement | |
Zhang et al. | Monaural speech enhancement using a multi-branch temporal convolutional network | |
JP5443547B2 (ja) | 信号処理装置 | |
Dong et al. | Towards real-world objective speech quality and intelligibility assessment using speech-enhancement residuals and convolutional long short-term memory networks | |
Saleem et al. | Variance based time-frequency mask estimation for unsupervised speech enhancement | |
CN104036785A (zh) | 语音信号的处理方法和装置、以及语音信号的分析系统 | |
TWI749547B (zh) | 應用深度學習的語音增強系統 | |
Hussain et al. | A speech intelligibility enhancement model based on canonical correlation and deep learning for hearing-assistive technologies | |
Faycal et al. | Comparative performance study of several features for voiced/non-voiced classification | |
Hamid et al. | Single Channel Speech Enhancement Using Adaptive Soft‐Thresholding with Bivariate EMD | |
US11176957B2 (en) | Low complexity detection of voiced speech and pitch estimation | |
Paul et al. | Effective Pitch Estimation using Canonical Correlation Analysis | |
JPH01255000A (ja) | 音声認識システムに使用されるテンプレートに雑音を選択的に付加するための装置及び方法 | |
Sapozhnykov | Sub-band detector for wind-induced noise | |
US20230419980A1 (en) | Information processing device, and output method |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
WWE | Wipo information: entry into national phase |
Ref document number: 201480020787.6 Country of ref document: CN |
|
121 | Ep: the epo has been informed by wipo that ep was designated in this application |
Ref document number: 14782146 Country of ref document: EP Kind code of ref document: A1 |
|
ENP | Entry into the national phase |
Ref document number: 2015511205 Country of ref document: JP Kind code of ref document: A |
|
WWE | Wipo information: entry into national phase |
Ref document number: 14782928 Country of ref document: US |
|
NENP | Non-entry into the national phase |
Ref country code: DE |
|
WWE | Wipo information: entry into national phase |
Ref document number: 2014782146 Country of ref document: EP |