WO2013145578A1 - Audio processing device, audio processing method, and audio processing program
- Publication number
- WO2013145578A1 (PCT/JP2013/001448)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- spectrum
- speech
- expected value
- standard pattern
- distance
- Prior art date
Classifications
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/20—Speech recognition techniques specially adapted for robustness in adverse environments, e.g. in noise, of stress induced speech
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
- G10L21/00—Processing of the speech or voice signal to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
- G10L21/02—Speech enhancement, e.g. noise reduction or echo cancellation
- G10L21/0208—Noise filtering
- G10L21/0216—Noise filtering characterised by the method used for estimating noise
- G10L21/0232—Processing in the frequency domain
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
- G10L21/00—Processing of the speech or voice signal to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
- G10L21/02—Speech enhancement, e.g. noise reduction or echo cancellation
- G10L21/0208—Noise filtering
- G10L21/0264—Noise filtering characterised by the type of parameter measurement, e.g. correlation techniques, zero crossing techniques or predictive techniques
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
- G10L21/00—Processing of the speech or voice signal to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
- G10L21/02—Speech enhancement, e.g. noise reduction or echo cancellation
- G10L21/0272—Voice signal separating
- G10L21/028—Voice signal separating using properties of sound source
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
- G10L21/00—Processing of the speech or voice signal to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
- G10L21/02—Speech enhancement, e.g. noise reduction or echo cancellation
- G10L21/0272—Voice signal separating
- G10L21/0308—Voice signal separating characterised by the type of parameter measurement, e.g. correlation techniques, zero crossing techniques or predictive techniques
Definitions
- The present invention relates to a speech processing apparatus, a speech processing method, and a speech processing program that extract only a desired speech in a noisy environment or an environment where a plurality of people are talking.
- The method of Patent Document 1 cannot be expected to be effective when the performance of sound source separation is insufficient, for example due to non-directional environmental noise, reverberation, or the presence of more sound sources than microphones.
- In the method of Patent Document 2, when the performance of sound source separation is insufficient, a large mask value is applied and the desired speech is also erased.
- In the method of Patent Document 3, when the estimated noise value is low and the provisionally estimated speech differs significantly from the original speech, correction using the standard pattern may fail.
- An object of the present invention is to provide a speech processing apparatus, a speech processing method, and a speech processing program capable of accurately acquiring only desired speech from a signal in which the desired speech and other signals are mixed.
- The speech processing apparatus according to the present invention includes a distance calculation unit that calculates, for each of a plurality of standard patterns of mixed signal spectra obtained by combining a plurality of standard patterns of desired speech spectra, a plurality of standard patterns of undesired speech spectra, and a plurality of standard patterns of noise spectra, the distance to the input signal, and a desired speech spectrum expected value calculation unit that calculates an expected value of the desired speech spectrum in the input signal using the distances.
- In the speech processing method according to the present invention, a distance from the input signal is calculated for each of a plurality of standard patterns of mixed signal spectra obtained by combining a plurality of standard patterns of desired speech spectra, a plurality of standard patterns of undesired speech spectra, and a plurality of standard patterns of noise spectra, and an expected value of the desired speech spectrum in the input signal is calculated as output information using the distances.
- The speech processing program according to the present invention causes a computer to execute a process of calculating, for each of a plurality of standard patterns of mixed signal spectra obtained by combining a plurality of standard patterns of desired speech spectra, a plurality of standard patterns of undesired speech spectra, and a plurality of standard patterns of noise spectra, the distance to the input signal, and a process of calculating an expected value of the desired speech spectrum in the input signal as output information using the distances.
- According to the present invention, it is possible to accurately acquire only the desired speech from a signal in which the desired speech and other signals are mixed. In addition, a speech recognition result for only the desired speech can be obtained.
- Embodiment 1. A first embodiment of the present invention will be described below with reference to the drawings.
- FIG. 1 is a block diagram showing a configuration of a first embodiment of a speech processing apparatus according to the present invention.
- the speech processing apparatus 100 includes a spectrum conversion unit 101, a distance calculation unit 102, a standard pattern storage unit 103, and a desired speech spectrum expected value calculation unit 104.
- the speech processing apparatus 100 receives a mixed signal in which desired speech and other signals are mixed, and outputs an estimated value of the desired speech.
- the signal other than the desired voice includes, for example, voice due to speech other than the desired speaker and noise other than voice.
- a speech spectrum based on a desired speaker's utterance is referred to as a desired speech spectrum.
- a speech spectrum generated by speech other than the desired speaker is referred to as an undesired speech spectrum.
- the spectrum of a noise signal other than speech is called a noise spectrum.
- the spectrum conversion unit 101 inputs a mixed signal in which desired speech and other signals are mixed.
- the spectrum conversion unit 101 acquires a spectrum vector of the mixed signal.
- the distance calculation unit 102 calculates the distance between the spectrum vector of the mixed signal acquired by the spectrum conversion unit 101 and the standard pattern of the mixed signal spectrum stored in the standard pattern storage unit 103.
- the standard pattern storage unit 103 stores the standard pattern of the mixed signal spectrum.
- FIG. 2 is an explanatory diagram illustrating an example of information stored in the standard pattern storage unit 103.
- The standard pattern storage unit 103 stores sets each consisting of a standard pattern of a mixed signal spectrum, a standard pattern of a desired speech spectrum, a standard pattern of an undesired speech spectrum, and a standard pattern of a noise spectrum.
- the standard pattern of the spectrum expresses an average shape for a phoneme unit such as “A”, “I”, “U” or a unit obtained by clustering similar spectral shapes.
- μ represents a vector whose elements are the average values of the spectrum magnitude for each frequency band.
- the subscripts Y, A, B, and N indicate a mixed signal spectrum, a desired speech spectrum, an undesired speech spectrum, and a noise spectrum, respectively.
- the subscripts i, j, and k are serial numbers representing phoneme units or clustering units, respectively.
- μ_Yijk represents a mixed signal spectrum obtained by mixing the i-th desired speech spectrum, the j-th undesired speech spectrum, and the k-th noise spectrum.
- FIG. 3 is an explanatory diagram showing the relationship between μ_Yijk and μ_Ai, μ_Bj, μ_Nk.
- the horizontal axis represents the frequency band
- the vertical axis represents the magnitude of the spectrum.
- μ_Yijk and μ_Ai, μ_Bj, μ_Nk have the relationship shown in Equation 1 for each frequency band.
- The number of standard patterns of the mixed signal spectrum is I × J × K. That is, the standard pattern storage unit 103 stores the I × J × K combinations.
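As a rough illustration of how the combined standard patterns could be enumerated, the sketch below builds μ_Yijk from toy values. The additive combination is an assumption: the patent's Equation 1 is not reproduced in this text, and adding magnitude spectra is only a common approximation for independent sources; all numeric values are hypothetical.

```python
import itertools

# Toy standard patterns over 4 frequency bands (all values hypothetical).
mu_A = [[1.0, 2.0, 1.5, 0.5], [0.8, 1.0, 2.0, 1.2]]  # I = 2 desired-speech patterns
mu_B = [[0.5, 0.5, 1.0, 1.0], [1.5, 0.2, 0.3, 0.9]]  # J = 2 undesired-speech patterns
mu_N = [[0.1, 0.1, 0.2, 0.2]]                        # K = 1 noise pattern

# Assumed reading of Equation 1: element-wise additive combination,
# mu_Y[(i, j, k)] = mu_A[i] + mu_B[j] + mu_N[k].
mu_Y = {
    (i, j, k): [a + b + n for a, b, n in zip(mu_A[i], mu_B[j], mu_N[k])]
    for i, j, k in itertools.product(range(len(mu_A)), range(len(mu_B)), range(len(mu_N)))
}

print(len(mu_Y))                               # I * J * K = 4 combinations
print([round(v, 6) for v in mu_Y[(0, 0, 0)]])  # [1.6, 2.6, 2.7, 1.7]
```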
- standard patterns are prepared in advance by using statistical processing and machine learning techniques, and are stored in the standard pattern storage unit 103.
- the standard pattern of the undesired voice spectrum is created based on voice data of an unspecified speaker, for example.
- the standard pattern of the desired speech spectrum is created based on, for example, speech data of a specific speaker.
- the desired speech spectrum expected value calculation unit 104 calculates the expected value of the desired speech spectrum as output information using the distance calculated by the distance calculation unit 102.
- The spectrum conversion unit 101, the distance calculation unit 102, and the desired speech spectrum expected value calculation unit 104 are realized by a CPU (Central Processing Unit) provided in the speech processing apparatus 100. Further, the standard pattern storage unit 103 is realized by a storage device, such as a memory, provided in the speech processing apparatus 100.
- FIG. 4 is a flowchart showing the operation of the first embodiment of the speech processing apparatus.
- The spectrum conversion unit 101 performs short-time spectrum conversion on the mixed signal input to the speech processing apparatus 100.
- the spectrum conversion unit 101 obtains a spectrum for each frequency band, for example, every unit time such as 10 milliseconds (step S101).
- The speech processing apparatus 100 performs the processes of steps S102 and S103 on a vector whose elements are the spectra for each frequency band acquired every unit time, or on subband spectra in which a plurality of frequency bands are bundled. Let Y be the spectrum vector of the mixed signal.
- The distance calculation unit 102 calculates a distance d_ijk between the spectrum vector Y of the mixed signal and each standard pattern μ_Yijk of the mixed signal spectrum stored in the standard pattern storage unit 103 (step S102).
- The distance is calculated for each of the I × J × K standard patterns.
- the distance calculation unit 102 may obtain a general distance between vectors, for example, a Euclidean distance as the distance.
- the distance calculation unit 102 may obtain the distance after converting the spectrum vector into a logarithmic spectrum vector. Further, the distance calculation unit 102 may obtain the distance after converting into a feature amount generally used in speech recognition such as a cepstrum.
- the distance d ijk is a scalar quantity.
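As a sketch of step S102, using the Euclidean distance option mentioned above, the following computes one scalar d_ijk per stored pattern; the pattern values and the vector Y are hypothetical:

```python
import math

def euclidean(u, v):
    """General vector distance, used here as one option for d_ijk."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

# Hypothetical mixed-signal spectrum vector Y and two stored standard patterns.
Y = [1.5, 2.5, 2.5, 1.5]
standard_patterns = {
    (0, 0, 0): [1.6, 2.6, 2.7, 1.7],
    (1, 0, 0): [1.4, 1.6, 3.2, 2.4],
}

# Step S102: one scalar distance d_ijk per standard pattern (I * J * K in total).
d = {ijk: euclidean(Y, mu) for ijk, mu in standard_patterns.items()}
print(min(d, key=d.get))  # (0, 0, 0) -- the closest standard pattern
```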
- desired speech spectrum expected value calculation section 104 calculates an expected value of the desired speech spectrum (step S103).
- The expected value E_A of the desired speech spectrum is calculated as a weighted average of the standard patterns μ_Ai of the desired speech spectrum.
- E_A is a vector quantity. The weight is set small when the distance d_ijk is large and large when d_ijk is small. Equation 2 is an example of a formula for E_A.
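Equation 2 itself is not reproduced in this text; the sketch below uses one plausible weighting consistent with the description (small weight for large d_ijk), namely a softmax over negative distances. The weighting choice and all numeric values are assumptions:

```python
import math

# Distances d_ijk from step S102 and desired-speech patterns mu_A (hypothetical).
d = {(0, 0, 0): 0.3, (1, 0, 0): 1.5}
mu_A = [[1.0, 2.0, 1.5, 0.5], [0.8, 1.0, 2.0, 1.2]]

# Assumed weighting: softmax over negative distances, so weights shrink as d_ijk grows.
w = {ijk: math.exp(-dist) for ijk, dist in d.items()}
total = sum(w.values())
w = {ijk: wi / total for ijk, wi in w.items()}

# E_A: weighted average of mu_A[i], where i is the first index of (i, j, k).
bands = len(mu_A[0])
E_A = [sum(w[ijk] * mu_A[ijk[0]][band] for ijk in w) for band in range(bands)]
print([round(v, 3) for v in E_A])  # closer to mu_A[0], whose distance was smaller
```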
- Specifically, an output unit (not shown) of the speech processing apparatus 100 produces output, such as a speech recognition result, based on E_A, which is the output information.
- The speech processing apparatus 100 may output E_A as the estimated value of the desired speech as it is, may perform an inverse Fourier transform to convert it into a waveform signal and output that signal, or may perform speech recognition on the waveform signal and output the result as text.
- the standard pattern storage unit 103 stores a set of four standard patterns: a standard pattern of a mixed signal spectrum, a standard pattern of a desired speech spectrum, a standard pattern of an undesired speech spectrum, and a standard pattern of a noise spectrum.
- other storage methods are possible.
- The standard pattern storage unit 103 may store only sets of a standard pattern of the mixed signal spectrum and a standard pattern of the desired speech spectrum (the set A of standard patterns shown in FIG. 2). This is because only these two standard patterns are necessary when the desired speech spectrum is actually extracted from the input signal; the other two standard patterns are required only when the standard patterns of the mixed signal are created in advance. Also in this case, I × J × K sets of standard patterns are stored in the standard pattern storage unit 103. Specifically, J × K standard patterns of the mixed signal spectrum are stored for each standard pattern of the desired speech spectrum.
- Alternatively, sets consisting of the standard pattern of the undesired speech spectrum and the standard pattern of the noise spectrum (the set B of standard patterns shown in FIG. 2) may be stored in the standard pattern storage unit 103. This is because the standard pattern of the mixed signal spectrum can be calculated using Equation 1 when the desired speech is extracted from the input signal.
- As described above, the speech processing apparatus of this embodiment can accurately acquire only the speech of a desired speaker from a signal in which the speech of the desired speaker, speech from other speakers, and noise are mixed.
- Embodiment 2. FIG. 5 is a block diagram showing the configuration of the second embodiment of the speech processing apparatus according to the present invention.
- In addition to the components of the first embodiment, the speech processing apparatus 200 of the present embodiment includes a mixed signal spectrum expected value calculation unit 201, a desired speech enhancement filter calculation unit 202, and a desired speech spectrum estimation unit 203.
- the mixed signal spectrum expected value calculation unit 201 calculates the expected value of the mixed signal spectrum using the distance calculated by the distance calculation unit 102.
- the desired speech enhancement filter calculation unit 202 calculates a desired speech enhancement filter using the expected value of the desired speech spectrum and the expected value of the mixed signal spectrum.
- the desired speech spectrum estimation unit 203 calculates an estimated value of the desired speech spectrum as output information based on the spectrum vector of the mixed signal acquired by the spectrum conversion unit 101 and the desired speech enhancement filter.
- the mixed signal spectrum expected value calculation unit 201, the desired speech enhancement filter calculation unit 202, and the desired speech spectrum estimation unit 203 are realized by a CPU provided in the speech processing apparatus 200.
- the spectrum conversion unit 101, the distance calculation unit 102, the standard pattern storage unit 103, and the desired speech spectrum expected value calculation unit 104 are the same as those in the first embodiment, and thus the description thereof is omitted.
- FIG. 6 is a flowchart showing the operation of the second embodiment of the speech processing apparatus.
- Since the processing of steps S201 to S203 is the same as the processing of steps S101 to S103 of the first embodiment, description thereof is omitted.
- the mixed signal spectrum expected value calculation unit 201 calculates an expected value of the mixed signal spectrum (step S204).
- The expected value E_Y of the mixed signal spectrum is calculated as a weighted average of the standard patterns μ_Yijk of the mixed signal spectrum.
- E Y is a vector quantity.
- the calculation method in step S204 is the same as that in step S103 of the first embodiment. Equation 3 is an example of a formula for E Y.
- the desired speech enhancement filter calculation unit 202 calculates a desired speech enhancement filter (step S205).
- Equation 4 is an example of a calculation formula for the desired speech enhancement filter W. The division in Equation 4 is performed for each vector element. W is a vector quantity.
- The desired speech spectrum estimation unit 203 calculates an estimated value of the desired speech spectrum (step S206). The estimated value F_A of the desired speech spectrum is calculated by multiplying the mixed signal spectrum vector Y by the desired speech enhancement filter W for each element.
- F_A is a vector quantity. Equation 5 is an example of a formula for F_A.
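Steps S204 to S206 can be sketched for one frame as follows. Equations 3 to 5 are not reproduced here, so the Wiener-like filter W = E_A / E_Y is an assumed reading of Equation 4 (the text only states that the division is element-wise); all numeric values are hypothetical:

```python
E_A = [0.9, 1.7, 1.6, 0.7]   # expected desired-speech spectrum (hypothetical)
E_Y = [1.6, 2.6, 2.7, 1.7]   # expected mixed-signal spectrum (hypothetical)
Y   = [1.5, 2.5, 2.5, 1.5]   # observed mixed-signal spectrum vector

eps = 1e-12  # guard against division by zero in silent bands (an added safeguard)
W = [a / max(y, eps) for a, y in zip(E_A, E_Y)]   # assumed form of Equation 4
F_A = [wi * y for wi, y in zip(W, Y)]             # Equation 5: element-wise product

print([round(v, 3) for v in W])
print([round(v, 3) for v in F_A])
```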
- The speech processing apparatus 200 may output the estimated value of the desired speech spectrum, which is the output information, as it is, may perform an inverse Fourier transform to convert it back into a waveform signal and output that signal, or may perform speech recognition on the waveform signal and output the result as text.
- The desired speech enhancement filter calculation unit 202 may, for example, average or smooth the values of the desired speech spectrum expected value E_A and the mixed signal spectrum expected value E_Y over a plurality of unit times.
- Alternatively, the value of W calculated using Equation 4 may be averaged or smoothed over a plurality of unit times.
- In this embodiment, averaging and smoothing are performed at the stage of desired speech enhancement filter calculation, which makes the apparatus robust against instantaneous estimation errors. Therefore, in addition to the effects of the first embodiment, a more reliable output can be obtained.
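The smoothing mentioned above is not specified concretely; a first-order recursive (exponential) average over unit times, with a hypothetical coefficient alpha, is one simple realization:

```python
def smooth_filters(frames, alpha):
    """Exponentially smooth per-frame filter vectors W over unit times."""
    smoothed, state = [], None
    for w in frames:
        if state is None:
            state = list(w)
        else:
            state = [alpha * s + (1 - alpha) * x for s, x in zip(state, w)]
        smoothed.append(list(state))
    return smoothed

# A filter sequence with an instantaneous outlier in the middle frame.
frames = [[0.6, 0.6], [0.0, 1.0], [0.6, 0.6]]
print(smooth_filters(frames, alpha=0.7)[1])  # outlier pulled toward the previous value
```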
- Embodiment 3. FIG. 7 is a block diagram showing the configuration of the third embodiment of the speech processing apparatus according to the present invention.
- In addition to the components of the first embodiment, the speech processing apparatus 300 of the present embodiment includes an undesired speech spectrum expected value calculation unit 301 and a noise spectrum expected value calculation unit 302. Furthermore, the speech processing apparatus 300 includes a desired speech enhancement filter calculation unit 202 and a desired speech spectrum estimation unit 203.
- the undesired speech spectrum expected value calculation unit 301 calculates the expected value of the undesired speech spectrum using the distance calculated by the distance calculation unit 102.
- the expected noise spectrum value calculation unit 302 uses the distance calculated by the distance calculation unit 102 to calculate the expected value of the noise spectrum.
- the undesired speech spectrum expected value calculation unit 301 and the noise spectrum expected value calculation unit 302 are realized by a CPU included in the speech processing apparatus 300.
- the spectrum conversion unit 101, the distance calculation unit 102, the standard pattern storage unit 103, and the desired speech spectrum expected value calculation unit 104 are the same as those in the first embodiment, and thus the description thereof is omitted.
- desired speech enhancement filter calculation unit 202 and the desired speech spectrum estimation unit 203 are the same as those in the second embodiment, and thus description thereof is omitted.
- The desired speech enhancement filter calculation unit 202 of this embodiment calculates a desired speech enhancement filter using the expected value of the desired speech spectrum, the expected value of the undesired speech spectrum, and the expected value of the noise spectrum.
- FIG. 8 is a flowchart showing the operation of the third embodiment of the speech processing apparatus.
- Since the processing of steps S301 to S303 is the same as the processing of steps S101 to S103 of the first embodiment, description thereof is omitted.
- the undesired speech spectrum expected value calculation unit 301 calculates the expected value of the undesired speech spectrum (step S304).
- The expected value E_B of the undesired speech spectrum is calculated as a weighted average of the standard patterns μ_Bj of the undesired speech spectrum.
- E B is a vector quantity.
- the calculation method in step S304 is the same as that in step S103 in the first embodiment. Equation 6 is an example of a formula for E B.
- the expected noise spectrum value calculation unit 302 calculates the expected value of the noise spectrum (step S305).
- The expected value E_N of the noise spectrum is calculated as a weighted average of the standard patterns μ_Nk of the noise spectrum.
- E N is a vector quantity.
- the calculation method in step S305 is the same as that in step S103 of the first embodiment. Equation 7 is an example of a formula for E N.
- the desired speech enhancement filter calculation unit 202 calculates a desired speech enhancement filter (step S306).
- Equation 8 is an example of a calculation formula for the desired speech enhancement filter W. The division in Equation 8 is performed for each element of the vector. W is a vector quantity.
- The desired speech spectrum estimation unit 203 calculates an estimated value of the desired speech spectrum (step S307). The estimated value F_A of the desired speech spectrum is calculated by multiplying the mixed signal spectrum vector Y by the desired speech enhancement filter W for each element. F_A is a vector quantity.
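Equation 8 is not reproduced in this text. Given that W is built from E_A, E_B, and E_N with element-wise division, a natural (assumed) reading is the spectral-gain form W = E_A / (E_A + E_B + E_N), sketched below with hypothetical values:

```python
E_A = [0.9, 1.7, 1.6, 0.7]   # expected desired-speech spectrum (hypothetical)
E_B = [0.5, 0.4, 0.8, 0.6]   # expected undesired-speech spectrum (hypothetical)
E_N = [0.2, 0.2, 0.3, 0.3]   # expected noise spectrum (hypothetical)
Y   = [1.5, 2.5, 2.5, 1.5]   # observed mixed-signal spectrum vector

W = [a / (a + b + n) for a, b, n in zip(E_A, E_B, E_N)]  # assumed form of Equation 8
F_A = [wi * y for wi, y in zip(W, Y)]                    # step S307: element-wise product
print([round(v, 3) for v in W])  # each gain lies between 0 and 1
```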
- The speech processing apparatus 300 may output the estimated value of the desired speech spectrum, which is the output information, as it is, may perform an inverse Fourier transform to convert it back into a waveform signal and output that signal, or may perform speech recognition on the waveform signal and output the result as text.
- The desired speech enhancement filter calculation unit 202 may, for example, average or smooth the values of the desired speech spectrum expected value E_A, the undesired speech spectrum expected value E_B, and the noise spectrum expected value E_N over a plurality of unit times.
- the value of W calculated using Equation 8 may be averaged or smoothed over a plurality of unit times.
- Embodiment 4. FIG. 9 is a block diagram showing the configuration of the fourth embodiment of the speech processing apparatus according to the present invention.
- In addition to the components of the first embodiment, the speech processing apparatus 400 of the present embodiment includes a reliability calculation unit 401, a mask setting unit 402, and a mask applying unit 403.
- the reliability calculation unit 401 calculates the reliability using the distance calculated by the distance calculation unit 102.
- entropy is given as an example of reliability.
- Equation 10 is an example of a formula for calculating entropy H.
- Entropy is an indicator of randomness, and the greater the value, the lower the reliability.
- Entropy is a scalar quantity.
- the mask setting unit 402 sets a mask value according to the reliability.
- the mask setting unit 402 sets a large mask value when the reliability is low, and sets a small mask value when the reliability is high.
- the value of the entropy H itself or an amount obtained by multiplying the entropy H by a positive multiplier can be used as the mask M.
- a mask is a vector quantity.
- the mask values may all be the same value for each frequency band, or may be different values.
- the mask setting unit 402 sets all the mask values to the same value.
- Equation 11 shows an example of the relationship between the mask M and the entropy H.
- The mask applying unit 403 calculates, as output information, an estimated value F_A′ of the desired speech spectrum to which the mask has been applied.
- Equation 12 is an example of a formula for calculating F_A′.
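Equations 10 to 12 are not reproduced in this text, so the formulas below are assumptions consistent with the description: H is the Shannon entropy of the normalized pattern weights, the mask M broadcasts c * H to every band, and the mask is applied by subtraction with a floor at zero. The weights, spectrum, and multiplier c are all hypothetical:

```python
import math

w = [0.85, 0.1, 0.05]           # normalized weights derived from the distances d_ijk
F_A = [0.84, 1.63, 1.48, 0.62]  # estimated desired-speech spectrum (hypothetical)
c = 0.1                         # positive multiplier for the mask (assumption)

H = -sum(p * math.log(p) for p in w if p > 0)            # assumed Equation 10: entropy
M = [c * H] * len(F_A)                                   # assumed Equation 11: same value per band
F_A_masked = [max(f - m, 0.0) for f, m in zip(F_A, M)]   # assumed Equation 12

print(round(H, 3))  # concentrated weights -> low entropy -> high reliability, small mask
```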
- the reliability calculation unit 401, the mask setting unit 402, and the mask applying unit 403 are realized by a CPU provided in the voice processing device 400.
- According to this embodiment, when the reliability of the estimated value of the desired speech spectrum is low and other signals may therefore be mixed in, a large mask is applied, so the adverse effects of those other signals can be suppressed.
- the reliability calculation unit 401 calculates the reliability using a distance from a standard pattern prepared in advance. Therefore, it can be expected that the reliability in the present embodiment is a more accurate value than the reliability used in the method described in Patent Document 2.
- Embodiment 5. Since the configuration of the fifth embodiment is the same as that of the first embodiment, description thereof is omitted. Note that the configuration of the fifth embodiment may be the same as that of any of the other embodiments.
- FIG. 10 is an explanatory diagram illustrating an example of information stored in the standard pattern storage unit 103 according to the fifth embodiment.
- In the fifth embodiment, the standard pattern storage unit 103 stores a probability density function (Probability Density Function) instead of an average value as the standard pattern of the mixed signal spectrum, as shown in FIG. 10. Specifically, the standard pattern storage unit 103 stores, as the standard pattern of the mixed signal spectrum, a probability density function representing the frequency of appearance of the spectrum for each phoneme unit or each unit obtained by clustering similar spectrum shapes.
- As the probability density function, for example, a Gaussian distribution function or a Gaussian mixture distribution function can be used.
- a variance value is used in addition to the average value.
- In this case, the distance calculation unit 102 uses, as the distance measure, the Bhattacharyya distance, the Mahalanobis distance, the likelihood, the log likelihood, or the like.
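For a diagonal-covariance Gaussian standard pattern (mean mu and variance var per band), the variance-aware distance measures mentioned above can be sketched as follows; the squared Mahalanobis distance and Gaussian log-likelihood shown here use hypothetical values:

```python
import math

def mahalanobis_sq(y, mu, var):
    """Squared Mahalanobis distance for a diagonal covariance."""
    return sum((yi - mi) ** 2 / vi for yi, mi, vi in zip(y, mu, var))

def log_likelihood(y, mu, var):
    """Log-likelihood of y under a diagonal-covariance Gaussian."""
    d2 = mahalanobis_sq(y, mu, var)
    log_det = sum(math.log(vi) for vi in var)
    return -0.5 * (d2 + log_det + len(y) * math.log(2 * math.pi))

Y = [1.5, 2.5]                      # observed spectrum vector (hypothetical)
mu, var = [1.6, 2.6], [0.04, 0.09]  # standard pattern with per-band variances
print(round(mahalanobis_sq(Y, mu, var), 3))  # 0.361
```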
- Since the standard pattern of the mixed signal spectrum can be expressed using higher-order statistics such as the variance in addition to the average, the desired speech can be estimated with higher accuracy.
- Embodiment 6. The sixth embodiment of the present invention will be described below.
- Since the configuration of the sixth embodiment is the same as that of the first embodiment, description thereof is omitted. Note that the configuration of the sixth embodiment may be the same as that of any of the other embodiments.
- In the sixth embodiment, the way of storing the sets of standard patterns in the standard pattern storage unit 103 is different.
- In the first embodiment, the standard pattern storage unit 103 stores the I × J × K standard patterns of the mixed signal spectrum created by combining one each of three types of standard patterns: I standard patterns of the desired speech spectrum, J standard patterns of the undesired speech spectrum, and K standard patterns of the noise spectrum.
- In the sixth embodiment, the standard pattern storage unit 103 stores L standard patterns of the mixed signal spectrum created by clustering and merging spectrum patterns having similar shapes among the I × J × K standard patterns of the mixed signal spectrum.
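The clustering step is not specified in detail; one simple (assumed) realization is a greedy merge of patterns whose shapes lie within a threshold of an existing cluster centroid, reducing I × J × K patterns to L representatives. The threshold and pattern values are hypothetical:

```python
import math

def euclidean(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def cluster_patterns(patterns, threshold):
    """Greedily merge similar patterns; returns the L cluster-mean patterns."""
    clusters = []  # each cluster holds [element-wise sum, member count]
    for p in patterns:
        for c in clusters:
            centroid = [s / c[1] for s in c[0]]
            if euclidean(p, centroid) < threshold:
                c[0] = [s + x for s, x in zip(c[0], p)]
                c[1] += 1
                break
        else:
            clusters.append([list(p), 1])
    return [[s / c[1] for s in c[0]] for c in clusters]

patterns = [[1.0, 2.0], [1.1, 2.1], [3.0, 0.5], [2.9, 0.6]]
reduced = cluster_patterns(patterns, threshold=0.5)
print(len(reduced))  # L = 2 representative patterns
```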
- By merging standard patterns, the number of standard patterns stored in the standard pattern storage unit 103 can be adjusted according to the computational resources, memory resources, and the like of the speech processing apparatus.
- Embodiment 7. The seventh embodiment of the present invention will be described below.
- Since the configuration of the seventh embodiment is the same as that of the first embodiment, description thereof is omitted. Note that the configuration of the seventh embodiment may be the same as that of any of the other embodiments.
- In the seventh embodiment, the spectrum conversion unit 101 receives the signal output from the speech processing apparatus as input again. The speech processing apparatus then repeats the processes of steps S101 to S103.
- The speech processing apparatus may use a small number of standard patterns in the first pass and increase the number of standard patterns as the processing is repeated.
- the estimated value of the desired speech is an average of coarse spectral shapes when the number of standard patterns is small, and an average of spectral shapes in finer units when the number of standard patterns is large.
- Embodiment 8. The eighth embodiment of the present invention will be described below.
- Since the configuration of the eighth embodiment is the same as that of the first embodiment, description thereof is omitted. Note that the configuration of the eighth embodiment may be the same as that of any of the other embodiments.
- the spectrum conversion unit 101 inputs signals from a plurality of microphones. Then, the sound processing apparatus performs the processes of steps S101 to S103 in parallel for each of the plurality of signals input by the spectrum conversion unit 101.
- The speech processing apparatus may output the estimated value having the highest reliability among the obtained estimated values, may output the average of the obtained estimated values, or may output the maximum of the obtained estimated values.
- the spectrum conversion unit 101 may input a signal after performing the sound source separation process.
- The present invention may be applied to a system composed of a plurality of devices or to a single device. Furthermore, the present invention can also be applied to a case where a control program (speech processing program) that realizes the functions of the embodiments is supplied directly or remotely to a system or apparatus. Therefore, a control program installed in a computer to realize the functions of the present invention, a medium storing the control program, and a WWW (World Wide Web) server from which the control program is downloaded are also included in the scope of the present invention.
- FIG. 11 is a block diagram showing the main part of the speech processing apparatus of the present invention.
- the speech processing apparatus of the present invention includes a distance calculation unit 102 and a desired speech spectrum expected value calculation unit 104.
- The distance calculation unit 102 calculates, for each of a plurality of standard patterns of mixed signal spectra obtained by combining a plurality of standard patterns of desired speech spectra, a plurality of standard patterns of undesired speech spectra, and a plurality of standard patterns of noise spectra, the distance between that standard pattern and the input signal.
- the desired speech spectrum expected value calculation unit 104 calculates the expected value of the desired speech spectrum in the input signal using the distance calculated by the distance calculation unit 102.
- the spectrum conversion unit 101 may receive, as its input signal, the signal that the speech processing apparatus itself has output.
- the number of standard patterns may be kept small in the first pass and increased each time the processing is repeated. This makes it possible to refine the estimate of the desired speech spectrum step by step, from a coarse estimate to a fine, highly accurate one.
- the spectrum conversion unit 101 receives a plurality of input signals, and each unit of the speech processing apparatus processes the plurality of input signals in parallel.
- 100, 200, 300, 400 Speech processing apparatus; 101 Spectrum conversion unit; 102 Distance calculation unit; 103 Standard pattern storage unit; 104 Desired speech spectrum expected value calculation unit; 201 Mixed signal spectrum expected value calculation unit; 202 Desired speech enhancement filter calculation unit; 203 Desired speech spectrum estimation unit; 301 Undesired speech spectrum expected value calculation unit; 302 Noise spectrum expected value calculation unit; 401 Reliability calculation unit; 402 Mask setting unit; 403 Mask application unit
Abstract
Description
Embodiment 1. A first embodiment of the present invention will be described below with reference to the drawings.
Embodiment 2. A second embodiment of the present invention will be described below with reference to the drawings.
Embodiment 3. A third embodiment of the present invention will be described below with reference to the drawings.
Embodiment 4. A fourth embodiment of the present invention will be described below with reference to the drawings.
Embodiment 5. A fifth embodiment of the present invention will be described below with reference to the drawings.
Embodiment 6. A sixth embodiment of the present invention will be described below.
Embodiment 7. A seventh embodiment of the present invention will be described below.
Embodiment 8. An eighth embodiment of the present invention will be described below.
Claims (10)
- A speech processing apparatus comprising:
a distance calculation unit which calculates, for each of a plurality of mixed-signal spectrum standard patterns obtained by combining a plurality of desired speech spectrum standard patterns, a plurality of undesired speech spectrum standard patterns, and a plurality of noise spectrum standard patterns, the distance between that pattern and an input signal; and
a desired speech spectrum expected value calculation unit which calculates an expected value of the desired speech spectrum in the input signal using the distances.
- The speech processing apparatus according to claim 1, wherein the desired speech spectrum expected value calculation unit calculates the expected value of the desired speech spectrum in the input signal by taking a weighted average of the plurality of desired speech spectrum standard patterns using the distances calculated by the distance calculation unit.
- The speech processing apparatus according to claim 1 or claim 2, further comprising:
a spectrum conversion unit which receives, as the input signal, a mixed signal in which desired speech uttered by a desired speaker is mixed with signals other than the desired speech, and converts the input signal into a spectrum for each unit time; and
a standard pattern storage unit which stores a plurality of pairs of a mixed-signal spectrum standard pattern and the corresponding desired speech spectrum standard pattern,
wherein the distance calculation unit calculates the distances between the spectrum of the input signal and the plurality of mixed-signal spectrum standard patterns, and
the desired speech spectrum expected value calculation unit takes a weighted average of the corresponding desired speech spectrum standard patterns using the distances and calculates the expected value of the desired speech spectrum as output information.
- The speech processing apparatus according to any one of claims 1 to 3, further comprising:
a mixed-signal spectrum expected value calculation unit which takes a weighted average of the mixed-signal spectrum standard patterns using the distances calculated by the distance calculation unit and calculates an expected value of the mixed-signal spectrum;
a desired speech spectrum enhancement filter calculation unit which calculates a desired speech spectrum enhancement filter from the expected value of the desired speech spectrum and the expected value of the mixed-signal spectrum; and
a desired speech spectrum estimation unit which calculates an estimated value of the desired speech spectrum as output information from the desired speech spectrum enhancement filter and the spectrum of the input signal.
- The speech processing apparatus according to claim 3, further comprising an undesired speech spectrum expected value calculation unit and a noise spectrum expected value calculation unit,
wherein the standard pattern storage unit stores, in addition to the pairs of a mixed-signal spectrum standard pattern and the corresponding desired speech spectrum standard pattern, a plurality of sets each including an undesired speech spectrum standard pattern and a noise spectrum standard pattern,
the undesired speech spectrum expected value calculation unit takes a weighted average of the undesired speech spectrum standard patterns using the distances calculated by the distance calculation unit and calculates an expected value of the undesired speech spectrum,
the noise spectrum expected value calculation unit takes a weighted average of the noise spectrum standard patterns using the distances calculated by the distance calculation unit and calculates an expected value of the noise spectrum,
the desired speech spectrum enhancement filter calculation unit calculates a desired speech spectrum enhancement filter from the expected value of the desired speech spectrum, the expected value of the undesired speech spectrum, and the expected value of the noise spectrum, and
the desired speech spectrum estimation unit calculates an estimated value of the desired speech spectrum as output information from the desired speech spectrum enhancement filter and the input signal.
- The speech processing apparatus according to any one of claims 1 to 5, further comprising:
a reliability calculation unit which calculates a reliability using the distances calculated by the distance calculation unit;
a mask setting unit which sets a mask using the reliability; and
a mask application unit which adds the mask to the output information to generate new output information.
- The speech processing apparatus according to any one of claims 1 to 6, wherein the standard pattern storage unit stores, as each standard pattern, an average spectrum calculated per phoneme or per cluster of spectra with similar shapes.
- The speech processing apparatus according to any one of claims 1 to 6, wherein the standard pattern storage unit stores, as each standard pattern, a probability density function representing the frequency of occurrence of spectra, obtained per phoneme or per cluster of spectra with similar shapes.
- A speech processing method comprising:
calculating, for each of a plurality of mixed-signal spectrum standard patterns obtained by combining a plurality of desired speech spectrum standard patterns, a plurality of undesired speech spectrum standard patterns, and a plurality of noise spectrum standard patterns, the distance between that pattern and an input signal; and
calculating, as output information, an expected value of the desired speech spectrum in the input signal using the distances.
- A speech processing program for causing a computer to execute:
a process of calculating, for each of a plurality of mixed-signal spectrum standard patterns obtained by combining a plurality of desired speech spectrum standard patterns, a plurality of undesired speech spectrum standard patterns, and a plurality of noise spectrum standard patterns, the distance between that pattern and an input signal; and
a process of calculating, as output information, an expected value of the desired speech spectrum in the input signal using the distances.
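The enhancement-filter claims describe building a filter from the expected values of the desired speech, undesired speech, and noise spectra, then applying it to the input. The claims do not give the filter formula; a Wiener-style gain is one common choice and is used here purely as an illustrative assumption, as is the small `eps` regularizer.

```python
import numpy as np

def desired_speech_estimate(input_spec, e_desired, e_undesired, e_noise, eps=1e-12):
    """Estimate the desired speech spectrum from the three expected spectra.

    A Wiener-style gain (assumed, not specified by the claims):
    gain = E[desired] / (E[desired] + E[undesired] + E[noise])
    """
    gain = e_desired / (e_desired + e_undesired + e_noise + eps)
    return gain * input_spec  # enhancement filter applied to the input spectrum
```

When the undesired-speech and noise expectations are zero the gain is one and the input passes through unchanged; as they grow, the corresponding frequency bins are attenuated.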
Applications Claiming Priority (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
JP2012081586 | 2012-03-30 | ||
JP2012-081586 | 2012-03-30 |
Publications (1)
Publication Number | Publication Date |
---|---|
WO2013145578A1 true WO2013145578A1 (en) | 2013-10-03 |
Family
ID=49258898
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
PCT/JP2013/001448 WO2013145578A1 (en) | 2012-03-30 | 2013-03-07 | Audio processing device, audio processing method, and audio processing program |
Country Status (2)
Country | Link |
---|---|
JP (1) | JPWO2013145578A1 (en) |
WO (1) | WO2013145578A1 (en) |
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
JP2018031967A (en) * | 2016-08-26 | 2018-03-01 | 日本電信電話株式会社 | Sound source enhancement device, and method and program for the same |
Citations (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
JP2003005785A (en) * | 2001-06-26 | 2003-01-08 | National Institute Of Advanced Industrial & Technology | Separating method and separating device for sound source |
JP2003177783A (en) * | 2001-12-11 | 2003-06-27 | Toshiba Corp | Voice recognition device, voice recognition system, and voice recognition program |
JP2005077731A (en) * | 2003-08-29 | 2005-03-24 | Univ Waseda | Sound source separating method and system therefor, and speech recognizing method and system therefor |
JP2005084071A (en) * | 2003-09-04 | 2005-03-31 | Kddi Corp | Speech recognizing apparatus |
JP2006510060A (en) * | 2002-12-13 | 2006-03-23 | 三菱電機株式会社 | Method and system for separating a plurality of acoustic signals generated by a plurality of acoustic sources |
JP2007033920A (en) * | 2005-07-27 | 2007-02-08 | Nec Corp | System, method, and program for noise suppression |
2013
- 2013-03-07 JP JP2014507376A patent/JPWO2013145578A1/en active Pending
- 2013-03-07 WO PCT/JP2013/001448 patent/WO2013145578A1/en active Application Filing
Also Published As
Publication number | Publication date |
---|---|
JPWO2013145578A1 (en) | 2015-12-10 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
Luo et al. | Conv-tasnet: Surpassing ideal time–frequency magnitude masking for speech separation | |
Bahmaninezhad et al. | A comprehensive study of speech separation: spectrogram vs waveform separation | |
Kwon et al. | NMF-based speech enhancement using bases update | |
US20110125496A1 (en) | Speech recognition device, speech recognition method, and program | |
US8680386B2 (en) | Signal processing device, signal processing method, and program | |
Grais et al. | Raw multi-channel audio source separation using multi-resolution convolutional auto-encoders | |
JP5375400B2 (en) | Audio processing apparatus, audio processing method and program | |
US20190172442A1 (en) | Method for forming the excitation signal for a glottal pulse model based parametric speech synthesis system | |
CN103559888A (en) | Speech enhancement method based on non-negative low-rank and sparse matrix decomposition principle | |
JP6195548B2 (en) | Signal analysis apparatus, method, and program | |
JP7176627B2 (en) | Signal extraction system, signal extraction learning method and signal extraction learning program | |
EP2912660A1 (en) | Method for determining a dictionary of base components from an audio signal | |
Sharma et al. | Study of robust feature extraction techniques for speech recognition system | |
JPWO2015129760A1 (en) | Signal processing apparatus, method and program | |
CN110797033A (en) | Artificial intelligence-based voice recognition method and related equipment thereof | |
JPWO2014168022A1 (en) | Signal processing apparatus, signal processing method, and signal processing program | |
CN108369803B (en) | Method for forming an excitation signal for a parametric speech synthesis system based on a glottal pulse model | |
JP6348427B2 (en) | Noise removal apparatus and noise removal program | |
Wiem et al. | Unsupervised single channel speech separation based on optimized subspace separation | |
JP5974901B2 (en) | Sound segment classification device, sound segment classification method, and sound segment classification program | |
JP5994639B2 (en) | Sound section detection device, sound section detection method, and sound section detection program | |
JP2020060757A (en) | Speaker recognition device, speaker recognition method, and program | |
JP5726790B2 (en) | Sound source separation device, sound source separation method, and program | |
KR101802444B1 (en) | Robust speech recognition apparatus and method for Bayesian feature enhancement using independent vector analysis and reverberation parameter reestimation | |
WO2013145578A1 (en) | Audio processing device, audio processing method, and audio processing program |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
121 | Ep: the epo has been informed by wipo that ep was designated in this application |
Ref document number: 13767645 Country of ref document: EP Kind code of ref document: A1 |
|
ENP | Entry into the national phase |
Ref document number: 2014507376 Country of ref document: JP Kind code of ref document: A |
|
NENP | Non-entry into the national phase |
Ref country code: DE |
|
122 | Ep: pct application non-entry in european phase |
Ref document number: 13767645 Country of ref document: EP Kind code of ref document: A1 |
|