WO2013145578A1 - Audio processing device, audio processing method, and audio processing program - Google Patents

Audio processing device, audio processing method, and audio processing program

Info

Publication number
WO2013145578A1
Authority
WO
WIPO (PCT)
Prior art keywords
spectrum
speech
expected value
standard pattern
distance
Prior art date
Application number
PCT/JP2013/001448
Other languages
French (fr)
Japanese (ja)
Inventor
Takayuki Arakawa
Original Assignee
NEC Corporation
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by NEC Corporation
Publication of WO2013145578A1


Classifications

    • G: PHYSICS
      • G10: MUSICAL INSTRUMENTS; ACOUSTICS
        • G10L: SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
          • G10L 15/00: Speech recognition
            • G10L 15/20: Speech recognition techniques specially adapted for robustness in adverse environments, e.g. in noise, of stress induced speech
          • G10L 21/00: Processing of the speech or voice signal to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
            • G10L 21/02: Speech enhancement, e.g. noise reduction or echo cancellation
              • G10L 21/0208: Noise filtering
                • G10L 21/0216: Noise filtering characterised by the method used for estimating noise
                  • G10L 21/0232: Processing in the frequency domain
                • G10L 21/0264: Noise filtering characterised by the type of parameter measurement, e.g. correlation techniques, zero crossing techniques or predictive techniques
              • G10L 21/0272: Voice signal separating
                • G10L 21/028: Voice signal separating using properties of sound source
                • G10L 21/0308: Voice signal separating characterised by the type of parameter measurement, e.g. correlation techniques, zero crossing techniques or predictive techniques

Definitions

  • The present invention relates to a speech processing apparatus, a speech processing method, and a speech processing program that extract only a desired speech in a noisy environment or in an environment where multiple people are talking.
  • The method described in Patent Document 1 cannot be expected to be effective when the performance of sound source separation is insufficient, for example because of non-directional environmental noise, reverberation, or more sound sources than microphones.
  • In the method described in Patent Document 2, when the performance of sound source separation is insufficient, a large-valued mask is applied and the desired speech is erased as well.
  • In the method described in Patent Document 3, when the accuracy of the noise estimate is low and the provisionally estimated speech differs greatly from the original speech, correction using the standard pattern may fail.
  • An object of the present invention is therefore to provide a speech processing apparatus, a speech processing method, and a speech processing program capable of accurately acquiring only the desired speech from a signal in which the desired speech and other signals are mixed.
  • The speech processing apparatus includes a distance calculation unit that calculates, for each of a plurality of mixed signal spectrum standard patterns computed by combining a plurality of desired speech spectrum standard patterns, a plurality of undesired speech spectrum standard patterns, and a plurality of noise spectrum standard patterns, the distance to an input signal; and a desired speech spectrum expected value calculation unit that calculates an expected value of the desired speech spectrum in the input signal using the distances.
  • The speech processing method calculates, for each of a plurality of mixed signal spectrum standard patterns computed by combining a plurality of desired speech spectrum standard patterns, a plurality of undesired speech spectrum standard patterns, and a plurality of noise spectrum standard patterns, the distance to an input signal, and calculates, as output information, an expected value of the desired speech spectrum in the input signal using the distances.
  • The speech processing program causes a computer to execute a process of calculating, for each of a plurality of mixed signal spectrum standard patterns computed by combining a plurality of desired speech spectrum standard patterns, a plurality of undesired speech spectrum standard patterns, and a plurality of noise spectrum standard patterns, the distance to an input signal, and a process of calculating, as output information, an expected value of the desired speech spectrum in the input signal using the distances.
  • According to the present invention, only the desired speech can be accurately acquired from a signal in which the desired speech and other signals are mixed. A speech recognition result for the desired speech alone can also be obtained.
  • Embodiment 1. A first embodiment of the present invention will be described below with reference to the drawings.
  • FIG. 1 is a block diagram showing a configuration of a first embodiment of a speech processing apparatus according to the present invention.
  • the speech processing apparatus 100 includes a spectrum conversion unit 101, a distance calculation unit 102, a standard pattern storage unit 103, and a desired speech spectrum expected value calculation unit 104.
  • the speech processing apparatus 100 receives a mixed signal in which desired speech and other signals are mixed, and outputs an estimated value of the desired speech.
  • Signals other than the desired speech include, for example, speech uttered by speakers other than the desired speaker and non-speech noise.
  • a speech spectrum based on a desired speaker's utterance is referred to as a desired speech spectrum.
  • a speech spectrum generated by speech other than the desired speaker is referred to as an undesired speech spectrum.
  • the spectrum of a noise signal other than speech is called a noise spectrum.
  • the spectrum conversion unit 101 inputs a mixed signal in which desired speech and other signals are mixed.
  • the spectrum conversion unit 101 acquires a spectrum vector of the mixed signal.
  • the distance calculation unit 102 calculates the distance between the spectrum vector of the mixed signal acquired by the spectrum conversion unit 101 and the standard pattern of the mixed signal spectrum stored in the standard pattern storage unit 103.
  • the standard pattern storage unit 103 stores the standard pattern of the mixed signal spectrum.
  • FIG. 2 is an explanatory diagram illustrating an example of information stored in the standard pattern storage unit 103.
  • The standard pattern storage unit 103 stores sets each consisting of a standard pattern of the mixed signal spectrum, a standard pattern of the desired speech spectrum, a standard pattern of the undesired speech spectrum, and a standard pattern of the noise spectrum.
  • the standard pattern of the spectrum expresses an average shape for a phoneme unit such as “A”, “I”, “U” or a unit obtained by clustering similar spectral shapes.
  • μ denotes a vector whose elements are the average spectral magnitude in each frequency band.
  • the subscripts Y, A, B, and N indicate a mixed signal spectrum, a desired speech spectrum, an undesired speech spectrum, and a noise spectrum, respectively.
  • the subscripts i, j, and k are serial numbers representing phoneme units or clustering units, respectively.
  • μ_Yijk represents the mixed signal spectrum obtained by mixing the i-th desired speech spectrum, the j-th undesired speech spectrum, and the k-th noise spectrum.
  • FIG. 3 is an explanatory diagram showing the relationship between μ_Yijk and μ_Ai, μ_Bj, μ_Nk. In each graph, the horizontal axis represents the frequency band and the vertical axis the magnitude of the spectrum.
  • μ_Yijk and μ_Ai, μ_Bj, μ_Nk have the relationship shown in Equation 1 for each frequency band, as reconstructed below.
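  The equation image itself is not reproduced on this page; reconstructed from the mixing relationship shown in FIG. 3 (per-band additivity of the spectra is the assumption):

    $\mu_{Yijk} = \mu_{Ai} + \mu_{Bj} + \mu_{Nk}$ … (Equation 1)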
  • With I desired speech spectrum standard patterns, J undesired speech spectrum standard patterns, and K noise spectrum standard patterns, the number of mixed signal spectrum standard patterns is I × J × K. That is, the standard pattern storage unit 103 stores these I × J × K combinations.
  • standard patterns are prepared in advance by using statistical processing and machine learning techniques, and are stored in the standard pattern storage unit 103.
  • the standard pattern of the undesired voice spectrum is created based on voice data of an unspecified speaker, for example.
  • the standard pattern of the desired speech spectrum is created based on, for example, speech data of a specific speaker.
  • the desired speech spectrum expected value calculation unit 104 calculates the expected value of the desired speech spectrum as output information using the distance calculated by the distance calculation unit 102.
  • The spectrum conversion unit 101, the distance calculation unit 102, and the desired speech spectrum expected value calculation unit 104 are realized by a CPU (Central Processing Unit) of the speech processing apparatus 100. The standard pattern storage unit 103 is realized by a storage device, such as a memory, of the speech processing apparatus 100.
  • FIG. 4 is a flowchart showing the operation of the first embodiment of the speech processing apparatus.
  • The spectrum conversion unit 101 performs a short-time spectrum transform on the mixed signal input to the speech processing apparatus 100.
  • The spectrum conversion unit 101 obtains a spectrum for each frequency band every unit time, for example every 10 milliseconds (step S101).
  • The speech processing apparatus 100 performs the processing of steps S102 to S103 on a vector whose elements are the per-band spectra acquired every unit time, or subband spectra obtained by bundling several frequency bands. Let Y be the spectrum vector of the mixed signal. A sketch of this step follows.
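  As a concrete illustration of step S101, here is a minimal NumPy sketch; the non-overlapping 10 ms frames, the Hann window, and the use of magnitude spectra are assumptions, since the text does not fix these details.

```python
import numpy as np

def frame_spectra(x, sr, frame_ms=10):
    """Step S101 (sketch): short-time magnitude spectrum per ~10 ms frame.

    x  : 1-D array holding the mixed waveform signal
    sr : sampling rate in Hz
    Returns an array of shape (num_frames, num_bands); each row is one
    spectrum vector Y for the subsequent steps S102-S103.
    """
    n = max(1, int(sr * frame_ms / 1000))            # samples per unit time
    frames = x[: (len(x) // n) * n].reshape(-1, n)   # non-overlapping framing (assumption)
    return np.abs(np.fft.rfft(frames * np.hanning(n), axis=1))
```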
  • The distance calculation unit 102 calculates the distance d_ijk between the spectrum vector Y of the mixed signal and each standard pattern μ_Yijk of the mixed signal spectrum stored in the standard pattern storage unit 103 (step S102).
  • Distances are calculated for all I × J × K standard pattern sets.
  • the distance calculation unit 102 may obtain a general distance between vectors, for example, a Euclidean distance as the distance.
  • the distance calculation unit 102 may obtain the distance after converting the spectrum vector into a logarithmic spectrum vector. Further, the distance calculation unit 102 may obtain the distance after converting into a feature amount generally used in speech recognition such as a cepstrum.
  • The distance d_ijk is a scalar quantity.
  • The desired speech spectrum expected value calculation unit 104 then calculates the expected value of the desired speech spectrum (step S103).
  • The expected value E_A of the desired speech spectrum is calculated as a weighted average of the standard patterns μ_Ai of the desired speech spectrum.
  • E_A is a vector quantity. The weight is set small when the distance d_ijk is large and large when d_ijk is small. Equation 2 is an example of a formula for E_A; see the sketch below.
  • The speech processing apparatus 100, specifically an output unit of the apparatus (not shown), outputs a speech recognition result based on the output information E_A.
  • The speech processing apparatus 100 may output E_A as the estimate of the desired speech, may apply an inverse Fourier transform or the like to convert it into a waveform signal and output that, or may perform speech recognition on the waveform signal and output the result as text.
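  The sketch below ties steps S102-S103 together for one frame. The Euclidean distance is one of the measures the text allows; the exponential, normalized weights are an assumption (the text only requires the weight to shrink as d_ijk grows), and the pattern construction in the trailing comment follows Equation 1.

```python
import numpy as np

def expected_desired_spectrum(Y, mu_Y, mu_A_rows):
    """Steps S102-S103 (sketch): distances to all I*J*K mixed patterns, then E_A.

    Y         : (D,) mixed-signal spectrum vector for one frame
    mu_Y      : (P, D) standard patterns of the mixed signal spectrum, P = I*J*K
    mu_A_rows : (P, D) desired-speech pattern mu_Ai matching each mu_Y row
    """
    d = np.linalg.norm(mu_Y - Y, axis=1)   # d_ijk; Euclidean distance (one allowed choice)
    w = np.exp(d.min() - d)                # small distance -> large weight (assumption;
                                           # shifted by d.min() for numerical stability)
    w /= w.sum()                           # normalize so E_A is a weighted average
    return w @ mu_A_rows                   # E_A = sum_ijk w_ijk * mu_A[i]

# Building the P = I*J*K mixed patterns from Equation 1 (mu_Y = mu_A + mu_B + mu_N):
#   ii, jj, kk = np.meshgrid(range(I), range(J), range(K), indexing="ij")
#   mu_Y = (mu_A[ii] + mu_B[jj] + mu_N[kk]).reshape(-1, mu_A.shape[1])
#   mu_A_rows = mu_A[ii.ravel()]
```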
  • An example was described here in which the standard pattern storage unit 103 stores sets of four standard patterns: a standard pattern of the mixed signal spectrum, a standard pattern of the desired speech spectrum, a standard pattern of the undesired speech spectrum, and a standard pattern of the noise spectrum. Other ways of storing them, however, are possible.
  • For example, the standard pattern storage unit 103 may store only pairs of a mixed signal spectrum standard pattern and a desired speech spectrum standard pattern (set A of standard patterns shown in FIG. 2). This is because only these two standard patterns are needed when the desired speech spectrum is actually extracted from the input signal; the other two are needed only when the mixed signal standard patterns are created in advance. In this case as well, I × J × K standard pattern sets are stored in the standard pattern storage unit 103; specifically, J × K mixed signal spectrum standard patterns are stored per desired speech spectrum standard pattern.
  • Alternatively, only sets of a desired speech spectrum standard pattern, an undesired speech spectrum standard pattern, and a noise spectrum standard pattern (standard pattern set B shown in FIG. 2) may be stored in the standard pattern storage unit 103. This is because the standard pattern of the mixed signal spectrum can be calculated using Equation 1 when the desired speech is extracted from the input signal.
  • As described above, according to this embodiment, the speech processing apparatus can accurately acquire only the desired speaker's speech from a signal in which the desired speaker's speech, other speakers' speech, and noise are mixed.
  • FIG. 5 is a block diagram showing the configuration of the second embodiment of the speech processing apparatus according to the present invention.
  • In addition to the configuration of the speech processing apparatus 100 of the first embodiment, the speech processing apparatus 200 of this embodiment includes a mixed signal spectrum expected value calculation unit 201, a desired speech enhancement filter calculation unit 202, and a desired speech spectrum estimation unit 203.
  • the mixed signal spectrum expected value calculation unit 201 calculates the expected value of the mixed signal spectrum using the distance calculated by the distance calculation unit 102.
  • the desired speech enhancement filter calculation unit 202 calculates a desired speech enhancement filter using the expected value of the desired speech spectrum and the expected value of the mixed signal spectrum.
  • the desired speech spectrum estimation unit 203 calculates an estimated value of the desired speech spectrum as output information based on the spectrum vector of the mixed signal acquired by the spectrum conversion unit 101 and the desired speech enhancement filter.
  • the mixed signal spectrum expected value calculation unit 201, the desired speech enhancement filter calculation unit 202, and the desired speech spectrum estimation unit 203 are realized by a CPU provided in the speech processing apparatus 200.
  • the spectrum conversion unit 101, the distance calculation unit 102, the standard pattern storage unit 103, and the desired speech spectrum expected value calculation unit 104 are the same as those in the first embodiment, and thus the description thereof is omitted.
  • FIG. 6 is a flowchart showing the operation of the second embodiment of the speech processing apparatus.
  • The processing of steps S201 to S203 is the same as that of steps S101 to S103 of the first embodiment, so its description is omitted.
  • the mixed signal spectrum expected value calculation unit 201 calculates an expected value of the mixed signal spectrum (step S204).
  • The expected value E_Y of the mixed signal spectrum is calculated as a weighted average of the standard patterns μ_Yijk of the mixed signal spectrum.
  • E_Y is a vector quantity.
  • The calculation method in step S204 is the same as that in step S103 of the first embodiment. Equation 3 is an example of a formula for E_Y.
  • the desired speech enhancement filter calculation unit 202 calculates a desired speech enhancement filter (step S205).
  • Equation 4 is an example of a formula for the desired speech enhancement filter W. The division in Equation 4 is performed for each vector element. W is a vector quantity.
  • Next, the desired speech spectrum estimation unit 203 calculates the estimate of the desired speech spectrum (step S206). The estimate F_A of the desired speech spectrum is obtained by multiplying the mixed signal spectrum vector Y by the desired speech enhancement filter W element by element.
  • F_A is a vector quantity. Equation 5 is an example of a formula for F_A; a short sketch of steps S204 to S206 follows.
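  A sketch of steps S204-S206 under the same assumptions as the step S102-S103 sketch above (w are the normalized weights computed there); the small eps guard against empty bands is an addition not present in the text.

```python
import numpy as np

def enhance_frame(Y, w, mu_A_rows, mu_Y, eps=1e-12):
    """Steps S204-S206 (sketch): expected spectra, the filter W, the estimate F_A."""
    E_A = w @ mu_A_rows        # Equation 2: weighted average of desired-speech patterns
    E_Y = w @ mu_Y             # Equation 3: weighted average of mixed-signal patterns
    W = E_A / (E_Y + eps)      # Equation 4: element-wise division (eps is an added guard)
    return W * Y               # Equation 5: element-wise product
```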
  • The speech processing apparatus 200 may output the estimate of the desired speech spectrum, which is the output information, as it is, may apply an inverse Fourier transform to convert it back into a waveform signal and output that, or may perform speech recognition on the waveform signal and output the result as text.
  • The desired speech enhancement filter calculation unit 202 may, for example, average or smooth the values of the desired speech spectrum expected value E_A and the mixed signal spectrum expected value E_Y over multiple unit times.
  • Alternatively, the value of W calculated using Equation 4 may be averaged or smoothed over multiple unit times.
  • In this embodiment, averaging or smoothing is performed at the desired speech enhancement filter calculation stage, which makes the estimation robust against instantaneous errors. Therefore, in addition to the effects of the first embodiment, a more reliable output can be obtained.
  • FIG. 7 is a block diagram showing the configuration of the third embodiment of the speech processing apparatus according to the present invention.
  • In addition to the configuration of the first embodiment, the speech processing apparatus 300 of this embodiment includes an undesired speech spectrum expected value calculation unit 301 and a noise spectrum expected value calculation unit 302. Furthermore, the speech processing apparatus 300 includes the desired speech enhancement filter calculation unit 202 and the desired speech spectrum estimation unit 203.
  • the undesired speech spectrum expected value calculation unit 301 calculates the expected value of the undesired speech spectrum using the distance calculated by the distance calculation unit 102.
  • the expected noise spectrum value calculation unit 302 uses the distance calculated by the distance calculation unit 102 to calculate the expected value of the noise spectrum.
  • the undesired speech spectrum expected value calculation unit 301 and the noise spectrum expected value calculation unit 302 are realized by a CPU included in the speech processing apparatus 300.
  • the spectrum conversion unit 101, the distance calculation unit 102, the standard pattern storage unit 103, and the desired speech spectrum expected value calculation unit 104 are the same as those in the first embodiment, and thus the description thereof is omitted.
  • The desired speech enhancement filter calculation unit 202 and the desired speech spectrum estimation unit 203 are the same as in the second embodiment, so their description is omitted.
  • In this embodiment, however, the desired speech enhancement filter calculation unit 202 calculates the desired speech enhancement filter using the expected value of the desired speech spectrum, the expected value of the undesired speech spectrum, and the expected value of the noise spectrum.
  • FIG. 8 is a flowchart showing the operation of the third embodiment of the speech processing apparatus.
  • The processing of steps S301 to S303 is the same as that of steps S101 to S103 of the first embodiment, so its description is omitted.
  • The undesired speech spectrum expected value calculation unit 301 calculates the expected value of the undesired speech spectrum (step S304).
  • The expected value E_B of the undesired speech spectrum is calculated as a weighted average of the standard patterns μ_Bj of the undesired speech spectrum.
  • E_B is a vector quantity.
  • The calculation method in step S304 is the same as that in step S103 of the first embodiment. Equation 6 is an example of a formula for E_B.
  • The noise spectrum expected value calculation unit 302 calculates the expected value of the noise spectrum (step S305).
  • The expected value E_N of the noise spectrum is calculated as a weighted average of the standard patterns μ_Nk of the noise spectrum.
  • E_N is a vector quantity.
  • The calculation method in step S305 is the same as that in step S103 of the first embodiment. Equation 7 is an example of a formula for E_N.
  • Next, the desired speech enhancement filter calculation unit 202 calculates the desired speech enhancement filter (step S306).
  • Equation 8 is an example of a formula for the desired speech enhancement filter W; a reconstruction is given below. The division in Equation 8 is performed for each element of the vector. W is a vector quantity.
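  Equations 6 to 8 are referenced above but not reproduced on this page. Forms consistent with the description, reusing the normalized weights w_ijk assumed for Equation 2, would be:

    $E_B = \sum_{i,j,k} w_{ijk}\,\mu_{Bj}$ … (Equation 6)
    $E_N = \sum_{i,j,k} w_{ijk}\,\mu_{Nk}$ … (Equation 7)
    $W = E_A / (E_A + E_B + E_N)$, element-wise … (Equation 8; the denominator stands in for the mixed signal spectrum per Equation 1)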
  • Next, the desired speech spectrum estimation unit 203 calculates the estimate of the desired speech spectrum (step S307). The estimate F_A is obtained by multiplying the mixed signal spectrum vector Y by the desired speech enhancement filter W element by element. F_A is a vector quantity.
  • The speech processing apparatus 300 may output the estimate of the desired speech spectrum, which is the output information, as it is, may apply an inverse Fourier transform to convert it back into a waveform signal and output that, or may perform speech recognition on the waveform signal and output the result as text.
  • The desired speech enhancement filter calculation unit 202 may, for example, average or smooth the values of the desired speech spectrum expected value E_A, the undesired speech spectrum expected value E_B, and the noise spectrum expected value E_N over multiple unit times.
  • Alternatively, the value of W calculated using Equation 8 may be averaged or smoothed over multiple unit times.
  • FIG. 9 is a block diagram showing the configuration of the fourth embodiment of the speech processing apparatus according to the present invention.
  • The speech processing apparatus 400 of this embodiment further includes a reliability calculation unit 401, a mask setting unit 402, and a mask application unit 403.
  • the reliability calculation unit 401 calculates the reliability using the distance calculated by the distance calculation unit 102.
  • Entropy is given here as an example of such a reliability measure.
  • Equation 10 is an example of a formula for calculating entropy H.
  • Entropy is an indicator of randomness, and the greater the value, the lower the reliability.
  • Entropy is a scalar quantity.
  • the mask setting unit 402 sets a mask value according to the reliability.
  • the mask setting unit 402 sets a large mask value when the reliability is low, and sets a small mask value when the reliability is high.
  • the value of the entropy H itself or an amount obtained by multiplying the entropy H by a positive multiplier can be used as the mask M.
  • The mask M is a vector quantity.
  • The mask values may all be the same for each frequency band, or may differ between bands.
  • In this embodiment, the mask setting unit 402 sets all the mask values to the same value.
  • Expression 11 is an expression showing an example of the relationship between the mask M and the entropy H.
  • The mask application unit 403 calculates, as output information, the estimate F_A′ of the desired speech spectrum with the mask applied.
  • Equation 12 is an example of a formula for calculating F_A′; a reconstruction of Equations 10 to 12 is given below.
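  Equations 10 to 12 are likewise not reproduced on this page. A hedged reconstruction consistent with the description (normalizing the distances into probabilities p_ijk, a positive multiplier α, and a subtractive mask are all assumptions):

    $p_{ijk} = e^{-d_{ijk}} / \sum_{i',j',k'} e^{-d_{i'j'k'}}$, $H = -\sum_{i,j,k} p_{ijk} \log p_{ijk}$ … (Equation 10)
    $M = \alpha H$ with $\alpha > 0$ … (Equation 11)
    $F_A' = \max(F_A - M,\ 0)$ … (Equation 12, one plausible form of applying the mask)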
  • the reliability calculation unit 401, the mask setting unit 402, and the mask applying unit 403 are realized by a CPU provided in the voice processing device 400.
  • According to this embodiment, when the reliability of the estimate of the desired speech spectrum is low, meaning other signals may be mixed in, the mask is set large, so the adverse effects of the other signals can be suppressed.
  • the reliability calculation unit 401 calculates the reliability using a distance from a standard pattern prepared in advance. Therefore, it can be expected that the reliability in the present embodiment is a more accurate value than the reliability used in the method described in Patent Document 2.
  • Embodiment 5. Since the configuration of the fifth embodiment is the same as that of the first embodiment, its description is omitted. Note that the configuration of the fifth embodiment may instead be the same as that of any of the other embodiments.
  • FIG. 10 is an explanatory diagram illustrating an example of information stored in the standard pattern storage unit 103 according to the fifth embodiment.
  • In the fifth embodiment, the standard pattern storage unit 103 stores a probability density function instead of an average value as the standard pattern of the mixed signal spectrum, as shown in FIG. 10. Specifically, the standard pattern storage unit 103 stores, as the standard pattern of the mixed signal spectrum, a probability density function representing the frequency of appearance of the spectrum for each phoneme unit or each unit obtained by clustering similar spectral shapes.
  • As the probability density function, for example, a Gaussian distribution function or a Gaussian mixture distribution function can be used.
  • In this case, a variance value is used in addition to the average value.
  • In this case, the distance calculation unit 102 can use measures such as the Bhattacharyya distance, the Mahalanobis distance, the likelihood, or the log-likelihood as the distance.
  • Since the standard pattern of the mixed signal spectrum can then be expressed using higher-order statistics such as the variance in addition to the average, the desired speech can be estimated with higher accuracy. A sketch follows.
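  As an illustration of the fifth embodiment, the sketch below scores a frame against one diagonal-Gaussian standard pattern via the negative log-likelihood, which can serve directly as the distance d_ijk; the diagonal covariance is an assumption (the text also permits Gaussian mixtures and the Bhattacharyya or Mahalanobis distances).

```python
import numpy as np

def neg_log_likelihood(Y, mean, var):
    """Embodiment 5 (sketch): distance of spectrum vector Y from a standard
    pattern stored as a diagonal Gaussian with per-band mean and variance."""
    return 0.5 * np.sum(np.log(2 * np.pi * var) + (Y - mean) ** 2 / var)
```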
  • Embodiment 6. The sixth embodiment of the present invention will be described below.
  • Since the configuration of the sixth embodiment is the same as that of the first embodiment, its description is omitted. Note that the configuration of the sixth embodiment may instead be the same as that of any of the other embodiments.
  • In the sixth embodiment, however, the way the standard pattern sets are stored in the standard pattern storage unit 103 differs.
  • In the first embodiment, the standard pattern storage unit 103 stores the I × J × K mixed signal spectrum standard patterns created by combining one pattern each from three types: I desired speech spectrum standard patterns, J undesired speech spectrum standard patterns, and K noise spectrum standard patterns.
  • In the sixth embodiment, the standard pattern storage unit 103 instead stores L mixed signal spectrum standard patterns created by clustering the I × J × K patterns by spectral shape and merging similar ones.
  • By merging standard patterns in this way, the number of standard patterns stored in the standard pattern storage unit 103 can be adapted to the computational resources, memory resources, and the like of the speech processing apparatus.
  • Embodiment 7. The seventh embodiment of the present invention will be described below.
  • Since the configuration of the seventh embodiment is the same as that of the first embodiment, its description is omitted. Note that the configuration of the seventh embodiment may instead be the same as that of any of the other embodiments.
  • In the seventh embodiment, the spectrum conversion unit 101 takes the signal output by the speech processing apparatus as its input again, and the speech processing apparatus repeats the processing of steps S101 to S103.
  • The speech processing apparatus may use a small number of standard patterns in the first pass and increase the number of standard patterns with each repetition.
  • The estimate of the desired speech is then an average of coarse spectral shapes when the number of standard patterns is small, and an average over finer units when the number is large.
  • Embodiment 8. The eighth embodiment of the present invention will be described below.
  • Since the configuration of the eighth embodiment is the same as that of the first embodiment, its description is omitted. Note that the configuration of the eighth embodiment may instead be the same as that of any of the other embodiments.
  • In the eighth embodiment, the spectrum conversion unit 101 receives signals from a plurality of microphones, and the speech processing apparatus performs the processing of steps S101 to S103 in parallel for each of the input signals.
  • The speech processing apparatus may output, among the obtained estimates, the one with the highest reliability, the average of the estimates, or their maximum value.
  • The spectrum conversion unit 101 may also take as input signals that have first undergone sound source separation.
  • The present invention may be applied to a system composed of a plurality of devices or to a single device. Furthermore, the present invention is also applicable where a control program (speech processing program) realizing the functions of the embodiments is supplied to a system or apparatus directly or remotely. Therefore, a control program installed in a computer to realize the functions of the present invention, a medium storing that control program, and a WWW (World Wide Web) server from which the control program is downloaded are also included in the scope of the present invention.
  • FIG. 11 is a block diagram showing the main part of the speech processing apparatus of the present invention.
  • the speech processing apparatus of the present invention includes a distance calculation unit 102 and a desired speech spectrum expected value calculation unit 104.
  • The distance calculation unit 102 calculates, for each of the plurality of mixed signal spectrum standard patterns computed by combining the plurality of desired speech spectrum standard patterns, undesired speech spectrum standard patterns, and noise spectrum standard patterns, the distance to the input signal.
  • the desired speech spectrum expected value calculation unit 104 calculates the expected value of the desired speech spectrum in the input signal using the distance calculated by the distance calculation unit 102.
  • In one aspect, the spectrum conversion unit 101 takes the signal output by the speech processing apparatus as its input signal.
  • By using few standard patterns in the first pass and more with each repetition, the estimate of the desired speech spectrum can be refined from a coarse estimate to a fine, highly accurate one.
  • In another aspect, the spectrum conversion unit 101 receives a plurality of input signals, and each unit of the speech processing apparatus processes the plurality of input signals in parallel.
  • Reference signs: 100 speech processing apparatus; 101 spectrum conversion unit; 102 distance calculation unit; 103 standard pattern storage unit; 104 desired speech spectrum expected value calculation unit; 201 mixed signal spectrum expected value calculation unit; 202 desired speech enhancement filter calculation unit; 203 desired speech spectrum estimation unit; 301 undesired speech spectrum expected value calculation unit; 302 noise spectrum expected value calculation unit; 401 reliability calculation unit; 402 mask setting unit; 403 mask application unit

Abstract

Provided are an audio processing device, an audio processing method, and an audio processing program that can accurately acquire only the desired audio from a signal in which the desired audio and other signals are mixed. The audio processing device is provided with: a distance computation unit (102) that calculates, for each of a plurality of mixed signal spectrum standard patterns calculated by respectively combining a plurality of desired audio spectrum standard patterns, a plurality of undesired audio spectrum standard patterns, and a plurality of noise spectrum standard patterns, the distance to an input signal; and a desired audio spectrum expected value calculation unit (104) that calculates an expected value of the desired audio spectrum within the input signal using the distances calculated by the distance computation unit (102).

Description

Audio processing apparatus, audio processing method, and audio processing program

 The present invention relates to a speech processing apparatus, a speech processing method, and a speech processing program that extract only a desired speech in a noisy environment or in an environment where multiple people are talking.

 There are technologies for extracting only a desired speech in a noisy environment or in an environment where multiple people are talking. In this technical field, one method acquires signals from multiple sound sources with multiple microphones, performs sound source separation on the resulting mixed signal using independent component analysis, performs speech recognition on each separated signal, and determines whether the speech belongs to the assumed task using the reliability information obtained together with the speech recognition result (see, for example, Patent Document 1). Another method acquires signals from multiple sound sources with multiple microphones, performs sound source separation on the resulting mixed signal, and masks the separated signals using their separation reliability so that unerased residual components become inconspicuous (see, for example, Patent Document 2). Yet another method removes a noise estimate from the mixed signal to obtain a provisionally estimated speech and corrects the provisionally estimated speech using standard patterns (see, for example, Patent Document 3).

Patent Document 1: JP 2011-107603 A; Patent Document 2: JP 2010-49249 A; Patent Document 3: JP 2007-33920 A

 However, the method described in Patent Document 1 cannot be expected to be effective when the performance of sound source separation is insufficient, for example because of non-directional environmental noise, reverberation, or more sound sources than microphones. In the method described in Patent Document 2, when the performance of sound source separation is insufficient, a large-valued mask is applied and the desired speech is erased as well. In the method described in Patent Document 3, when the accuracy of the noise estimate is low and the provisionally estimated speech differs greatly from the original speech, correction using the standard patterns may fail.

 In view of the above, an object of the present invention is to provide a speech processing apparatus, a speech processing method, and a speech processing program capable of accurately acquiring only the desired speech from a signal in which the desired speech and other signals are mixed.

 The speech processing apparatus according to the present invention includes a distance calculation unit that calculates, for each of a plurality of mixed signal spectrum standard patterns computed by combining a plurality of desired speech spectrum standard patterns, a plurality of undesired speech spectrum standard patterns, and a plurality of noise spectrum standard patterns, the distance to an input signal; and a desired speech spectrum expected value calculation unit that calculates an expected value of the desired speech spectrum in the input signal using the distances.

 The speech processing method according to the present invention calculates, for each of a plurality of mixed signal spectrum standard patterns computed by combining a plurality of desired speech spectrum standard patterns, a plurality of undesired speech spectrum standard patterns, and a plurality of noise spectrum standard patterns, the distance to an input signal, and calculates, as output information, an expected value of the desired speech spectrum in the input signal using the distances.

 The speech processing program according to the present invention causes a computer to execute a process of calculating, for each of a plurality of mixed signal spectrum standard patterns computed by combining a plurality of desired speech spectrum standard patterns, a plurality of undesired speech spectrum standard patterns, and a plurality of noise spectrum standard patterns, the distance to an input signal, and a process of calculating, as output information, an expected value of the desired speech spectrum in the input signal using the distances.

 According to the present invention, only the desired speech can be accurately acquired from a signal in which the desired speech and other signals are mixed. A speech recognition result for the desired speech alone can also be obtained.

Brief description of the drawings: FIG. 1 is a block diagram showing the configuration of the first embodiment of the speech processing apparatus according to the present invention. FIG. 2 is an explanatory diagram showing an example of the information stored in the standard pattern storage unit. FIG. 3 is an explanatory diagram showing the relationship between μ_Yijk and μ_Ai, μ_Bj, μ_Nk. FIG. 4 is a flowchart showing the operation of the first embodiment of the speech processing apparatus. FIG. 5 is a block diagram showing the configuration of the second embodiment of the speech processing apparatus according to the present invention. FIG. 6 is a flowchart showing the operation of the second embodiment of the speech processing apparatus. FIG. 7 is a block diagram showing the configuration of the third embodiment of the speech processing apparatus according to the present invention. FIG. 8 is a flowchart showing the operation of the third embodiment of the speech processing apparatus. FIG. 9 is a block diagram showing the configuration of the fourth embodiment of the speech processing apparatus according to the present invention. FIG. 10 is an explanatory diagram showing an example of the information stored in the standard pattern storage unit in the fifth embodiment. FIG. 11 is a block diagram showing the main part of the speech processing apparatus according to the present invention.
Embodiment 1. A first embodiment of the present invention will be described below with reference to the drawings.

 FIG. 1 is a block diagram showing the configuration of the first embodiment of the speech processing apparatus according to the present invention.

 As shown in FIG. 1, the speech processing apparatus 100 includes a spectrum conversion unit 101, a distance calculation unit 102, a standard pattern storage unit 103, and a desired speech spectrum expected value calculation unit 104.

 The speech processing apparatus 100 receives as input a mixed signal in which the desired speech and other signals are mixed, and outputs an estimate of the desired speech.

 Signals other than the desired speech include, for example, speech uttered by speakers other than the desired speaker and non-speech noise. Hereinafter, the speech spectrum of the desired speaker's utterance is referred to as the desired speech spectrum, the speech spectrum of utterances by speakers other than the desired speaker as the undesired speech spectrum, and the spectrum of non-speech noise signals as the noise spectrum.

 The spectrum conversion unit 101 receives the mixed signal in which the desired speech and other signals are mixed, and acquires the spectrum vector of the mixed signal.

 The distance calculation unit 102 calculates the distance between the spectrum vector of the mixed signal acquired by the spectrum conversion unit 101 and each standard pattern of the mixed signal spectrum stored in the standard pattern storage unit 103.

 The standard pattern storage unit 103 stores the standard patterns of the mixed signal spectrum.

 FIG. 2 is an explanatory diagram showing an example of the information stored in the standard pattern storage unit 103. As shown in FIG. 2, the standard pattern storage unit 103 stores sets each consisting of a standard pattern of the mixed signal spectrum, a standard pattern of the desired speech spectrum, a standard pattern of the undesired speech spectrum, and a standard pattern of the noise spectrum.

 A spectrum standard pattern expresses the average shape for a phoneme unit such as "A", "I", or "U", or for a unit obtained by clustering similar spectral shapes. μ denotes a vector whose elements are the average spectral magnitude in each frequency band. The subscripts Y, A, B, and N indicate the mixed signal spectrum, the desired speech spectrum, the undesired speech spectrum, and the noise spectrum, respectively. The subscripts i, j, and k are serial numbers of phoneme or clustering units. μ_Yijk denotes the mixed signal spectrum obtained by mixing the i-th desired speech spectrum, the j-th undesired speech spectrum, and the k-th noise spectrum.

 FIG. 3 is an explanatory diagram showing the relationship between μ_Yijk and μ_Ai, μ_Bj, μ_Nk. In each graph in FIG. 3, the horizontal axis represents the frequency band and the vertical axis the magnitude of the spectrum. μ_Yijk and μ_Ai, μ_Bj, μ_Nk have the relationship shown in Equation 1 for each frequency band.
    $\mu_{Yijk} = \mu_{Ai} + \mu_{Bj} + \mu_{Nk}$ … (Equation 1; reconstructed, assuming per-band additivity of the spectra)
 When there are I standard patterns of the desired speech spectrum (i = 1, …, I), J standard patterns of the undesired speech spectrum (j = 1, …, J), and K standard patterns of the noise spectrum (k = 1, …, K), the number of standard patterns of the mixed signal spectrum is I × J × K. That is, the standard pattern storage unit 103 stores these I × J × K combinations.

 These standard patterns are prepared in advance using statistical processing or machine learning techniques and stored in the standard pattern storage unit 103. The standard patterns of the undesired speech spectrum are created, for example, from speech data of unspecified speakers. The standard patterns of the desired speech spectrum are created, for example, from speech data of a specific speaker.

 The desired speech spectrum expected value calculation unit 104 calculates, as output information, the expected value of the desired speech spectrum using the distances calculated by the distance calculation unit 102.

 The spectrum conversion unit 101, the distance calculation unit 102, and the desired speech spectrum expected value calculation unit 104 are realized by a CPU (Central Processing Unit) of the speech processing apparatus 100. The standard pattern storage unit 103 is realized by a storage device, such as a memory, of the speech processing apparatus 100.

 Next, the operation of this embodiment will be described.

 FIG. 4 is a flowchart showing the operation of the first embodiment of the speech processing apparatus.

 As shown in FIG. 4, the spectrum conversion unit 101 first applies a short-time spectrum transform to the mixed signal input to the speech processing apparatus 100, obtaining a spectrum for each frequency band every unit time, for example every 10 milliseconds (step S101). In this embodiment, the speech processing apparatus 100 performs the processing of steps S102 to S103 on a vector whose elements are the per-band spectra acquired every unit time, or subband spectra obtained by bundling several frequency bands. Let Y be the spectrum vector of the mixed signal.

 Next, the distance calculation unit 102 calculates the distance d_ijk between the spectrum vector Y of the mixed signal and each standard pattern μ_Yijk of the mixed signal spectrum stored in the standard pattern storage unit 103 (step S102). Distances are calculated for all I × J × K standard pattern sets. The distance calculation unit 102 may use a common vector distance such as the Euclidean distance, may convert the spectrum vectors into logarithmic spectrum vectors before computing the distance, or may convert them into features commonly used in speech recognition, such as cepstra. The distance d_ijk is a scalar quantity.

 Next, the desired speech spectrum expected value calculation unit 104 calculates the expected value of the desired speech spectrum (step S103). The expected value E_A of the desired speech spectrum is calculated as a weighted average of the standard patterns μ_Ai of the desired speech spectrum; E_A is a vector quantity. The weight is set small when the distance d_ijk is large and large when d_ijk is small. Equation 2 is an example of a formula for E_A.
    $E_A = \sum_{i,j,k} w_{ijk}\,\mu_{Ai}$, with normalized weights that shrink as the distance grows, e.g. $w_{ijk} = \exp(-d_{ijk}) / \sum_{i',j',k'} \exp(-d_{i'j'k'})$ … (Equation 2; the exponential weighting is one representative choice, as the original equation image is not reproduced)
 音声処理装置100、具体的には、音声処理装置100が備える出力部(図示せず)が、出力情報であるEにもとづいて、音声認識結果を出力する。このとき、音声処理装置100は、Eを所望音声の推定値として出力してもよいし、逆フーリエ変換などを行い、波形信号(wave signal)に変換して出力してもよいし、波形信号に対して音声認識(speech recognition)を行い、テキストに変換して出力してもよい。 Speech processing apparatus 100, specifically, the output unit the speech processing apparatus 100 (not shown), based on the E A is the output information and outputs the voice recognition results. At this time, the audio processing apparatus 100 may output a E A as an estimate of the desired sound, perform such inverse Fourier transform, may be converted and output into waveform signals (wave Signal), waveform Speech recognition (speech recognition) may be performed on the signal, and the signal may be converted into text and output.
 ここでは、標準パタン格納部103に、混合信号スペクトルの標準パタンと、所望音声スペクトルの標準パタンと、所望外音声スペクトルの標準パタンと、雑音スペクトルの標準パタンの4つの標準パタンの組が格納される例を示したが、他の格納の仕方も考えられる。 Here, the standard pattern storage unit 103 stores a set of four standard patterns: a standard pattern of a mixed signal spectrum, a standard pattern of a desired speech spectrum, a standard pattern of an undesired speech spectrum, and a standard pattern of a noise spectrum. However, other storage methods are possible.
 例えば、標準パタン格納部103に、混合信号スペクトルの標準パタンと所望音声スペクトルの標準パタンとの組(図2に示す標準パタンの組A)のみが格納されていてもよい。これは、実際に入力信号から所望音声スペクトルを取り出すときには、この2つの標準パタンのみ必要となるためである。他2つの標準パタンは、事前に混合信号の標準パタンを作成するときのみ必要となる。この場合においても、I×J×K個の標準パタンの組が標準パタン格納部103に格納される。具体的には、1個の所望音声スペクトルの標準パタンに対し、J×K個の混合信号スペクトルの標準パタンの組が格納される。 For example, the standard pattern storage unit 103 may store only a set of a standard pattern of a mixed signal spectrum and a standard pattern of a desired speech spectrum (a set A of standard patterns shown in FIG. 2). This is because only these two standard patterns are necessary when the desired speech spectrum is actually extracted from the input signal. The other two standard patterns are required only when the standard pattern of the mixed signal is created in advance. Also in this case, a set of I × J × K standard patterns is stored in the standard pattern storage unit 103. Specifically, a set of standard patterns of J × K mixed signal spectra is stored for one standard pattern of desired speech spectrum.
 また、標準パタン格納部103に、所望音声スペクトルの標準パタンと所望外音声スペクトルの標準パタンと雑音スペクトルの標準パタンとの組(図2に示す標準パタンの組B)のみが格納されていてもよい。これは、入力信号から所望音声を取り出すときに、式1を用いて混合信号スペクトルの標準パタンを算出できるためである。 Further, even if only the set of the standard pattern of the desired speech spectrum, the standard pattern of the undesired speech spectrum and the standard pattern of the noise spectrum (standard pattern set B shown in FIG. 2) is stored in the standard pattern storage unit 103. Good. This is because the standard pattern of the mixed signal spectrum can be calculated using Equation 1 when extracting the desired sound from the input signal.
 以上に説明したように、本実施形態によれば、音声処理装置は、所望話者からの音声と、所望話者以外からの音声と、雑音が混在する信号から、所望話者からの音声のみを正確に取得することができる。 As described above, according to the present embodiment, the speech processing apparatus can only perform speech from a desired speaker from a speech mixed with speech from a desired speaker, speech from other than the desired speaker, and noise. Can be obtained accurately.
実施形態2.
 以下、本発明の第2の実施形態を図面を参照して説明する。
Embodiment 2. FIG.
Hereinafter, a second embodiment of the present invention will be described with reference to the drawings.
 図5は、本発明による音声処理装置の第2の実施形態の構成を示すブロック図である。 FIG. 5 is a block diagram showing the configuration of the second embodiment of the speech processing apparatus according to the present invention.
 図5に示すように、本実施形態の音声処理装置200は、第1の実施形態の音声処理装置100の構成に加え、混合信号スペクトル期待値算出部201と、所望音声強調フィルタ算出部202と、所望音声スペクトル推定部203とを含む。 As shown in FIG. 5, in addition to the configuration of the speech processing apparatus 100 of the first embodiment, the speech processing apparatus 200 of the present embodiment includes a mixed signal spectrum expected value calculation unit 201, a desired speech enhancement filter calculation unit 202, And a desired speech spectrum estimation unit 203.
The mixed signal spectrum expected value calculation unit 201 calculates the expected value of the mixed signal spectrum using the distances calculated by the distance calculation unit 102.
The desired speech enhancement filter calculation unit 202 calculates a desired speech enhancement filter using the expected value of the desired speech spectrum and the expected value of the mixed signal spectrum.
The desired speech spectrum estimation unit 203 calculates an estimate of the desired speech spectrum as output information, based on the spectrum vector of the mixed signal obtained by the spectrum conversion unit 101 and the desired speech enhancement filter.
The mixed signal spectrum expected value calculation unit 201, the desired speech enhancement filter calculation unit 202, and the desired speech spectrum estimation unit 203 are realized by a CPU provided in the speech processing apparatus 200.
The spectrum conversion unit 101, the distance calculation unit 102, the standard pattern storage unit 103, and the desired speech spectrum expected value calculation unit 104 are the same as in the first embodiment, so their description is omitted.
Next, the operation of this embodiment will be described.
FIG. 6 is a flowchart showing the operation of the second embodiment of the speech processing apparatus.
The processing of steps S201 to S203 is the same as that of steps S101 to S103 of the first embodiment, so its description is omitted.
After step S203, the mixed signal spectrum expected value calculation unit 201 calculates the expected value of the mixed signal spectrum (step S204). The expected value E_Y of the mixed signal spectrum is calculated as a weighted average of the mixed-signal-spectrum standard patterns μ_Yijk; E_Y is a vector quantity. The calculation method in step S204 is the same as in step S103 of the first embodiment. Equation 3 is an example of a formula for E_Y.
[Equation 3: E_Y = Σ_{i,j,k} w_{ijk} μ_{Y,ijk}, a weighted average in which the weights w_{ijk} are derived from the distances and sum to 1]
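A minimal numerical sketch of this weighted averaging follows. The softmax-style weighting w_ijk ∝ exp(−d_ijk) is one plausible choice and is not fixed by this section; the patent only requires weights derived from the distances.

```python
import numpy as np

def expected_spectrum(distances, patterns):
    """Weighted average of standard patterns (step S204 analogue).

    distances: (I, J, K) distances d_ijk between the input spectrum and
    each mixed-signal standard pattern; patterns: (I, J, K, F) standard
    patterns. The softmax weighting below is an illustrative assumption.
    """
    w = np.exp(-distances)
    w /= w.sum()                              # weights sum to 1
    return np.tensordot(w, patterns, axes=3)  # (F,) expected spectrum E_Y
```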
Next, the desired speech enhancement filter calculation unit 202 calculates the desired speech enhancement filter (step S205). Equation 4 is an example of a formula for the desired speech enhancement filter W; the division in Equation 4 is performed element by element, and W is a vector quantity.
[Equation 4: W = E_A / E_Y, where the division is element-wise]
Next, the desired speech spectrum estimation unit 203 calculates an estimate of the desired speech spectrum (step S206). The estimate F_A of the desired speech spectrum is obtained by multiplying the desired speech enhancement filter W and the mixed signal spectrum vector Y element by element; F_A is a vector quantity. Equation 5 is an example of a formula for F_A.
[Equation 5: F_A = W × Y, where “×” denotes element-wise multiplication]
Here, the symbol “×” in Equation 5 denotes element-wise multiplication of vectors. The speech processing apparatus 200 may output the estimated desired speech spectrum as-is, may apply an inverse Fourier transform to return it to a waveform signal before output, or may perform speech recognition on the waveform signal and output the result as text.
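A minimal sketch of steps S205 and S206, under the reconstructions of Equations 4 and 5 above, is as follows; the eps guard against division by zero is an added implementation detail, not part of the patent.

```python
import numpy as np

def enhance_frame(E_A, E_Y, Y, eps=1e-12):
    """Build the enhancement filter and apply it to one frame.

    E_A, E_Y: expected desired-speech and mixed-signal spectra (F,);
    Y: observed mixed-signal spectrum vector for the current frame (F,).
    """
    W = E_A / (E_Y + eps)   # Equation 4: element-wise division
    F_A = W * Y             # Equation 5: element-wise multiplication
    return F_A
```

The returned F_A can then be output directly, passed to an inverse Fourier transform, or fed to a recognizer, as described above.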
Although Equation 4 has been shown here as an example of the desired speech enhancement filter, the desired speech enhancement filter calculation unit 202 may, for example, average or smooth the desired speech spectrum expected value E_A and the mixed signal spectrum expected value E_Y over a plurality of unit times. Alternatively, the value of W calculated with Equation 4 may be averaged or smoothed over a plurality of unit times.
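One common way to realize such smoothing is first-order recursive averaging across frames, sketched below; the smoothing constant alpha is an illustrative assumption, since the patent does not fix the averaging method.

```python
def smooth(prev, current, alpha=0.9):
    """First-order recursive smoothing over unit times (frames).

    prev: smoothed value from the previous frame (use `current` for the
    first frame); current: this frame's E_A, E_Y, or W vector.
    alpha is an assumed smoothing constant in (0, 1).
    """
    return alpha * prev + (1.0 - alpha) * current
```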
As described above, in this embodiment the averaging or smoothing is performed at the stage of calculating the desired speech enhancement filter. This makes the apparatus robust against momentary estimation errors, so that, in addition to the effects of the first embodiment, a more reliable output can be obtained.
Embodiment 3.
Hereinafter, a third embodiment of the present invention will be described with reference to the drawings.
FIG. 7 is a block diagram showing the configuration of the third embodiment of the speech processing apparatus according to the present invention.
As shown in FIG. 7, the speech processing apparatus 300 of this embodiment includes, in addition to the configuration of the speech processing apparatus 100 of the first embodiment, an undesired speech spectrum expected value calculation unit 301 and a noise spectrum expected value calculation unit 302. The speech processing apparatus 300 further includes the desired speech enhancement filter calculation unit 202 and the desired speech spectrum estimation unit 203.
The undesired speech spectrum expected value calculation unit 301 calculates the expected value of the undesired speech spectrum using the distances calculated by the distance calculation unit 102.
The noise spectrum expected value calculation unit 302 calculates the expected value of the noise spectrum using the distances calculated by the distance calculation unit 102.
The undesired speech spectrum expected value calculation unit 301 and the noise spectrum expected value calculation unit 302 are realized by a CPU provided in the speech processing apparatus 300.
The spectrum conversion unit 101, the distance calculation unit 102, the standard pattern storage unit 103, and the desired speech spectrum expected value calculation unit 104 are the same as in the first embodiment, so their description is omitted. The desired speech enhancement filter calculation unit 202 and the desired speech spectrum estimation unit 203 are the same as in the second embodiment, so their description is also omitted.
In this embodiment, however, the desired speech enhancement filter calculation unit 202 calculates the desired speech enhancement filter using the expected value of the desired speech spectrum, the expected value of the undesired speech spectrum, and the expected value of the noise spectrum.
Next, the operation of this embodiment will be described.
FIG. 8 is a flowchart showing the operation of the third embodiment of the speech processing apparatus.
The processing of steps S301 to S303 is the same as that of steps S101 to S103 of the first embodiment, so its description is omitted.
After step S303, the undesired speech spectrum expected value calculation unit 301 calculates the expected value of the undesired speech spectrum (step S304). The expected value E_B of the undesired speech spectrum is calculated as a weighted average of the undesired-speech-spectrum standard patterns μ_Bj; E_B is a vector quantity. The calculation method in step S304 is the same as in step S103 of the first embodiment. Equation 6 is an example of a formula for E_B.
[Equation 6: E_B = Σ_{i,j,k} w_{ijk} μ_{B,j}, a weighted average with the same distance-derived weights as Equation 3]
Next, the noise spectrum expected value calculation unit 302 calculates the expected value of the noise spectrum (step S305). The expected value E_N of the noise spectrum is calculated as a weighted average of the noise-spectrum standard patterns μ_Nk; E_N is a vector quantity. The calculation method in step S305 is the same as in step S103 of the first embodiment. Equation 7 is an example of a formula for E_N.
[Equation 7: E_N = Σ_{i,j,k} w_{ijk} μ_{N,k}, a weighted average with the same distance-derived weights as Equation 3]
Next, the desired speech enhancement filter calculation unit 202 calculates the desired speech enhancement filter (step S306). Equation 8 is an example of a formula for the desired speech enhancement filter W; the division in Equation 8 is performed element by element, and W is a vector quantity.
[Equation 8: W = E_A / (E_A + E_B + E_N), with element-wise division]
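A sketch of this three-component filter follows. The Wiener-like ratio matches the bracketed reconstruction above but should be read as one plausible instance: the text fixes only the three inputs to the filter, and the eps guard is an added implementation detail.

```python
import numpy as np

def enhancement_filter(E_A, E_B, E_N, eps=1e-12):
    """Step S306: filter from desired, undesired, and noise expectations.

    All arguments are (F,) spectra. The Wiener-like ratio below is an
    assumed concrete form of Equation 8.
    """
    return E_A / (E_A + E_B + E_N + eps)   # element-wise division
```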
Next, the desired speech spectrum estimation unit 203 calculates an estimate of the desired speech spectrum (step S307). The estimate F_A of the desired speech spectrum is obtained by multiplying the desired speech enhancement filter W and the mixed signal spectrum vector Y element by element; F_A is a vector quantity. Equation 9 shows an example of this calculation.
[Equation 9: F_A = W × Y, where “×” denotes element-wise multiplication]
Here, the symbol “×” in Equation 9 denotes element-wise multiplication of vectors. The speech processing apparatus 300 may output the estimated desired speech spectrum as-is, may apply an inverse Fourier transform to return it to a waveform signal before output, or may perform speech recognition on the waveform signal and output the result as text.
Although Equation 8 has been shown here as an example of the desired speech enhancement filter, the desired speech enhancement filter calculation unit 202 may, for example, average or smooth the desired speech spectrum expected value E_A, the undesired speech spectrum expected value E_B, and the noise spectrum expected value E_N over a plurality of unit times. Alternatively, the value of W calculated with Equation 8 may be averaged or smoothed over a plurality of unit times.
As described above, in this embodiment, as in the second embodiment, the averaging or smoothing is performed at the stage of calculating the desired speech enhancement filter. This makes the apparatus robust against momentary estimation errors, so that, in addition to the effects of the first embodiment, a more reliable output can be obtained.
Embodiment 4.
Hereinafter, a fourth embodiment of the present invention will be described with reference to the drawings.
FIG. 9 is a block diagram showing the configuration of the fourth embodiment of the speech processing apparatus according to the present invention.
As shown in FIG. 9, the speech processing apparatus 400 of this embodiment includes, in addition to the configuration of the speech processing apparatus 100 of the first embodiment, a reliability calculation unit 401, a mask setting unit 402, and a mask applying unit 403.
The reliability calculation unit 401 calculates a reliability using the distances calculated by the distance calculation unit 102. Here, entropy is taken as an example of the reliability. Equation 10 is an example of a formula for the entropy H. Entropy is a measure of randomness: the larger its value, the lower the reliability. Entropy is a scalar quantity.
[Equation 10: H = −Σ_{i,j,k} w_{ijk} log w_{ijk}, the entropy of the normalized distance-derived weights]
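This reliability can be sketched as below; taking the entropy over the normalized pattern weights follows the bracketed reconstruction above and is an assumption, as is the eps guard against log(0).

```python
import numpy as np

def weight_entropy(w, eps=1e-12):
    """Equation 10 analogue: entropy of the pattern weights.

    w: normalized weights over the standard patterns (any shape,
    summing to 1). A large entropy means the weight mass is spread
    over many patterns, i.e. the estimate is unreliable.
    """
    w = w.ravel()
    return float(-np.sum(w * np.log(w + eps)))
```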
As another example of the reliability calculation, a method that uses the distance between the input mixed signal spectrum Y and the expected value E_Y of the mixed signal spectrum is conceivable. For an index calculated in this way as well, the larger the value, the lower the reliability.
The mask setting unit 402 sets the value of the mask according to the reliability: it sets a large mask value when the reliability is low, and a small mask value when the reliability is high. For example, the value of the entropy H itself, or the entropy H multiplied by a positive factor, can be used as the mask M. The mask is a vector quantity.
The mask values may all be the same for every frequency band, or they may differ per band. Here, the mask setting unit 402 is assumed to set all mask values to the same value. Equation 11 shows an example of the relationship between the mask M and the entropy H.
[Equation 11: M = αH for each element, with a positive constant α]
The mask applying unit 403 calculates, as output information, the estimate F_A′ of the desired speech spectrum with the mask added. Equation 12 is an example of a formula for F_A′.
[Equation 12: F_A′ = F_A + M, element-wise addition]
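The masking step can be sketched as follows. The flat (frequency-independent) mask follows the example in the text; the scale factor alpha is a tunable assumption.

```python
import numpy as np

def apply_reliability_mask(F_A, H, alpha=1.0):
    """Equations 11 and 12 analogue: set a flat mask from the entropy H
    and add it to the estimated spectrum.

    F_A: estimated desired speech spectrum (F,); H: scalar entropy.
    A larger H (lower reliability) yields a larger mask value.
    """
    M = alpha * H * np.ones_like(F_A)   # same mask value in every band
    return F_A + M                      # masked estimate F_A'
```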
The reliability calculation unit 401, the mask setting unit 402, and the mask applying unit 403 are realized by a CPU provided in the speech processing apparatus 400.
As described above, according to this embodiment, in addition to the effects of the first embodiment, the adverse influence of other signals can be suppressed when the reliability of the estimated desired speech spectrum is low and other signals are likely to be mixed in.
Furthermore, in this embodiment the reliability calculation unit 401 calculates the reliability using the distances to standard patterns prepared in advance. The reliability in this embodiment can therefore be expected to be more accurate than the reliability used in the method described in Patent Document 2.
Embodiment 5.
Hereinafter, a fifth embodiment of the present invention will be described with reference to the drawings.
The configuration of the fifth embodiment is the same as that of the first embodiment, so its description is omitted. The configuration of the fifth embodiment may also be the same as that of any of the other embodiments.
FIG. 10 is an explanatory diagram showing an example of the information stored in the standard pattern storage unit 103 in the fifth embodiment.
In this embodiment, as shown in FIG. 10, the standard pattern storage unit 103 stores, as the standard pattern of the mixed signal spectrum, not an average value but a probability density function. Specifically, the standard pattern storage unit 103 stores a probability density function representing the frequency of appearance of the spectrum, obtained for each phoneme unit or for each unit formed by clustering spectra of similar shape.
Examples of the probability density function include a Gaussian distribution function and a Gaussian mixture distribution function. When a Gaussian distribution function is used, a variance value is used in addition to the average value.
Corresponding to the replacement of the average value with a probability density function as the mixed-signal-spectrum standard pattern, the distance calculation unit 102 uses, as the distance measure, the Bhattacharyya distance, the Mahalanobis distance, the likelihood, the log likelihood, or the like.
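For instance, with Gaussian standard patterns the negative log likelihood can serve as the distance, as in the following sketch; the diagonal-covariance restriction is an assumption made for compactness.

```python
import numpy as np

def neg_log_likelihood(x, mean, var):
    """Distance between an input spectrum and a Gaussian standard pattern.

    x, mean, var: (F,) arrays; var holds the per-band variances of a
    diagonal-covariance Gaussian. Smaller values mean the input is
    closer to (more likely under) the pattern.
    """
    return 0.5 * np.sum(np.log(2.0 * np.pi * var) + (x - mean) ** 2 / var)
```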
According to this embodiment, the mixed-signal-spectrum standard patterns can be represented using higher-order statistics such as the variance in addition to the average, so a more accurate estimate of the desired speech can be expected.
Embodiment 6.
Hereinafter, a sixth embodiment of the present invention will be described.
The configuration of the sixth embodiment is the same as that of the first embodiment, so its description is omitted. The configuration of the sixth embodiment may also be the same as that of any of the other embodiments.
This embodiment differs in how the sets of standard patterns are stored in the standard pattern storage unit 103.
In the first embodiment, as shown in FIG. 2, the standard pattern storage unit 103 stores I × J × K mixed-signal-spectrum standard patterns created by combining, one from each of the three kinds, I desired-speech-spectrum standard patterns, J undesired-speech-spectrum standard patterns, and K noise-spectrum standard patterns.
In contrast, in this embodiment the standard pattern storage unit 103 stores L mixed-signal-spectrum standard patterns created by clustering the I × J × K mixed-signal-spectrum standard patterns into groups with similar spectral shapes and merging each group.
When the mixed-signal-spectrum standard patterns are merged, the corresponding desired-speech-spectrum, undesired-speech-spectrum, and noise-spectrum standard patterns are merged as well. Equation 13 shows an example of such merging.
[Equation 13: each merged standard pattern is the average of its members, e.g. μ_Y(L) = (μ_Y(i,j,k) + μ_Y(i′,j′,k′)) / 2, μ_A(L) = (μ_{A,i} + μ_{A,i′}) / 2, μ_B(L) = (μ_{B,j} + μ_{B,j′}) / 2, μ_N(L) = (μ_{N,k} + μ_{N,k′}) / 2]
For example, suppose that the two mixed-signal-spectrum standard patterns μ_Y(i=1, j=2, k=1) and μ_Y(i=1, j=7, k=4) are merged into μ_Y(L=10). In this case the corresponding desired-speech-spectrum, undesired-speech-spectrum, and noise-spectrum standard patterns are each calculated as the average of the merged members, as shown in Equation 13.
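A sketch of this merging follows. k-means over the flattened mixed-signal patterns is one possible realization of "clustering similar spectral shapes"; the patent does not fix the clustering algorithm.

```python
import numpy as np
from sklearn.cluster import KMeans

def merge_patterns(mu_Y, mu_A, mu_B, mu_N, L):
    """Cluster I*J*K mixed patterns into L merged pattern sets.

    mu_Y: (I, J, K, F); mu_A: (I, F); mu_B: (J, F); mu_N: (K, F).
    Returns four (L, F) arrays; k-means on the mixed-signal spectra is
    an assumed concrete choice.
    """
    I, J, K, F = mu_Y.shape
    flatY = mu_Y.reshape(-1, F)
    # Expand the component patterns so row r of each array corresponds
    # to the same (i, j, k) combination as row r of flatY.
    flatA = np.repeat(mu_A, J * K, axis=0)
    flatB = np.tile(np.repeat(mu_B, K, axis=0), (I, 1))
    flatN = np.tile(mu_N, (I * J, 1))
    labels = KMeans(n_clusters=L, n_init=10).fit_predict(flatY)
    merged = lambda flat: np.stack([flat[labels == l].mean(axis=0)
                                    for l in range(L)])
    return merged(flatY), merged(flatA), merged(flatB), merged(flatN)
```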
In this embodiment, the number of standard patterns stored in the standard pattern storage unit 103 is changed by merging standard patterns. The number of standard patterns can therefore be adjusted according to the computational and memory resources of the speech processing apparatus.
Embodiment 7.
Hereinafter, a seventh embodiment of the present invention will be described.
The configuration of the seventh embodiment is the same as that of the first embodiment, so its description is omitted. The configuration of the seventh embodiment may also be the same as that of any of the other embodiments.
In this embodiment, the spectrum conversion unit 101 takes the signal output by the speech processing apparatus as its input again, and the speech processing apparatus repeats the processing of steps S101 to S103.
When repeating the processing, the speech processing apparatus may use a small number of standard patterns in the first pass and increase the number of standard patterns with each repetition. In that case, the estimate of the desired speech is an average over coarse spectral shapes when the number of standard patterns is small, and an average over finer-grained spectral shapes when it is large.
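Schematically, this coarse-to-fine repetition might look as follows; `estimate_desired_spectrum` is a hypothetical stand-in for one pass of steps S101 to S103, and the pattern counts in the comment are illustrative.

```python
def refine(signal, pattern_sets, estimate_desired_spectrum):
    """Coarse-to-fine repetition of the estimation pass.

    pattern_sets: standard-pattern collections ordered from few (coarse)
    to many (fine) patterns. The output of each pass is fed back in as
    the next pass's input signal.
    """
    for patterns in pattern_sets:      # e.g. 8, 64, then 512 patterns
        signal = estimate_desired_spectrum(signal, patterns)
    return signal
```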
According to this embodiment, the accuracy of the desired speech spectrum estimation can be improved from a coarse estimate toward a finer, more accurate one.
Embodiment 8.
Hereinafter, an eighth embodiment of the present invention will be described.
The configuration of the eighth embodiment is the same as that of the first embodiment, so its description is omitted. The configuration of the eighth embodiment may also be the same as that of any of the other embodiments.
In this embodiment, the spectrum conversion unit 101 receives signals from a plurality of microphones, and the speech processing apparatus performs the processing of steps S101 to S103 in parallel on each of the input signals.
According to this embodiment, a plurality of estimates of the desired speech spectrum can be obtained.
The speech processing apparatus may output the estimate with the highest reliability among the obtained estimates, the average of the obtained estimates, or their element-wise maximum.
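The three combination rules just listed can be sketched as follows; using the entropy of each pass's weights as the (inverse) reliability is an assumption consistent with the fourth embodiment.

```python
import numpy as np

def combine_estimates(estimates, entropies, mode="best"):
    """Combine per-microphone desired-speech estimates.

    estimates: (M, F) spectra from M parallel passes; entropies: (M,)
    entropy of each pass's weights (lower entropy = higher reliability).
    mode selects one of the three options described in the text.
    """
    if mode == "best":                       # most reliable estimate
        return estimates[int(np.argmin(entropies))]
    if mode == "mean":                       # average of the estimates
        return estimates.mean(axis=0)
    if mode == "max":                        # element-wise maximum
        return estimates.max(axis=0)
    raise ValueError(f"unknown mode: {mode}")
```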
The spectrum conversion unit 101 may also take as input signals that have already been processed by sound source separation.
Although the embodiments of the present invention have been described in detail above, the components in each embodiment are merely examples and are not intended to limit the technical scope of the present invention to them. A system or apparatus that combines in any way the separate features included in the respective embodiments is therefore also within the scope of the present invention.
The present invention may be applied to a system composed of a plurality of devices or to a single apparatus. Furthermore, the present invention is also applicable when a control program (speech processing program) realizing the functions of the embodiments is supplied to a system or apparatus directly or remotely. Accordingly, a control program installed in a computer to realize the functions of the present invention, a medium storing that control program, and a WWW (World Wide Web) server from which that control program is downloaded are also within the scope of the present invention.
FIG. 11 is a block diagram showing the main part of the speech processing apparatus of the present invention. The speech processing apparatus of the present invention includes the distance calculation unit 102 and the desired speech spectrum expected value calculation unit 104.
The distance calculation unit 102 calculates, for each of a plurality of mixed-signal-spectrum standard patterns obtained by combining a plurality of desired-speech-spectrum standard patterns, a plurality of undesired-speech-spectrum standard patterns, and a plurality of noise-spectrum standard patterns, the distance between that standard pattern and the input signal.
The desired speech spectrum expected value calculation unit 104 calculates the expected value of the desired speech spectrum in the input signal using the distances calculated by the distance calculation unit 102.
The above embodiments also disclose the following speech processing apparatuses.
(Supplementary note 1) A speech processing apparatus in which the spectrum conversion unit 101 takes the signal output by the speech processing apparatus as its input signal.
According to such a form, when the processing of the input signal is repeated, using a small number of standard patterns in the first pass and increasing the number with each repetition improves the accuracy of the desired speech spectrum estimation from a coarse estimate toward a finer, more accurate one.
(Supplementary note 2) A speech processing apparatus in which the spectrum conversion unit 101 receives a plurality of input signals and each unit of the speech processing apparatus processes the plurality of input signals in parallel.
According to such a form, a plurality of estimates of the desired speech spectrum can be obtained.
This application claims priority based on Japanese Patent Application No. 2012-081586 filed on March 30, 2012, the entire disclosure of which is incorporated herein.
Although the present invention has been described above with reference to the embodiments, the present invention is not limited to the above embodiments. Various changes that can be understood by those skilled in the art can be made to the configuration and details of the present invention within its scope.
100, 200, 300, 400 Speech processing apparatus
101 Spectrum conversion unit
102 Distance calculation unit
103 Standard pattern storage unit
104 Desired speech spectrum expected value calculation unit
201 Mixed signal spectrum expected value calculation unit
202 Desired speech enhancement filter calculation unit
203 Desired speech spectrum estimation unit
301 Undesired speech spectrum expected value calculation unit
302 Noise spectrum expected value calculation unit
401 Reliability calculation unit
402 Mask setting unit
403 Mask applying unit

Claims (10)

1.  A speech processing apparatus comprising:
    a distance calculation unit which calculates, for each of a plurality of mixed-signal-spectrum standard patterns calculated by combining a plurality of desired-speech-spectrum standard patterns, a plurality of undesired-speech-spectrum standard patterns, and a plurality of noise-spectrum standard patterns, a distance between that standard pattern and an input signal; and
    a desired speech spectrum expected value calculation unit which calculates an expected value of a desired speech spectrum in the input signal using the distances.
2.  The speech processing apparatus according to claim 1, wherein the desired speech spectrum expected value calculation unit calculates the expected value of the desired speech spectrum in the input signal by taking a weighted average of the plurality of desired-speech-spectrum standard patterns using the distances calculated by the distance calculation unit.
3.  The speech processing apparatus according to claim 1 or 2, further comprising:
    a spectrum conversion unit which receives, as an input signal, a mixed signal in which desired speech uttered by a desired speaker and signals other than the desired speech are mixed, and converts the input signal into a spectrum for each unit time; and
    a standard pattern storage unit which stores a plurality of sets of a mixed-signal-spectrum standard pattern and a corresponding desired-speech-spectrum standard pattern,
    wherein the distance calculation unit calculates the distances between the spectrum of the input signal and the plurality of mixed-signal-spectrum standard patterns, and
    the desired speech spectrum expected value calculation unit uses the distances to take a weighted average of the corresponding desired-speech-spectrum standard patterns and calculates the expected value of the desired speech spectrum as output information.
4.  The speech processing apparatus according to any one of claims 1 to 3, further comprising:
    a mixed signal spectrum expected value calculation unit which takes a weighted average of the mixed-signal-spectrum standard patterns using the distances calculated by the distance calculation unit and calculates an expected value of the mixed signal spectrum;
    a desired speech spectrum enhancement filter calculation unit which calculates a desired speech spectrum enhancement filter from the expected value of the desired speech spectrum and the expected value of the mixed signal spectrum; and
    a desired speech spectrum estimation unit which calculates an estimate of the desired speech spectrum as output information from the desired speech spectrum enhancement filter and the spectrum of the input signal.
5.  The speech processing apparatus according to claim 3, further comprising an undesired speech spectrum expected value calculation unit and a noise spectrum expected value calculation unit, wherein:
    the standard pattern storage unit stores, in addition to the mixed-signal-spectrum standard patterns and the corresponding desired-speech-spectrum standard patterns, a plurality of sets of an undesired-speech-spectrum standard pattern and a noise-spectrum standard pattern;
    the undesired speech spectrum expected value calculation unit takes a weighted average of the undesired-speech-spectrum standard patterns using the distances calculated by the distance calculation unit and calculates an expected value of the undesired speech spectrum;
    the noise spectrum expected value calculation unit takes a weighted average of the noise-spectrum standard patterns using the distances calculated by the distance calculation unit and calculates an expected value of the noise spectrum;
    the desired speech spectrum enhancement filter calculation unit calculates the desired speech spectrum enhancement filter from the expected value of the desired speech spectrum, the expected value of the undesired speech spectrum, and the expected value of the noise spectrum; and
    the desired speech spectrum estimation unit calculates an estimate of the desired speech spectrum as output information from the desired speech spectrum enhancement filter and the input signal.
6.  The speech processing apparatus according to any one of claims 1 to 5, further comprising:
    a reliability calculation unit which calculates a reliability using the distances calculated by the distance calculation unit;
    a mask setting unit which sets a mask using the reliability; and
    a mask applying unit which adds the mask to the output information to generate new output information.
7.  The speech processing apparatus according to any one of claims 1 to 6, wherein the standard pattern storage unit stores, as a standard pattern, an average spectrum calculated for each phoneme unit or for each unit formed by clustering spectra of similar shape.
8.  The speech processing apparatus according to any one of claims 1 to 6, wherein the standard pattern storage unit stores, as a standard pattern, a probability density function representing the frequency of appearance of the spectrum, obtained for each phoneme unit or for each unit formed by clustering spectra of similar shape.
9.  A speech processing method comprising:
    calculating, for each of a plurality of mixed-signal-spectrum standard patterns calculated by combining a plurality of desired-speech-spectrum standard patterns, a plurality of undesired-speech-spectrum standard patterns, and a plurality of noise-spectrum standard patterns, a distance between that standard pattern and an input signal; and
    calculating, as output information, an expected value of a desired speech spectrum in the input signal using the distances.
10.  A speech processing program for causing a computer to execute:
    a process of calculating, for each of a plurality of mixed-signal-spectrum standard patterns calculated by combining a plurality of desired-speech-spectrum standard patterns, a plurality of undesired-speech-spectrum standard patterns, and a plurality of noise-spectrum standard patterns, a distance between that standard pattern and an input signal; and
    a process of calculating, as output information, an expected value of a desired speech spectrum in the input signal using the distances.
PCT/JP2013/001448 2012-03-30 2013-03-07 Audio processing device, audio processing method, and audio processing program WO2013145578A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
JP2012081586 2012-03-30
JP2012-081586 2012-03-30

Publications (1)

Publication Number Publication Date
WO2013145578A1 true WO2013145578A1 (en) 2013-10-03

Family

ID=49258898

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/JP2013/001448 WO2013145578A1 (en) 2012-03-30 2013-03-07 Audio processing device, audio processing method, and audio processing program

Country Status (2)

Country Link
JP (1) JPWO2013145578A1 (en)
WO (1) WO2013145578A1 (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2018031967A (en) * 2016-08-26 2018-03-01 日本電信電話株式会社 Sound source enhancement device, and method and program for the same

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2003005785A (en) * 2001-06-26 2003-01-08 National Institute Of Advanced Industrial & Technology Separating method and separating device for sound source
JP2003177783A (en) * 2001-12-11 2003-06-27 Toshiba Corp Voice recognition device, voice recognition system, and voice recognition program
JP2005077731A (en) * 2003-08-29 2005-03-24 Univ Waseda Sound source separating method and system therefor, and speech recognizing method and system therefor
JP2005084071A (en) * 2003-09-04 2005-03-31 Kddi Corp Speech recognizing apparatus
JP2006510060A (en) * 2002-12-13 2006-03-23 三菱電機株式会社 Method and system for separating a plurality of acoustic signals generated by a plurality of acoustic sources
JP2007033920A (en) * 2005-07-27 2007-02-08 Nec Corp System, method, and program for noise suppression

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2003005785A (en) * 2001-06-26 2003-01-08 National Institute Of Advanced Industrial & Technology Separating method and separating device for sound source
JP2003177783A (en) * 2001-12-11 2003-06-27 Toshiba Corp Voice recognition device, voice recognition system, and voice recognition program
JP2006510060A (en) * 2002-12-13 2006-03-23 三菱電機株式会社 Method and system for separating a plurality of acoustic signals generated by a plurality of acoustic sources
JP2005077731A (en) * 2003-08-29 2005-03-24 Univ Waseda Sound source separating method and system therefor, and speech recognizing method and system therefor
JP2005084071A (en) * 2003-09-04 2005-03-31 Kddi Corp Speech recognizing apparatus
JP2007033920A (en) * 2005-07-27 2007-02-08 Nec Corp System, method, and program for noise suppression

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2018031967A (en) * 2016-08-26 2018-03-01 日本電信電話株式会社 Sound source enhancement device, and method and program for the same

Also Published As

Publication number Publication date
JPWO2013145578A1 (en) 2015-12-10

Similar Documents

Publication Publication Date Title
Luo et al. Conv-tasnet: Surpassing ideal time–frequency magnitude masking for speech separation
Bahmaninezhad et al. A comprehensive study of speech separation: spectrogram vs waveform separation
Kwon et al. NMF-based speech enhancement using bases update
US20110125496A1 (en) Speech recognition device, speech recognition method, and program
US8680386B2 (en) Signal processing device, signal processing method, and program
Grais et al. Raw multi-channel audio source separation using multi-resolution convolutional auto-encoders
JP5375400B2 (en) Audio processing apparatus, audio processing method and program
US20190172442A1 (en) Method for forming the excitation signal for a glottal pulse model based parametric speech synthesis system
CN103559888A (en) Speech enhancement method based on non-negative low-rank and sparse matrix decomposition principle
JP6195548B2 (en) Signal analysis apparatus, method, and program
JP7176627B2 (en) Signal extraction system, signal extraction learning method and signal extraction learning program
EP2912660A1 (en) Method for determining a dictionary of base components from an audio signal
Sharma et al. Study of robust feature extraction techniques for speech recognition system
JPWO2015129760A1 (en) Signal processing apparatus, method and program
CN110797033A (en) Artificial intelligence-based voice recognition method and related equipment thereof
JPWO2014168022A1 (en) Signal processing apparatus, signal processing method, and signal processing program
CN108369803B (en) Method for forming an excitation signal for a parametric speech synthesis system based on a glottal pulse model
JP6348427B2 (en) Noise removal apparatus and noise removal program
Wiem et al. Unsupervised single channel speech separation based on optimized subspace separation
JP5974901B2 (en) Sound segment classification device, sound segment classification method, and sound segment classification program
JP5994639B2 (en) Sound section detection device, sound section detection method, and sound section detection program
JP2020060757A (en) Speaker recognition device, speaker recognition method, and program
JP5726790B2 (en) Sound source separation device, sound source separation method, and program
KR101802444B1 (en) Robust speech recognition apparatus and method for Bayesian feature enhancement using independent vector analysis and reverberation parameter reestimation
WO2013145578A1 (en) Audio processing device, audio processing method, and audio processing program

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 13767645

Country of ref document: EP

Kind code of ref document: A1

ENP Entry into the national phase

Ref document number: 2014507376

Country of ref document: JP

Kind code of ref document: A

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 13767645

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE