WO2013145578A1 - Audio processing device, audio processing method, and audio processing program
- Publication number
- WO2013145578A1 (PCT/JP2013/001448)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- spectrum
- speech
- expected value
- standard pattern
- distance
- Prior art date
Classifications
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/20—Speech recognition techniques specially adapted for robustness in adverse environments, e.g. in noise, of stress induced speech
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
- G10L21/00—Processing of the speech or voice signal to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
- G10L21/02—Speech enhancement, e.g. noise reduction or echo cancellation
- G10L21/0208—Noise filtering
- G10L21/0216—Noise filtering characterised by the method used for estimating noise
- G10L21/0232—Processing in the frequency domain
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
- G10L21/00—Processing of the speech or voice signal to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
- G10L21/02—Speech enhancement, e.g. noise reduction or echo cancellation
- G10L21/0208—Noise filtering
- G10L21/0264—Noise filtering characterised by the type of parameter measurement, e.g. correlation techniques, zero crossing techniques or predictive techniques
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
- G10L21/00—Processing of the speech or voice signal to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
- G10L21/02—Speech enhancement, e.g. noise reduction or echo cancellation
- G10L21/0272—Voice signal separating
- G10L21/028—Voice signal separating using properties of sound source
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
- G10L21/00—Processing of the speech or voice signal to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
- G10L21/02—Speech enhancement, e.g. noise reduction or echo cancellation
- G10L21/0272—Voice signal separating
- G10L21/0308—Voice signal separating characterised by the type of parameter measurement, e.g. correlation techniques, zero crossing techniques or predictive techniques
Definitions
- The present invention relates to a speech processing apparatus, a speech processing method, and a speech processing program that extract only a desired speech in a noisy environment or an environment where a plurality of people are talking.
- The method of Patent Document 1 cannot be expected to be effective when the performance of sound source separation is insufficient, for example due to non-directional environmental noise, reverberation, or the presence of more sound sources than microphones.
- In the method of Patent Document 2, when the performance of sound source separation is insufficient, a large mask value is applied and the desired speech is also erased.
- In the method of Patent Document 3, when the estimated noise value is low and the provisionally estimated speech differs significantly from the original speech, correction using the standard pattern may fail.
- An object of the present invention is to provide a speech processing apparatus, a speech processing method, and a speech processing program capable of accurately acquiring only desired speech from a signal in which the desired speech and other signals are mixed.
- The speech processing apparatus according to the present invention includes a distance calculation unit that calculates, for each of a plurality of standard patterns of mixed signal spectra obtained by combining a plurality of standard patterns of desired speech spectra, a plurality of standard patterns of undesired speech spectra, and a plurality of standard patterns of noise spectra, the distance to the input signal, and a desired speech spectrum expected value calculation unit that calculates an expected value of the desired speech spectrum in the input signal using the distances.
- In the speech processing method according to the present invention, a distance from the input signal is calculated for each of a plurality of standard patterns of mixed signal spectra obtained by combining a plurality of standard patterns of desired speech spectra, a plurality of standard patterns of undesired speech spectra, and a plurality of standard patterns of noise spectra, and an expected value of the desired speech spectrum in the input signal is calculated as output information using the distances.
- The speech processing program according to the present invention causes a computer to execute a process of calculating, for each of a plurality of standard patterns of mixed signal spectra obtained by combining a plurality of standard patterns of desired speech spectra, a plurality of standard patterns of undesired speech spectra, and a plurality of standard patterns of noise spectra, the distance to the input signal, and a process of calculating an expected value of the desired speech spectrum in the input signal as output information using the distances.
- According to the present invention, it is possible to accurately acquire only the desired speech from a signal in which the desired speech and other signals are mixed. In addition, a speech recognition result for only the desired speech can be obtained.
- Embodiment 1. A first embodiment of the present invention will be described below with reference to the drawings.
- FIG. 1 is a block diagram showing a configuration of a first embodiment of a speech processing apparatus according to the present invention.
- the speech processing apparatus 100 includes a spectrum conversion unit 101, a distance calculation unit 102, a standard pattern storage unit 103, and a desired speech spectrum expected value calculation unit 104.
- the speech processing apparatus 100 receives a mixed signal in which desired speech and other signals are mixed, and outputs an estimated value of the desired speech.
- the signal other than the desired voice includes, for example, voice due to speech other than the desired speaker and noise other than voice.
- a speech spectrum based on a desired speaker's utterance is referred to as a desired speech spectrum.
- a speech spectrum generated by speech other than the desired speaker is referred to as an undesired speech spectrum.
- the spectrum of a noise signal other than speech is called a noise spectrum.
- the spectrum conversion unit 101 inputs a mixed signal in which desired speech and other signals are mixed.
- the spectrum conversion unit 101 acquires a spectrum vector of the mixed signal.
- the distance calculation unit 102 calculates the distance between the spectrum vector of the mixed signal acquired by the spectrum conversion unit 101 and the standard pattern of the mixed signal spectrum stored in the standard pattern storage unit 103.
- the standard pattern storage unit 103 stores the standard pattern of the mixed signal spectrum.
- FIG. 2 is an explanatory diagram illustrating an example of information stored in the standard pattern storage unit 103.
- The standard pattern storage unit 103 stores sets each consisting of a standard pattern of a mixed signal spectrum, a standard pattern of a desired speech spectrum, a standard pattern of an undesired speech spectrum, and a standard pattern of a noise spectrum.
- the standard pattern of the spectrum expresses an average shape for a phoneme unit such as “A”, “I”, “U” or a unit obtained by clustering similar spectral shapes.
- μ represents a vector whose elements are the average values of the spectrum magnitude for each frequency band.
- the subscripts Y, A, B, and N indicate a mixed signal spectrum, a desired speech spectrum, an undesired speech spectrum, and a noise spectrum, respectively.
- the subscripts i, j, and k are serial numbers representing phoneme units or clustering units, respectively.
- μ_Yijk represents a mixed signal spectrum obtained by mixing the i-th desired speech spectrum, the j-th undesired speech spectrum, and the k-th noise spectrum.
- FIG. 3 is an explanatory diagram showing the relationship between μ_Yijk and μ_Ai, μ_Bj, μ_Nk.
- the horizontal axis represents the frequency band
- the vertical axis represents the magnitude of the spectrum.
- μ_Yijk and μ_Ai, μ_Bj, μ_Nk have the relationship shown in Equation 1 for each frequency band.
- The number of standard patterns of the mixed signal spectrum is I × J × K. That is, the standard pattern storage unit 103 stores the I × J × K combinations.
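As a rough illustration of how the combined standard patterns could be enumerated, the sketch below builds μ_Yijk from toy values. The additive combination is an assumption: the patent's Equation 1 is not reproduced in this text, and adding magnitude spectra is only a common approximation for independent sources; all numeric values are hypothetical.

```python
import itertools

# Toy standard patterns over 4 frequency bands (all values hypothetical).
mu_A = [[1.0, 2.0, 1.5, 0.5], [0.8, 1.0, 2.0, 1.2]]  # I = 2 desired-speech patterns
mu_B = [[0.5, 0.5, 1.0, 1.0], [1.5, 0.2, 0.3, 0.9]]  # J = 2 undesired-speech patterns
mu_N = [[0.1, 0.1, 0.2, 0.2]]                        # K = 1 noise pattern

# Assumed reading of Equation 1: element-wise additive combination,
# mu_Y[(i, j, k)] = mu_A[i] + mu_B[j] + mu_N[k].
mu_Y = {
    (i, j, k): [a + b + n for a, b, n in zip(mu_A[i], mu_B[j], mu_N[k])]
    for i, j, k in itertools.product(range(len(mu_A)), range(len(mu_B)), range(len(mu_N)))
}

print(len(mu_Y))                               # I * J * K = 4 combinations
print([round(v, 6) for v in mu_Y[(0, 0, 0)]])  # [1.6, 2.6, 2.7, 1.7]
```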
- standard patterns are prepared in advance by using statistical processing and machine learning techniques, and are stored in the standard pattern storage unit 103.
- the standard pattern of the undesired voice spectrum is created based on voice data of an unspecified speaker, for example.
- the standard pattern of the desired speech spectrum is created based on, for example, speech data of a specific speaker.
- the desired speech spectrum expected value calculation unit 104 calculates the expected value of the desired speech spectrum as output information using the distance calculated by the distance calculation unit 102.
- The spectrum conversion unit 101, the distance calculation unit 102, and the desired speech spectrum expected value calculation unit 104 are realized by a CPU (Central Processing Unit) provided in the speech processing apparatus 100. Further, the standard pattern storage unit 103 is realized by a storage device, such as a memory, provided in the speech processing apparatus 100.
- FIG. 4 is a flowchart showing the operation of the first embodiment of the speech processing apparatus.
- The spectrum conversion unit 101 performs short-time spectrum conversion on the mixed signal input to the speech processing apparatus 100.
- the spectrum conversion unit 101 obtains a spectrum for each frequency band, for example, every unit time such as 10 milliseconds (step S101).
- The speech processing apparatus 100 performs the processes of steps S102 and S103 on a vector whose elements are the spectra for each frequency band acquired every unit time, or on subband spectra in which a plurality of frequency bands are bundled. Let Y be the spectrum vector of the mixed signal.
- The distance calculation unit 102 calculates a distance d_ijk between the spectrum vector Y of the mixed signal and each standard pattern μ_Yijk of the mixed signal spectrum stored in the standard pattern storage unit 103 (step S102).
- The distance is calculated for each of the I × J × K standard patterns.
- the distance calculation unit 102 may obtain a general distance between vectors, for example, a Euclidean distance as the distance.
- the distance calculation unit 102 may obtain the distance after converting the spectrum vector into a logarithmic spectrum vector. Further, the distance calculation unit 102 may obtain the distance after converting into a feature amount generally used in speech recognition such as a cepstrum.
- the distance d ijk is a scalar quantity.
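As a sketch of step S102, using the Euclidean distance option mentioned above, the following computes one scalar d_ijk per stored pattern; the pattern values and the vector Y are hypothetical:

```python
import math

def euclidean(u, v):
    """General vector distance, used here as one option for d_ijk."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

# Hypothetical mixed-signal spectrum vector Y and two stored standard patterns.
Y = [1.5, 2.5, 2.5, 1.5]
standard_patterns = {
    (0, 0, 0): [1.6, 2.6, 2.7, 1.7],
    (1, 0, 0): [1.4, 1.6, 3.2, 2.4],
}

# Step S102: one scalar distance d_ijk per standard pattern (I * J * K in total).
d = {ijk: euclidean(Y, mu) for ijk, mu in standard_patterns.items()}
print(min(d, key=d.get))  # (0, 0, 0) -- the closest standard pattern
```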
- desired speech spectrum expected value calculation section 104 calculates an expected value of the desired speech spectrum (step S103).
- The expected value E_A of the desired speech spectrum is calculated as a weighted average of the standard patterns μ_Ai of the desired speech spectrum.
- E_A is a vector quantity. The weight is set small when the distance d_ijk is large and large when d_ijk is small. Equation 2 is an example of a formula for E_A.
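Equation 2 itself is not reproduced in this text; the sketch below uses one plausible weighting consistent with the description (small weight for large d_ijk), namely a softmax over negative distances. The weighting choice and all numeric values are assumptions:

```python
import math

# Distances d_ijk from step S102 and desired-speech patterns mu_A (hypothetical).
d = {(0, 0, 0): 0.3, (1, 0, 0): 1.5}
mu_A = [[1.0, 2.0, 1.5, 0.5], [0.8, 1.0, 2.0, 1.2]]

# Assumed weighting: softmax over negative distances, so weights shrink as d_ijk grows.
w = {ijk: math.exp(-dist) for ijk, dist in d.items()}
total = sum(w.values())
w = {ijk: wi / total for ijk, wi in w.items()}

# E_A: weighted average of mu_A[i], where i is the first index of (i, j, k).
bands = len(mu_A[0])
E_A = [sum(w[ijk] * mu_A[ijk[0]][band] for ijk in w) for band in range(bands)]
print([round(v, 3) for v in E_A])  # closer to mu_A[0], whose distance was smaller
```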
- Specifically, an output unit (not shown) of the speech processing apparatus 100 produces output, such as a speech recognition result, based on E_A, which is the output information.
- The speech processing apparatus 100 may output E_A as the estimated value of the desired speech as it is, may perform an inverse Fourier transform to convert it into a waveform signal and output that signal, or may perform speech recognition on the waveform signal and output the result as text.
- the standard pattern storage unit 103 stores a set of four standard patterns: a standard pattern of a mixed signal spectrum, a standard pattern of a desired speech spectrum, a standard pattern of an undesired speech spectrum, and a standard pattern of a noise spectrum.
- other storage methods are possible.
- The standard pattern storage unit 103 may store only sets of a standard pattern of the mixed signal spectrum and a standard pattern of the desired speech spectrum (the set A of standard patterns shown in FIG. 2). This is because only these two standard patterns are necessary when the desired speech spectrum is actually extracted from the input signal; the other two standard patterns are required only when the standard patterns of the mixed signal are created in advance. Also in this case, I × J × K sets of standard patterns are stored in the standard pattern storage unit 103. Specifically, J × K standard patterns of the mixed signal spectrum are stored for each standard pattern of the desired speech spectrum.
- Alternatively, sets consisting of the standard pattern of the undesired speech spectrum and the standard pattern of the noise spectrum (the set B of standard patterns shown in FIG. 2) may be stored in the standard pattern storage unit 103. This is because the standard pattern of the mixed signal spectrum can be calculated using Equation 1 when the desired speech is extracted from the input signal.
- As described above, the speech processing apparatus of this embodiment can accurately acquire only the speech of a desired speaker from a signal in which the speech of the desired speaker, speech from other speakers, and noise are mixed.
- Embodiment 2. FIG. 5 is a block diagram showing the configuration of the second embodiment of the speech processing apparatus according to the present invention.
- In addition to the components of the first embodiment, the speech processing apparatus 200 of the present embodiment includes a mixed signal spectrum expected value calculation unit 201, a desired speech enhancement filter calculation unit 202, and a desired speech spectrum estimation unit 203.
- the mixed signal spectrum expected value calculation unit 201 calculates the expected value of the mixed signal spectrum using the distance calculated by the distance calculation unit 102.
- the desired speech enhancement filter calculation unit 202 calculates a desired speech enhancement filter using the expected value of the desired speech spectrum and the expected value of the mixed signal spectrum.
- the desired speech spectrum estimation unit 203 calculates an estimated value of the desired speech spectrum as output information based on the spectrum vector of the mixed signal acquired by the spectrum conversion unit 101 and the desired speech enhancement filter.
- the mixed signal spectrum expected value calculation unit 201, the desired speech enhancement filter calculation unit 202, and the desired speech spectrum estimation unit 203 are realized by a CPU provided in the speech processing apparatus 200.
- the spectrum conversion unit 101, the distance calculation unit 102, the standard pattern storage unit 103, and the desired speech spectrum expected value calculation unit 104 are the same as those in the first embodiment, and thus the description thereof is omitted.
- FIG. 6 is a flowchart showing the operation of the second embodiment of the speech processing apparatus.
- Since the processing of steps S201 to S203 is the same as the processing of steps S101 to S103 of the first embodiment, description thereof is omitted.
- the mixed signal spectrum expected value calculation unit 201 calculates an expected value of the mixed signal spectrum (step S204).
- The expected value E_Y of the mixed signal spectrum is calculated as a weighted average of the standard patterns μ_Yijk of the mixed signal spectrum.
- E Y is a vector quantity.
- the calculation method in step S204 is the same as that in step S103 of the first embodiment. Equation 3 is an example of a formula for E Y.
- the desired speech enhancement filter calculation unit 202 calculates a desired speech enhancement filter (step S205).
- Equation 4 is an example of a calculation formula for the desired speech enhancement filter W. The division in Equation 4 is performed for each vector element. W is a vector quantity.
- The desired speech spectrum estimation unit 203 calculates an estimated value of the desired speech spectrum (step S206). The estimated value F_A of the desired speech spectrum is calculated by multiplying the mixed signal spectrum vector Y by the desired speech enhancement filter W for each element.
- F_A is a vector quantity. Equation 5 is an example of a formula for F_A.
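Steps S204 to S206 can be sketched for one frame as follows. Equations 3 to 5 are not reproduced here, so the Wiener-like filter W = E_A / E_Y is an assumed reading of Equation 4 (the text only states that the division is element-wise); all numeric values are hypothetical:

```python
E_A = [0.9, 1.7, 1.6, 0.7]   # expected desired-speech spectrum (hypothetical)
E_Y = [1.6, 2.6, 2.7, 1.7]   # expected mixed-signal spectrum (hypothetical)
Y   = [1.5, 2.5, 2.5, 1.5]   # observed mixed-signal spectrum vector

eps = 1e-12  # guard against division by zero in silent bands (an added safeguard)
W = [a / max(y, eps) for a, y in zip(E_A, E_Y)]   # assumed form of Equation 4
F_A = [wi * y for wi, y in zip(W, Y)]             # Equation 5: element-wise product

print([round(v, 3) for v in W])
print([round(v, 3) for v in F_A])
```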
- The speech processing apparatus 200 may output the estimated value of the desired speech spectrum, which is the output information, as it is, may perform an inverse Fourier transform to convert it back into a waveform signal and output that signal, or may perform speech recognition on the waveform signal and output the result as text.
- The desired speech enhancement filter calculation unit 202 may, for example, average or smooth the values of the desired speech spectrum expected value E_A and the mixed signal spectrum expected value E_Y over a plurality of unit times.
- Alternatively, the value of W calculated using Equation 4 may be averaged or smoothed over a plurality of unit times.
- In this embodiment, averaging and smoothing are performed at the stage of desired speech enhancement filter calculation, which makes the apparatus robust against instantaneous estimation errors. Therefore, in addition to the effects of the first embodiment, a more reliable output can be obtained.
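The smoothing mentioned above is not specified concretely; a first-order recursive (exponential) average over unit times, with a hypothetical coefficient alpha, is one simple realization:

```python
def smooth_filters(frames, alpha):
    """Exponentially smooth per-frame filter vectors W over unit times."""
    smoothed, state = [], None
    for w in frames:
        if state is None:
            state = list(w)
        else:
            state = [alpha * s + (1 - alpha) * x for s, x in zip(state, w)]
        smoothed.append(list(state))
    return smoothed

# A filter sequence with an instantaneous outlier in the middle frame.
frames = [[0.6, 0.6], [0.0, 1.0], [0.6, 0.6]]
print(smooth_filters(frames, alpha=0.7)[1])  # outlier pulled toward the previous value
```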
- Embodiment 3. FIG. 7 is a block diagram showing the configuration of the third embodiment of the speech processing apparatus according to the present invention.
- In addition to the components of the first embodiment, the speech processing apparatus 300 of the present embodiment includes an undesired speech spectrum expected value calculation unit 301 and a noise spectrum expected value calculation unit 302. Furthermore, the speech processing apparatus 300 includes a desired speech enhancement filter calculation unit 202 and a desired speech spectrum estimation unit 203.
- the undesired speech spectrum expected value calculation unit 301 calculates the expected value of the undesired speech spectrum using the distance calculated by the distance calculation unit 102.
- the expected noise spectrum value calculation unit 302 uses the distance calculated by the distance calculation unit 102 to calculate the expected value of the noise spectrum.
- the undesired speech spectrum expected value calculation unit 301 and the noise spectrum expected value calculation unit 302 are realized by a CPU included in the speech processing apparatus 300.
- the spectrum conversion unit 101, the distance calculation unit 102, the standard pattern storage unit 103, and the desired speech spectrum expected value calculation unit 104 are the same as those in the first embodiment, and thus the description thereof is omitted.
- desired speech enhancement filter calculation unit 202 and the desired speech spectrum estimation unit 203 are the same as those in the second embodiment, and thus description thereof is omitted.
- The desired speech enhancement filter calculation unit 202 of this embodiment calculates a desired speech enhancement filter using the expected value of the desired speech spectrum, the expected value of the undesired speech spectrum, and the expected value of the noise spectrum.
- FIG. 8 is a flowchart showing the operation of the third embodiment of the speech processing apparatus.
- Since the processing of steps S301 to S303 is the same as the processing of steps S101 to S103 of the first embodiment, description thereof is omitted.
- the undesired speech spectrum expected value calculation unit 301 calculates the expected value of the undesired speech spectrum (step S304).
- The expected value E_B of the undesired speech spectrum is calculated as a weighted average of the standard patterns μ_Bj of the undesired speech spectrum.
- E B is a vector quantity.
- the calculation method in step S304 is the same as that in step S103 in the first embodiment. Equation 6 is an example of a formula for E B.
- the expected noise spectrum value calculation unit 302 calculates the expected value of the noise spectrum (step S305).
- The expected value E_N of the noise spectrum is calculated as a weighted average of the standard patterns μ_Nk of the noise spectrum.
- E N is a vector quantity.
- the calculation method in step S305 is the same as that in step S103 of the first embodiment. Equation 7 is an example of a formula for E N.
- the desired speech enhancement filter calculation unit 202 calculates a desired speech enhancement filter (step S306).
- Equation 8 is an example of a calculation formula for the desired speech enhancement filter W. The division in Equation 8 is performed for each element of the vector. W is a vector quantity.
- The desired speech spectrum estimation unit 203 calculates an estimated value of the desired speech spectrum (step S307). The estimated value F_A of the desired speech spectrum is calculated by multiplying the mixed signal spectrum vector Y by the desired speech enhancement filter W for each element. F_A is a vector quantity.
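Equation 8 is not reproduced in this text. Given that W is built from E_A, E_B, and E_N with element-wise division, a natural (assumed) reading is the spectral-gain form W = E_A / (E_A + E_B + E_N), sketched below with hypothetical values:

```python
E_A = [0.9, 1.7, 1.6, 0.7]   # expected desired-speech spectrum (hypothetical)
E_B = [0.5, 0.4, 0.8, 0.6]   # expected undesired-speech spectrum (hypothetical)
E_N = [0.2, 0.2, 0.3, 0.3]   # expected noise spectrum (hypothetical)
Y   = [1.5, 2.5, 2.5, 1.5]   # observed mixed-signal spectrum vector

W = [a / (a + b + n) for a, b, n in zip(E_A, E_B, E_N)]  # assumed form of Equation 8
F_A = [wi * y for wi, y in zip(W, Y)]                    # step S307: element-wise product
print([round(v, 3) for v in W])  # each gain lies between 0 and 1
```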
- The speech processing apparatus 300 may output the estimated value of the desired speech spectrum, which is the output information, as it is, may perform an inverse Fourier transform to convert it back into a waveform signal and output that signal, or may perform speech recognition on the waveform signal and output the result as text.
- The desired speech enhancement filter calculation unit 202 may, for example, average or smooth the values of the desired speech spectrum expected value E_A, the undesired speech spectrum expected value E_B, and the noise spectrum expected value E_N over a plurality of unit times.
- the value of W calculated using Equation 8 may be averaged or smoothed over a plurality of unit times.
- Embodiment 4. FIG. 9 is a block diagram showing the configuration of the fourth embodiment of the speech processing apparatus according to the present invention.
- In addition to the components of the first embodiment, the speech processing apparatus 400 of the present embodiment includes a reliability calculation unit 401, a mask setting unit 402, and a mask applying unit 403.
- the reliability calculation unit 401 calculates the reliability using the distance calculated by the distance calculation unit 102.
- entropy is given as an example of reliability.
- Equation 10 is an example of a formula for calculating entropy H.
- Entropy is an indicator of randomness, and the greater the value, the lower the reliability.
- Entropy is a scalar quantity.
- the mask setting unit 402 sets a mask value according to the reliability.
- the mask setting unit 402 sets a large mask value when the reliability is low, and sets a small mask value when the reliability is high.
- the value of the entropy H itself or an amount obtained by multiplying the entropy H by a positive multiplier can be used as the mask M.
- a mask is a vector quantity.
- the mask values may all be the same value for each frequency band, or may be different values.
- the mask setting unit 402 sets all the mask values to the same value.
- Equation 11 shows an example of the relationship between the mask M and the entropy H.
- The mask applying unit 403 calculates, as output information, an estimated value F_A′ of the desired speech spectrum to which the mask has been applied.
- Equation 12 is an example of a formula for calculating F_A′.
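Equations 10 to 12 are not reproduced in this text, so the formulas below are assumptions consistent with the description: H is the Shannon entropy of the normalized pattern weights, the mask M broadcasts c * H to every band, and the mask is applied by subtraction with a floor at zero. The weights, spectrum, and multiplier c are all hypothetical:

```python
import math

w = [0.85, 0.1, 0.05]           # normalized weights derived from the distances d_ijk
F_A = [0.84, 1.63, 1.48, 0.62]  # estimated desired-speech spectrum (hypothetical)
c = 0.1                         # positive multiplier for the mask (assumption)

H = -sum(p * math.log(p) for p in w if p > 0)            # assumed Equation 10: entropy
M = [c * H] * len(F_A)                                   # assumed Equation 11: same value per band
F_A_masked = [max(f - m, 0.0) for f, m in zip(F_A, M)]   # assumed Equation 12

print(round(H, 3))  # concentrated weights -> low entropy -> high reliability, small mask
```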
- the reliability calculation unit 401, the mask setting unit 402, and the mask applying unit 403 are realized by a CPU provided in the voice processing device 400.
- According to this embodiment, when the reliability of the estimated value of the desired speech spectrum is low and other signals may therefore be mixed in, a large mask is applied, so the adverse effects of those other signals can be suppressed.
- the reliability calculation unit 401 calculates the reliability using a distance from a standard pattern prepared in advance. Therefore, it can be expected that the reliability in the present embodiment is a more accurate value than the reliability used in the method described in Patent Document 2.
- Embodiment 5. Since the configuration of the fifth embodiment is the same as that of the first embodiment, description thereof is omitted. Note that the configuration of the fifth embodiment may be the same as that of any of the other embodiments.
- FIG. 10 is an explanatory diagram illustrating an example of information stored in the standard pattern storage unit 103 according to the fifth embodiment.
- In the fifth embodiment, the standard pattern storage unit 103 stores a probability density function (Probability Density Function) instead of an average value as the standard pattern of the mixed signal spectrum, as shown in FIG. 10. Specifically, the standard pattern storage unit 103 stores, as the standard pattern of the mixed signal spectrum, a probability density function representing the frequency of appearance of the spectrum for each phoneme unit or each unit obtained by clustering similar spectrum shapes.
- As the probability density function, for example, a Gaussian distribution function or a Gaussian mixture distribution function can be used.
- a variance value is used in addition to the average value.
- In this case, the distance calculation unit 102 uses, as the distance measure, the Bhattacharyya distance, the Mahalanobis distance, the likelihood, the log likelihood, or the like.
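For a diagonal-covariance Gaussian standard pattern (mean mu and variance var per band), the variance-aware distance measures mentioned above can be sketched as follows; the squared Mahalanobis distance and Gaussian log-likelihood shown here use hypothetical values:

```python
import math

def mahalanobis_sq(y, mu, var):
    """Squared Mahalanobis distance for a diagonal covariance."""
    return sum((yi - mi) ** 2 / vi for yi, mi, vi in zip(y, mu, var))

def log_likelihood(y, mu, var):
    """Log-likelihood of y under a diagonal-covariance Gaussian."""
    d2 = mahalanobis_sq(y, mu, var)
    log_det = sum(math.log(vi) for vi in var)
    return -0.5 * (d2 + log_det + len(y) * math.log(2 * math.pi))

Y = [1.5, 2.5]                      # observed spectrum vector (hypothetical)
mu, var = [1.6, 2.6], [0.04, 0.09]  # standard pattern with per-band variances
print(round(mahalanobis_sq(Y, mu, var), 3))  # 0.361
```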
- Since the standard pattern of the mixed signal spectrum can be expressed using higher-order statistics such as the variance in addition to the average, the desired speech can be estimated with higher accuracy.
- Embodiment 6. The sixth embodiment of the present invention will be described below.
- Since the configuration of the sixth embodiment is the same as that of the first embodiment, description thereof is omitted. Note that the configuration of the sixth embodiment may be the same as that of any of the other embodiments.
- In the sixth embodiment, the way of storing the sets of standard patterns in the standard pattern storage unit 103 is different.
- In the first embodiment, the standard pattern storage unit 103 stores the I × J × K standard patterns of the mixed signal spectrum created by combining one each of three types of standard patterns: I standard patterns of the desired speech spectrum, J standard patterns of the undesired speech spectrum, and K standard patterns of the noise spectrum.
- In the sixth embodiment, the standard pattern storage unit 103 stores L standard patterns of the mixed signal spectrum created by clustering and merging spectrum patterns having similar shapes among the I × J × K standard patterns of the mixed signal spectrum.
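The clustering step is not specified in detail; one simple (assumed) realization is a greedy merge of patterns whose shapes lie within a threshold of an existing cluster centroid, reducing I × J × K patterns to L representatives. The threshold and pattern values are hypothetical:

```python
import math

def euclidean(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def cluster_patterns(patterns, threshold):
    """Greedily merge similar patterns; returns the L cluster-mean patterns."""
    clusters = []  # each cluster holds [element-wise sum, member count]
    for p in patterns:
        for c in clusters:
            centroid = [s / c[1] for s in c[0]]
            if euclidean(p, centroid) < threshold:
                c[0] = [s + x for s, x in zip(c[0], p)]
                c[1] += 1
                break
        else:
            clusters.append([list(p), 1])
    return [[s / c[1] for s in c[0]] for c in clusters]

patterns = [[1.0, 2.0], [1.1, 2.1], [3.0, 0.5], [2.9, 0.6]]
reduced = cluster_patterns(patterns, threshold=0.5)
print(len(reduced))  # L = 2 representative patterns
```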
- By merging standard patterns, the number of standard patterns stored in the standard pattern storage unit 103 can be adjusted according to the computational resources, memory resources, and the like of the speech processing apparatus.
- Embodiment 7. The seventh embodiment of the present invention will be described below.
- Since the configuration of the seventh embodiment is the same as that of the first embodiment, description thereof is omitted. Note that the configuration of the seventh embodiment may be the same as that of any of the other embodiments.
- In the seventh embodiment, the spectrum conversion unit 101 receives the signal output from the speech processing apparatus as input again. The speech processing apparatus then repeats the processes of steps S101 to S103.
- The speech processing apparatus may use a small number of standard patterns in the first pass and increase the number of standard patterns as the processing is repeated.
- the estimated value of the desired speech is an average of coarse spectral shapes when the number of standard patterns is small, and an average of spectral shapes in finer units when the number of standard patterns is large.
- Embodiment 8. The eighth embodiment of the present invention will be described below.
- Since the configuration of the eighth embodiment is the same as that of the first embodiment, description thereof is omitted. Note that the configuration of the eighth embodiment may be the same as that of any of the other embodiments.
- the spectrum conversion unit 101 inputs signals from a plurality of microphones. Then, the sound processing apparatus performs the processes of steps S101 to S103 in parallel for each of the plurality of signals input by the spectrum conversion unit 101.
- The speech processing apparatus may output the estimated value having the highest reliability among the obtained estimated values, may output the average of the obtained estimated values, or may output the maximum of the obtained estimated values.
- the spectrum conversion unit 101 may input a signal after performing the sound source separation process.
- The present invention may be applied to a system composed of a plurality of devices or to a single device. Furthermore, the present invention can also be applied to a case where a control program (speech processing program) that realizes the functions of the embodiments is supplied directly or remotely to a system or apparatus. Therefore, a control program installed in a computer to realize the functions of the present invention, a medium storing the control program, and a WWW (World Wide Web) server from which the control program is downloaded are also included in the scope of the present invention.
- FIG. 11 is a block diagram showing the main part of the speech processing apparatus of the present invention.
- the speech processing apparatus of the present invention includes a distance calculation unit 102 and a desired speech spectrum expected value calculation unit 104.
- The distance calculation unit 102 calculates, for each of a plurality of standard patterns of mixed signal spectra obtained by combining a plurality of standard patterns of desired speech spectra, a plurality of standard patterns of undesired speech spectra, and a plurality of standard patterns of noise spectra, the distance between that standard pattern and the input signal.
- the desired speech spectrum expected value calculation unit 104 calculates the expected value of the desired speech spectrum in the input signal using the distance calculated by the distance calculation unit 102.
- the spectrum conversion unit 101 may receive, as its input signal, the signal that the speech processing apparatus itself has output.
- the number of standard patterns may be kept small in the first pass and increased each time the processing is repeated. This makes it possible to refine the estimate of the desired speech spectrum step by step, from a coarse estimate to a fine, highly accurate one.
- the spectrum conversion unit 101 receives a plurality of input signals, and each unit of the speech processing apparatus processes the plurality of input signals in parallel.
- 100, 200, 300, 400 Speech processing apparatus; 101 Spectrum conversion unit; 102 Distance calculation unit; 103 Standard pattern storage unit; 104 Desired speech spectrum expected value calculation unit; 201 Mixed signal spectrum expected value calculation unit; 202 Desired speech enhancement filter calculation unit; 203 Desired speech spectrum estimation unit; 301 Undesired speech spectrum expected value calculation unit; 302 Noise spectrum expected value calculation unit; 401 Reliability calculation unit; 402 Mask setting unit; 403 Mask application unit
Abstract
Description
Embodiment 1. A first embodiment of the present invention will be described below with reference to the drawings.
Embodiment 2. A second embodiment of the present invention will be described below with reference to the drawings.
Embodiment 3. A third embodiment of the present invention will be described below with reference to the drawings.
Embodiment 4. A fourth embodiment of the present invention will be described below with reference to the drawings.
Embodiment 5. A fifth embodiment of the present invention will be described below with reference to the drawings.
Embodiment 6. A sixth embodiment of the present invention will be described below.
Embodiment 7. A seventh embodiment of the present invention will be described below.
Embodiment 8. An eighth embodiment of the present invention will be described below.
Claims (10)
- A speech processing apparatus comprising:
a distance calculation unit which calculates, for each of a plurality of mixed-signal spectrum standard patterns obtained by combining a plurality of desired speech spectrum standard patterns, a plurality of undesired speech spectrum standard patterns, and a plurality of noise spectrum standard patterns, the distance between that pattern and an input signal; and
a desired speech spectrum expected value calculation unit which calculates an expected value of the desired speech spectrum in the input signal using the distances.
- The speech processing apparatus according to claim 1, wherein the desired speech spectrum expected value calculation unit calculates the expected value of the desired speech spectrum in the input signal by taking a weighted average of the plurality of desired speech spectrum standard patterns using the distances calculated by the distance calculation unit.
- The speech processing apparatus according to claim 1 or claim 2, further comprising:
a spectrum conversion unit which receives, as the input signal, a mixed signal in which desired speech uttered by a desired speaker is mixed with signals other than the desired speech, and converts the input signal into a spectrum for each unit time; and
a standard pattern storage unit which stores a plurality of pairs of a mixed-signal spectrum standard pattern and the corresponding desired speech spectrum standard pattern,
wherein the distance calculation unit calculates the distances between the spectrum of the input signal and the plurality of mixed-signal spectrum standard patterns, and
the desired speech spectrum expected value calculation unit takes a weighted average of the corresponding desired speech spectrum standard patterns using the distances and calculates the expected value of the desired speech spectrum as output information.
- The speech processing apparatus according to any one of claims 1 to 3, further comprising:
a mixed-signal spectrum expected value calculation unit which takes a weighted average of the mixed-signal spectrum standard patterns using the distances calculated by the distance calculation unit and calculates an expected value of the mixed-signal spectrum;
a desired speech spectrum enhancement filter calculation unit which calculates a desired speech spectrum enhancement filter from the expected value of the desired speech spectrum and the expected value of the mixed-signal spectrum; and
a desired speech spectrum estimation unit which calculates an estimated value of the desired speech spectrum as output information from the desired speech spectrum enhancement filter and the spectrum of the input signal.
- The speech processing apparatus according to claim 3, further comprising an undesired speech spectrum expected value calculation unit and a noise spectrum expected value calculation unit,
wherein the standard pattern storage unit stores, in addition to the pairs of a mixed-signal spectrum standard pattern and the corresponding desired speech spectrum standard pattern, a plurality of sets each including an undesired speech spectrum standard pattern and a noise spectrum standard pattern,
the undesired speech spectrum expected value calculation unit takes a weighted average of the undesired speech spectrum standard patterns using the distances calculated by the distance calculation unit and calculates an expected value of the undesired speech spectrum,
the noise spectrum expected value calculation unit takes a weighted average of the noise spectrum standard patterns using the distances calculated by the distance calculation unit and calculates an expected value of the noise spectrum,
the desired speech spectrum enhancement filter calculation unit calculates a desired speech spectrum enhancement filter from the expected value of the desired speech spectrum, the expected value of the undesired speech spectrum, and the expected value of the noise spectrum, and
the desired speech spectrum estimation unit calculates an estimated value of the desired speech spectrum as output information from the desired speech spectrum enhancement filter and the input signal.
- The speech processing apparatus according to any one of claims 1 to 5, further comprising:
a reliability calculation unit which calculates a reliability using the distances calculated by the distance calculation unit;
a mask setting unit which sets a mask using the reliability; and
a mask application unit which adds the mask to the output information to generate new output information.
- The speech processing apparatus according to any one of claims 1 to 6, wherein the standard pattern storage unit stores, as each standard pattern, an average spectrum calculated per phoneme or per cluster of spectra with similar shapes.
- The speech processing apparatus according to any one of claims 1 to 6, wherein the standard pattern storage unit stores, as each standard pattern, a probability density function representing the frequency of occurrence of spectra, obtained per phoneme or per cluster of spectra with similar shapes.
- A speech processing method comprising:
calculating, for each of a plurality of mixed-signal spectrum standard patterns obtained by combining a plurality of desired speech spectrum standard patterns, a plurality of undesired speech spectrum standard patterns, and a plurality of noise spectrum standard patterns, the distance between that pattern and an input signal; and
calculating, as output information, an expected value of the desired speech spectrum in the input signal using the distances.
- A speech processing program for causing a computer to execute:
a process of calculating, for each of a plurality of mixed-signal spectrum standard patterns obtained by combining a plurality of desired speech spectrum standard patterns, a plurality of undesired speech spectrum standard patterns, and a plurality of noise spectrum standard patterns, the distance between that pattern and an input signal; and
a process of calculating, as output information, an expected value of the desired speech spectrum in the input signal using the distances.
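The enhancement-filter claims describe building a filter from the expected values of the desired speech, undesired speech, and noise spectra, then applying it to the input. The claims do not give the filter formula; a Wiener-style gain is one common choice and is used here purely as an illustrative assumption, as is the small `eps` regularizer.

```python
import numpy as np

def desired_speech_estimate(input_spec, e_desired, e_undesired, e_noise, eps=1e-12):
    """Estimate the desired speech spectrum from the three expected spectra.

    A Wiener-style gain (assumed, not specified by the claims):
    gain = E[desired] / (E[desired] + E[undesired] + E[noise])
    """
    gain = e_desired / (e_desired + e_undesired + e_noise + eps)
    return gain * input_spec  # enhancement filter applied to the input spectrum
```

When the undesired-speech and noise expectations are zero the gain is one and the input passes through unchanged; as they grow, the corresponding frequency bins are attenuated.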
Applications Claiming Priority (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
JP2012081586 | 2012-03-30 | ||
JP2012-081586 | 2012-03-30 |
Publications (1)
Publication Number | Publication Date |
---|---|
WO2013145578A1 true WO2013145578A1 (en) | 2013-10-03 |
Family
ID=49258898
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
PCT/JP2013/001448 WO2013145578A1 (en) | 2012-03-30 | 2013-03-07 | Audio processing device, audio processing method, and audio processing program |
Country Status (2)
Country | Link |
---|---|
JP (1) | JPWO2013145578A1 (en) |
WO (1) | WO2013145578A1 (en) |
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
JP2018031967A (en) * | 2016-08-26 | 2018-03-01 | 日本電信電話株式会社 | Sound source enhancement device, and method and program for the same |
Citations (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
JP2003005785A (en) * | 2001-06-26 | 2003-01-08 | National Institute Of Advanced Industrial & Technology | Separating method and separating device for sound source |
JP2003177783A (en) * | 2001-12-11 | 2003-06-27 | Toshiba Corp | Voice recognition device, voice recognition system, and voice recognition program |
JP2005077731A (en) * | 2003-08-29 | 2005-03-24 | Univ Waseda | Sound source separating method and system therefor, and speech recognizing method and system therefor |
JP2005084071A (en) * | 2003-09-04 | 2005-03-31 | Kddi Corp | Speech recognizing apparatus |
JP2006510060A (en) * | 2002-12-13 | 2006-03-23 | 三菱電機株式会社 | Method and system for separating a plurality of acoustic signals generated by a plurality of acoustic sources |
JP2007033920A (en) * | 2005-07-27 | 2007-02-08 | Nec Corp | System, method, and program for noise suppression |
2013
- 2013-03-07 JP JP2014507376A patent/JPWO2013145578A1/en active Pending
- 2013-03-07 WO PCT/JP2013/001448 patent/WO2013145578A1/en active Application Filing
Also Published As
Publication number | Publication date |
---|---|
JPWO2013145578A1 (en) | 2015-12-10 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
Luo et al. | Conv-tasnet: Surpassing ideal time–frequency magnitude masking for speech separation | |
Bahmaninezhad et al. | A comprehensive study of speech separation: spectrogram vs waveform separation | |
Kwon et al. | NMF-based speech enhancement using bases update | |
US20110125496A1 (en) | Speech recognition device, speech recognition method, and program | |
US8680386B2 (en) | Signal processing device, signal processing method, and program | |
Grais et al. | Raw multi-channel audio source separation using multi-resolution convolutional auto-encoders | |
JP5375400B2 (en) | Audio processing apparatus, audio processing method and program | |
US20190172442A1 (en) | Method for forming the excitation signal for a glottal pulse model based parametric speech synthesis system | |
CN103559888A (en) | Speech enhancement method based on non-negative low-rank and sparse matrix decomposition principle | |
JP6195548B2 (en) | Signal analysis apparatus, method, and program | |
JP7176627B2 (en) | Signal extraction system, signal extraction learning method and signal extraction learning program | |
EP2912660A1 (en) | Method for determining a dictionary of base components from an audio signal | |
Sharma et al. | Study of robust feature extraction techniques for speech recognition system | |
JPWO2015129760A1 (en) | Signal processing apparatus, method and program | |
CN110797033A (en) | Artificial intelligence-based voice recognition method and related equipment thereof | |
JPWO2014168022A1 (en) | Signal processing apparatus, signal processing method, and signal processing program | |
CN108369803B (en) | Method for forming an excitation signal for a parametric speech synthesis system based on a glottal pulse model | |
JP6348427B2 (en) | Noise removal apparatus and noise removal program | |
Wiem et al. | Unsupervised single channel speech separation based on optimized subspace separation | |
JP5974901B2 (en) | Sound segment classification device, sound segment classification method, and sound segment classification program | |
JP5994639B2 (en) | Sound section detection device, sound section detection method, and sound section detection program | |
JP2020060757A (en) | Speaker recognition device, speaker recognition method, and program | |
JP5726790B2 (en) | Sound source separation device, sound source separation method, and program | |
KR101802444B1 (en) | Robust speech recognition apparatus and method for Bayesian feature enhancement using independent vector analysis and reverberation parameter reestimation | |
WO2013145578A1 (en) | Audio processing device, audio processing method, and audio processing program |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
121 | Ep: the epo has been informed by wipo that ep was designated in this application |
Ref document number: 13767645 Country of ref document: EP Kind code of ref document: A1 |
|
ENP | Entry into the national phase |
Ref document number: 2014507376 Country of ref document: JP Kind code of ref document: A |
|
NENP | Non-entry into the national phase |
Ref country code: DE |
|
122 | Ep: pct application non-entry in european phase |
Ref document number: 13767645 Country of ref document: EP Kind code of ref document: A1 |
|