WO2021149213A1 - Learning device, speech emphasis device, methods therefor, and program - Google Patents

Learning device, speech emphasis device, methods therefor, and program

Info

Publication number
WO2021149213A1
Authority
WO
WIPO (PCT)
Prior art keywords
sound
quality evaluation
subjective
mask
function
Prior art date
Application number
PCT/JP2020/002270
Other languages
French (fr)
Japanese (ja)
Inventor
Yuma Koizumi (悠馬 小泉)
Original Assignee
Nippon Telegraph and Telephone Corporation (日本電信電話株式会社)
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Nippon Telegraph and Telephone Corporation (日本電信電話株式会社)
Priority to PCT/JP2020/002270
Publication of WO2021149213A1

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00: Machine learning
    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00: Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/02: Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L21/0208: Noise filtering
    • G10L21/0216: Noise filtering characterised by the method used for estimating noise

Definitions

  • the present invention relates to a speech enhancement technique.
  • T is an integer greater than or equal to 1.
  • the purpose of speech enhancement is to estimate s from x with high accuracy.
  • the observation signal X = Q(x) ∈ C^{F×K} is obtained by expressing the observation signal x in the time-frequency domain via a frequency-domain transform Q such as the short-time Fourier transform.
  • T, F, and K are positive integers: T represents the number of samples of the observation signal x belonging to a predetermined time interval (time length), F represents the number of discrete frequencies belonging to a predetermined band in the time-frequency domain (bandwidth), and K represents the number of discrete times belonging to a predetermined time interval in the time-frequency domain (time length).
  • M(x; θ) ◎ Q(x) represents multiplying Q(x) by the T-F mask M(x; θ).
  • ◎ denotes the Hadamard product.
  • θ is a parameter of the DNN, and is usually learned so as to minimize the signal-to-distortion ratio (SDR) cost L_SDR represented by, for example, equation (2).
  • the reason equation (2) is widely used as a cost function for DNN speech enhancement is that L_SDR is differentiable with respect to θ.
  • DNN training is performed by a gradient method using gradients obtained by the error backpropagation method.
  • ∂_θ is the differential operator with respect to θ. Since ∂_θ L_SDR can be calculated analytically, DNN training can be performed efficiently.
  • λ is a positive constant.
  • in Non-Patent Document 3, a method of approximating P(s, y) with a differentiable function D_φ(s, y) having a parameter φ has been proposed.
  • if D_φ(s, y) is designed as a DNN, for example,
  • D_φ(s, y) is differentiable with respect to y, and
  • ∂_θ D_φ(s, y) can be calculated analytically.
  • L_M(GAN) = (D_φ(s, y) - 1)^2    (7)
  • the problem with the prior art described in Non-Patent Document 3 is the stability of learning. In order for learning to proceed so that the OSQAS of test data improves stably, the approximation of equation (5) needs to be highly accurate. However, with this prior art, OSQAS does not improve stably even as the number of training iterations increases. Therefore, there is still room for improvement in this prior art.
  • the present invention has been made in view of these points, and its purpose is to approximate an objective index that imitates subjective human sound quality evaluation with a differentiable function with high accuracy, and to stabilize the learning of that differentiable function.
  • a second approximation function is obtained by updating a first approximation function, which approximates an objective index imitating the subjective sound quality evaluation of its input, so as to minimize a first cost function based on the sum of: the error between the subjective sound quality evaluation of the target sound and the objective index imitating the subjective sound quality evaluation of the target sound, obtained by inputting a target speech signal representing the target sound into the first approximation function; the error between the subjective sound quality evaluation of the observed sound based on the target sound and the objective index imitating the subjective sound quality evaluation of the observed sound, obtained by inputting an observation signal representing the observed sound into the first approximation function; and the error between the subjective sound quality evaluation of the emphasized sound corresponding to the masked sound signal obtained by applying a first mask to the observation signal and the objective index imitating the subjective sound quality evaluation of the emphasized sound, obtained by inputting an emphasized speech signal representing the emphasized sound into the first approximation function.
  • an objective index that imitates subjective human sound quality evaluation can thereby be approximated by a differentiable function with high accuracy, and the learning of that differentiable function can be stabilized.
  • FIG. 1 is a block diagram illustrating the functional configuration of the learning device of the embodiment.
  • FIG. 2 is a diagram for explaining the learning method of the embodiment.
  • FIG. 3 is a diagram for explaining the learning method of the embodiment.
  • FIG. 4 is a block diagram for explaining the functional configuration of the speech enhancement device of the embodiment.
  • FIG. 5 is a diagram for explaining the speech enhancement method of the embodiment. FIG. 6 is a diagram for intuitively illustrating the learning result of a differentiable function that approximates OSQAS. FIG. 7 is a diagram for explaining the learning results of the embodiment.
  • FIG. 8 is a block diagram for explaining a hardware configuration.
  • the present embodiment provides a method for improving the accuracy of the approximation in equation (5) and for stabilizing the learning of the differentiable function D_φ(s, ·).
  • in the present embodiment, a cost function L_D is used instead of the cost function L_D(GAN). FIG. 6(a) intuitively illustrates D_φ(s, ·) learned so as to minimize the cost function L_D(GAN), and FIG. 6(b) intuitively illustrates D_φ(s, ·) learned so as to minimize the cost function L_D.
  • the solid line represents the true OSQAS, and the dotted and broken lines represent the learned D_φ(s, ·).
  • the horizontal axis represents the amount of noise contained in the input observed sound, and the vertical axis represents the PESQ score.
  • in the prior art, D_φ(s, ·) is learned so as to minimize the GAN cost function L_D(GAN), that is, so as to minimize (i) the error ε_φ(y) with respect to the OSQAS P(s, y) before speech enhancement and (ii) the error ε_φ(s) with respect to the OSQAS P(s, s) under perfect speech enhancement.
  • here, (iii) the error with respect to the OSQAS when speech enhancement fails is not taken into consideration. Therefore, when D_φ(s, ·) is learned by the conventional technique, it can behave like either the dotted line or the broken line in FIG. 6(a).
  • for example, with M as the mini-batch size, the cost value of the cost function L_D of equation (8) is calculated, and φ is learned so as to minimize it.
  • M is a positive integer.
  • j = 1, ..., M.
  • s^(j) is the j-th target speech signal.
  • x^(j) is the j-th observation signal.
  • y^(j) is the emphasized speech signal.
  • s^(j), x^(j), y^(j), and n^(j) are each time-series signals of T samples in the time domain; the target speech signal s^(j) represents the target sound, the observation signal x^(j) represents the observed sound, the emphasized speech signal y^(j) represents the emphasized sound, and the noise signal n^(j) represents noise. That is, the first approximation function D_φ(s^(j), ·), which approximates an objective index imitating the subjective sound quality evaluation P(s^(j), ·) of its input, is updated so as to minimize the first cost function L_D based on the sum of the error ε_φ(s^(j)) between the subjective sound quality evaluation P(s^(j), s^(j)) of the target sound and the objective index D_φ(s^(j), s^(j)) obtained by inputting the target speech signal s^(j), the error ε_φ(x^(j)) between the subjective sound quality evaluation P(s^(j), x^(j)) of the observed sound and the objective index D_φ(s^(j), x^(j)) obtained by inputting the observation signal x^(j), and the error ε_φ(y^(j)) between the subjective sound quality evaluation P(s^(j), y^(j)) of the emphasized sound and the objective index D_φ(s^(j), y^(j)) obtained by inputting the emphasized speech signal y^(j).
  • the learning device 11 of the present embodiment includes storage units 111 and 112, a mask estimation application unit 113a, mask application units 113b and 113c, model application units 114a to 114c, approximation function application units 115a to 115c, gradient calculation units 116a to 116h, parameter update units 117a to 117d, a memory 118, and a control unit 119. The learning device 11 executes each process under the control of the control unit 119, and the data obtained in each process is stored in the memory 118, read out as needed, and used in other processes.
  • the speech enhancement device 12 of the present embodiment includes a model storage unit 120, an input unit 121, a frequency domain conversion unit 122, a mask estimation unit 123, a mask application unit 124, a time domain conversion unit 125, an output unit 126, a control unit 127, and a memory 128.
  • the speech enhancement device 12 executes each process under the control of the control unit 127, and the data obtained in each process is stored in the memory 128, read out as needed, and used in other processes.
  • training data consisting of target speech signals s^(j) and corresponding observation signals x^(j) is prepared.
  • j = 1, ..., M.
  • M is an integer of 1 or more.
  • the target speech signals s^(1), ..., s^(M) are stored in the storage unit 111, and the observation signals x^(1), ..., x^(M) are stored in the storage unit 112. Under this premise, the following steps 1, 2, and 3 are executed.
  • ≪Step 1: Pre-training of the model M_θ≫
  • in step 1, the model M_θ is pre-trained.
  • the control unit 119 sets the parameter θ of the model M_θ to an initial value (step S119aa).
  • the model application unit 114a extracts the observation signal x^(i) from the storage unit 112, applies the model M_θ to the observation signal x^(i), and obtains and outputs the mask M(x^(i); θ) (step S114a).
  • the mask M(x^(i); θ) is input to the gradient calculation unit 116a.
  • the gradient calculation unit 116a extracts the observation signal x^(i) from the storage unit 111 and obtains and outputs the gradient ∂_φ L_SDR^M(i).
  • the gradients ∂_φ L_SDR^M(1), ..., ∂_φ L_SDR^M(N) are input to the gradient calculation unit 116b.
  • the gradient calculation unit 116b uses the gradients ∂_φ L_SDR^M(1), ..., ∂_φ L_SDR^M(N) to obtain and output the gradient ∂_φ L_SDR^M.
  • the gradient ∂_φ L_SDR^M is input to the parameter update unit 117a.
  • the control unit 119 determines whether a convergence condition is satisfied.
  • examples of the convergence condition include that the processing of steps S114a, S116a, S119b, S116b, and S117a has been repeated a certain number of times, and that the amounts of change in θ and L_SDR^M are equal to or less than predetermined values. If the convergence condition is not satisfied, the process returns to step S119ab. If the convergence condition is satisfied, the parameter update unit 117a outputs the parameter θ and the process of step 1 ends (step S119c).
  • ≪Step 2: Pre-training of the approximation function D_φ(s, ·)≫
  • in step 2, the approximation function D_φ(s, ·) is pre-trained.
  • the process of step 2 will be described with reference to FIG. 2.
  • the control unit 119 sets the parameter φ of the approximation function D_φ(s, ·) to an initial value (step S119da).
  • the parameter θ obtained in step 1 is input to the mask estimation application unit 113a.
  • the mask estimation application unit 113a extracts the observation signal x^(j) from the storage unit 112 and applies the model M_θ to the observation signal x^(j) to obtain the mask M(x^(j); θ). Further, the mask estimation application unit 113a applies the mask M(x^(j); θ) to the observation signal x^(j) to obtain and output the emphasized speech signal y^(j) as in equation (1b) (step S113a).
  • y^(j) = Q+(M(x^(j); θ) ◎ Q(x^(j)))    (1b)
  • the observation signal x^(j) and the emphasized speech signal y^(j) are input to the approximation function application unit 115a.
  • the approximation function application unit 115a further extracts the target speech signal s^(j) from the storage unit 111.
  • the approximation function application unit 115a inputs s^(j), x^(j), and y^(j) into the approximation function D_φ(s, ·) and obtains and outputs D_φ(s^(j), s^(j)), D_φ(s^(j), x^(j)), and D_φ(s^(j), y^(j)) (step S115a).
  • D_φ(s^(j), s^(j)), D_φ(s^(j), x^(j)), and D_φ(s^(j), y^(j)) are input to the gradient calculation unit 116c. The separately computed values P(s^(j), s^(j)), P(s^(j), x^(j)), and P(s^(j), y^(j)) are also input to the gradient calculation unit 116c. Next, the gradient calculation unit 116c obtains the error ε_φ(s^(j)) between P(s^(j), s^(j)) and D_φ(s^(j), s^(j)), the error ε_φ(x^(j)) between P(s^(j), x^(j)) and D_φ(s^(j), x^(j)), and the error ε_φ(y^(j)) between P(s^(j), y^(j)) and D_φ(s^(j), y^(j)), and obtains and outputs the gradient ∂_φ L_D(j) according to equation (8a) (step S116c).
  • the gradients ∂_φ L_D(1), ..., ∂_φ L_D(M) are input to the gradient calculation unit 116d.
  • the gradient calculation unit 116d uses the gradients ∂_φ L_D(1), ..., ∂_φ L_D(M) to obtain and output the gradient ∂_φ L_D.
  • the gradient ∂_φ L_D is input to the parameter update unit 117b.
  • the parameter update unit 117b updates the parameter φ by a gradient method using the gradient ∂_φ L_D; that is, the parameter update unit 117b updates and outputs the parameter φ so as to minimize L_D of equation (8) (step S117b).
  • the control unit 119 determines whether a convergence condition is satisfied.
  • examples of the convergence condition include that the processing of steps S115a, S116c, S119e, S116d, and S117b has been repeated a certain number of times, and that the amounts of change in φ and L_D are equal to or less than predetermined values. If the convergence condition is not satisfied, the process returns to step S119db. If the convergence condition is satisfied, the parameter update unit 117b outputs the parameter φ and the process of step 2 ends (step S119f).
  • step 2 corresponds to an approximation function learning step in which the first approximation function D_φ(s^(j), ·), which approximates an objective index imitating the subjective sound quality evaluation P(s^(j), ·) of its input, is updated so as to minimize the first cost function L_D based on the sum of: the error ε_φ(s^(j)) between the subjective sound quality evaluation P(s^(j), s^(j)) of the target sound and the objective index D_φ(s^(j), s^(j)) imitating it, obtained by inputting the target speech signal s^(j) representing the target sound into the first approximation function; the error ε_φ(x^(j)) between the subjective sound quality evaluation P(s^(j), x^(j)) of the observed sound based on the target sound and the objective index D_φ(s^(j), x^(j)) imitating it, obtained by inputting the observation signal x^(j) representing the observed sound; and the error ε_φ(y^(j)) between the subjective sound quality evaluation P(s^(j), y^(j)) of the emphasized sound corresponding to the masked sound signal obtained by applying the first mask M(x^(j); θ) to the observation signal x^(j) and the objective index D_φ(s^(j), y^(j)) imitating it, obtained by inputting the emphasized speech signal y^(j) representing the emphasized sound. The second approximation function D_φ(s^(j), ·) is thereby obtained.
  • in this example, the first mask M(x^(j); θ) is obtained by applying the first model M_θ to the observation signal x^(j), and the approximation function learning step updates the first approximation function D_φ(s^(j), ·), without updating the first model M_θ, so as to minimize the first cost function L_D, thereby obtaining the second approximation function D_φ(s^(j), ·).
  • the control unit 119 sets the parameters θ and φ obtained in steps 1 and 2 as the initial values (step S119ga).
  • the model application unit 114b extracts the observation signal x^(i) from the storage unit 112, applies the model M_θ to the observation signal x^(i), and obtains and outputs the mask M(x^(i); θ) (step S114b).
  • the mask M(x^(i); θ) is input to the mask application unit 113b.
  • Q(x^(i)) is further input to the mask application unit 113b, and the mask application unit 113b obtains and outputs the emphasized speech signal y^(i) according to equation (1a) (step S113b).
  • the emphasized speech signal y^(i) is input to the approximation function application unit 115b.
  • the approximation function application unit 115b further extracts the target speech signal s^(i) from the storage unit 111.
  • the approximation function application unit 115b inputs s^(i) and y^(i) into the approximation function D_φ(s, ·) to obtain and output D_φ(s^(i), y^(i)) (step S115b).
  • D_φ(s^(i), y^(i)) is input to the gradient calculation unit 116e.
  • the gradient calculation unit 116e obtains and outputs the gradient ∂_φ L_M(i).
  • L_M(i) satisfies the following equation (4b) (step S116e).
  • L_M(i) = -E[D_φ(s^(i), y^(i))]_{x,y}    (4b)
  • the gradients ∂_φ L_M(1), ..., ∂_φ L_M(N) are input to the gradient calculation unit 116f.
  • the gradient calculation unit 116f uses the gradients ∂_φ L_M(1), ..., ∂_φ L_M(N) to obtain and output the gradient ∂_φ L_M.
  • the gradient ∂_φ L_M is input to the parameter update unit 117c.
  • the parameter update unit 117c updates the parameter θ by a gradient method using the gradient ∂_φ L_M, and the updated parameter θ is input to the model application units 114b and 114c (step S117c).
  • the model application unit 114c extracts the observation signal x^(j) from the storage unit 112, applies the model M_θ to the observation signal x^(j), and obtains and outputs the mask M(x^(j); θ) (step S114c).
  • the mask M(x^(j); θ) is input to the mask application unit 113c.
  • Q(x^(j)) is further input to the mask application unit 113c, and the mask application unit 113c obtains and outputs the emphasized speech signal y^(j) according to the following equation (1a') (step S113c).
  • y^(j) = Q+(M(x^(j); θ) ◎ Q(x^(j)))    (1a')
  • the emphasized speech signal y^(j) is input to the approximation function application unit 115c. Further, the approximation function application unit 115c extracts the target speech signal s^(j) from the storage unit 111 and extracts the observation signal x^(j) from the storage unit 112. The approximation function application unit 115c inputs s^(j), x^(j), and y^(j) into the approximation function D_φ(s, ·) and obtains and outputs D_φ(s^(j), s^(j)), D_φ(s^(j), x^(j)), and D_φ(s^(j), y^(j)) (step S115c).
  • D_φ(s^(j), s^(j)), D_φ(s^(j), x^(j)), and D_φ(s^(j), y^(j)) are input to the gradient calculation unit 116g. The separately computed values P(s^(j), s^(j)), P(s^(j), x^(j)), and P(s^(j), y^(j)) are also input to the gradient calculation unit 116g. Next, the gradient calculation unit 116g obtains the error ε_φ(s^(j)) between P(s^(j), s^(j)) and D_φ(s^(j), s^(j)), the error ε_φ(x^(j)) between P(s^(j), x^(j)) and D_φ(s^(j), x^(j)), and the error ε_φ(y^(j)) between P(s^(j), y^(j)) and D_φ(s^(j), y^(j)), and obtains and outputs the gradient ∂_φ L_D(j) according to equation (8a) (step S116g).
  • the gradients ∂_φ L_D(1), ..., ∂_φ L_D(M) are input to the gradient calculation unit 116h, which uses them to obtain and output the gradient ∂_φ L_D.
  • the gradient ∂_φ L_D is input to the parameter update unit 117d.
  • the parameter update unit 117d updates the parameter φ by a gradient method using the gradient ∂_φ L_D; that is, the parameter update unit 117d updates and outputs the parameter φ so as to minimize L_D of equation (8).
  • the updated parameter φ is input to the approximation function application units 115b and 115c (step S117d).
  • the control unit 119 determines whether a convergence condition is satisfied.
  • examples of the convergence condition include that the processing of steps S114b, S113b, S115b, S116e, S119h, S116f, S117c, S114c, S113c, S115c, S116g, S119e, S116h, and S117d has been repeated a certain number of times, and that the amounts of change in θ, φ, L_M, and L_D are equal to or less than predetermined values. If the convergence condition is satisfied, the parameter update unit 117c outputs the parameter θ, the parameter update unit 117d outputs the parameter φ, and the process of step 3 ends (step S119j).
  • the processing of steps S114c, S113c, S115c, S116g, S119e, S116h, and S117d corresponds to an approximation function learning step in which the first approximation function D_φ(s^(j), ·), which approximates an objective index imitating the subjective sound quality evaluation P(s^(j), ·) of its input, is updated, without updating the first model M_θ, so as to minimize the first cost function L_D, thereby obtaining the second approximation function. Here, the first mask M(x^(j); θ) is obtained by applying the first model M_θ to the observation signal x^(j).
  • in the processing of steps S114b, S113b, S115b, S116e, S119h, S116f, and S117c, the first model M_θ is updated to obtain the second model M_θ.
  • <Speech enhancement processing> Information identifying the model M_θ and the approximation function D_φ(s, ·) learned as described above is stored in the model storage unit 120 of the speech enhancement device 12 (FIG. 4). For example, the parameters θ and φ output in step S119j are stored in the model storage unit 120. Under this premise, the following speech enhancement processing is executed.
  • an observation signal x, which is a time-series acoustic signal in the time domain, is input to the input unit 121 of the speech enhancement device 12 (FIG. 4) (step S121).
  • the observation signal x is input to the frequency domain conversion unit 122.
  • the observation signal x is input to the mask estimation unit 123.
  • the mask estimation unit 123 applies the model M_θ to the observation signal x to estimate and output the T-F mask M(x; θ) (step S123).
  • the observation signal X and the T-F mask M(x; θ) are input to the mask application unit 124.
  • the mask application unit 124 applies (multiplies) the T-F mask M(x; θ) to the observation signal X in the time-frequency domain, and obtains and outputs the masked speech signal M(x; θ) ◎ X (step S124).
  • the speech signal M(x; θ) ◎ X is input to the time domain conversion unit 125.
  • the time domain conversion unit 125 applies a time-domain transform Q+ such as the inverse STFT to the masked speech signal M(x; θ) ◎ X to obtain and output the emphasized speech y in the time domain (equation (1)) (step S126). A minimal code sketch of this enhancement processing is given at the end of this Definitions section.
  • the learning device 11 and the speech enhancement device 12 in each embodiment are devices configured by, for example, a general-purpose or dedicated computer equipped with a processor (hardware processor) such as a CPU (central processing unit) and memory such as a RAM (random-access memory) and a ROM (read-only memory) executing a predetermined program.
  • This computer may have one processor and memory, or may have a plurality of processors and memory.
  • This program may be installed in a computer or may be recorded in a ROM or the like in advance.
  • some or all of the processing units may be configured using electronic circuitry that realizes the processing functions on its own, instead of electronic circuitry (circuitry) that realizes the functional configuration by reading a program as a CPU does.
  • the electronic circuit constituting one device may include a plurality of CPUs.
  • FIG. 8 is a block diagram illustrating the hardware configurations of the learning device 11 and the speech enhancement device 12 in each embodiment.
  • the learning device 11 and the speech enhancement device 12 of this example include a CPU (Central Processing Unit) 10a, an output unit 10b, an output unit 10c, a RAM (Random Access Memory) 10d, a ROM (Read Only Memory) 10e, an auxiliary storage device 10f, and a bus 10g.
  • the CPU 10a of this example has a control unit 10aa, a calculation unit 10ab, and a register 10ac, and executes various arithmetic processes according to various programs read into the register 10ac.
  • the output unit 10b is an output terminal, a display, or the like on which data is output.
  • the output unit 10c is a LAN card or the like controlled by the CPU 10a that has read a predetermined program.
  • the RAM 10d is a SRAM (Static Random Access Memory), a DRAM (Dynamic Random Access Memory), or the like, and has a program area 10da in which a predetermined program is stored and a data area 10db in which various data are stored.
  • the auxiliary storage device 10f is, for example, a hard disk, MO (Magneto-Optical disc), a semiconductor memory, or the like, and has a program area 10fa for storing a predetermined program and a data area 10fb for storing various data.
  • the bus 10g connects the CPU 10a, the output unit 10b, the output unit 10c, the RAM 10d, the ROM 10e, and the auxiliary storage device 10f so that information can be exchanged.
  • the CPU 10a writes the program stored in the program area 10fa of the auxiliary storage device 10f to the program area 10da of the RAM 10d according to the read OS (Operating System) program.
  • the CPU 10a writes various data stored in the data area 10fb of the auxiliary storage device 10f to the data area 10db of the RAM 10d. Then, the address on the RAM 10d in which this program or data is written is stored in the register 10ac of the CPU 10a.
  • the control unit 10aa of the CPU 10a sequentially reads out the addresses stored in the register 10ac, reads the program or data from the area on the RAM 10d indicated by the read address, and causes the calculation unit 10ab to sequentially execute the operations indicated by the program.
  • the calculation result is stored in the register 10ac.
  • the above program can be recorded on a computer-readable recording medium.
  • a computer-readable recording medium is a non-transitory recording medium. Examples of such a recording medium include a magnetic recording device, an optical disc, a magneto-optical recording medium, and a semiconductor memory.
  • the distribution of this program is carried out, for example, by selling, transferring, renting, etc., a portable recording medium such as a DVD or CD-ROM on which the program is recorded.
  • the program may be stored in the storage device of the server computer, and the program may be distributed by transferring the program from the server computer to another computer via a network.
  • the computer that executes such a program first temporarily stores, for example, the program recorded on the portable recording medium or the program transferred from the server computer in its own storage device. Then, when the process is executed, the computer reads the program stored in its own storage device and executes the process according to the read program.
  • the computer may read the program directly from the portable recording medium and execute processing according to the program; furthermore, each time a program is transferred from the server computer to this computer, processing according to the received program may be executed sequentially.
  • the above processing may also be executed by a so-called ASP (Application Service Provider) type service that realizes the processing functions only through execution instructions and result acquisition, without transferring the program from the server computer to this computer.
  • the program in this embodiment includes information that is used for processing by a computer and is equivalent to a program (data that is not a direct command to the computer but has the property of defining the processing of the computer, etc.).
  • the present device is configured by executing a predetermined program on a computer, but at least a part of these processing contents may be realized by hardware.
  • OSQAS is not limited to PESQ, and may be any value as long as it is an objective index that imitates the subjective evaluation of human sound quality.
  • in step 3 described above, the model M_θ was trained first, but in step 3 the approximation function D_φ(s, ·) may be trained first.
  • although a DNN is used in the above-described embodiment, other models such as a probabilistic model may be used.
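The following is a minimal, self-contained sketch of the speech enhancement processing of steps S121 to S126 (input, frequency-domain transform, mask estimation with the trained model M_θ, mask application, and time-domain reconstruction). It is an illustration only: the STFT parameters, the mask-network interface, and the use of the soundfile package for file I/O are assumptions, not part of this publication.

```python
import soundfile as sf   # assumed available for wav file I/O
import torch

def enhance_file(in_path, out_path, mask_net, n_fft=512, hop=128):
    audio, fs = sf.read(in_path)                          # observation signal x (step S121)
    x = torch.tensor(audio, dtype=torch.float32)
    window = torch.hann_window(n_fft)
    X = torch.stft(x, n_fft, hop_length=hop, window=window, return_complex=True)  # X = Q(x) (step S122)
    with torch.no_grad():
        mask = mask_net(X.abs())                          # T-F mask M(x; theta) (step S123)
    Y = mask * X                                          # masked signal M(x; theta) * X (step S124)
    y = torch.istft(Y, n_fft, hop_length=hop, window=window, length=x.shape[-1])  # time domain (steps S125, S126)
    sf.write(out_path, y.numpy(), fs)
    return y
```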

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Software Systems (AREA)
  • Physics & Mathematics (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Evolutionary Computation (AREA)
  • Human Computer Interaction (AREA)
  • Signal Processing (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Artificial Intelligence (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Data Mining & Analysis (AREA)
  • Health & Medical Sciences (AREA)
  • Medical Informatics (AREA)
  • Quality & Reliability (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Computational Linguistics (AREA)
  • Soundproofing, Sound Blocking, And Sound Damping (AREA)

Abstract

In this invention, a second approximation function is obtained by updating a first approximation function that approximates an objective index simulating the subjective sound-quality evaluation of an input so as to minimize a first cost function that is based on the sum of: an error between the subjective sound-quality evaluation of a target sound and an objective index that simulates the subjective sound-quality evaluation of the target sound and is obtained by inputting a target speech signal representing the target sound into a first approximation function; an error between the subjective sound-quality evaluation of an observation sound based on the target sound and an objective index that simulates the subjective sound-quality evaluation of the observation sound and is obtained by inputting an observation signal representing the observation sound into the first approximation function; and an error between the subjective sound-quality evaluation of an emphatic sound that corresponds to a masked sound signal obtained by applying a first mask to the observation signal and an objective index that simulates the subjective sound-quality evaluation of the emphatic sound and is obtained by inputting an emphatic speech signal representing the emphatic sound into the first approximation function.

Description

Learning device, speech enhancement device, methods therefor, and program
 The present invention relates to a speech enhancement technique.
 It is assumed that the observation signal x ∈ R^T in the time domain of T samples is a mixed signal x = s + n of the target speech signal s and the noise signal n. T is an integer greater than or equal to 1. The purpose of speech enhancement is to estimate s from x with high accuracy. As illustrated in equation (1), in DNN speech enhancement, an observation signal X = Q(x) ∈ C^{F×K} is obtained by expressing the observation signal x in the time-frequency domain via a frequency-domain transform Q such as the short-time Fourier transform; X is multiplied by a time-frequency (T-F) mask M(x; θ) estimated using a DNN to obtain the masked speech signal M(x; θ) ◎ Q(x); and a time-domain transform Q+ such as the inverse STFT is applied to the masked speech signal M(x; θ) ◎ Q(x) to obtain the emphasized speech signal y.
 y = Q+(M(x; θ) ◎ Q(x))    (1)
Here, R represents the set of all real numbers, and C represents the set of all complex numbers. T, F, and K are positive integers: T represents the number of samples of the observation signal x belonging to a predetermined time interval (time length), F represents the number of discrete frequencies belonging to a predetermined band in the time-frequency domain (bandwidth), and K represents the number of discrete times belonging to a predetermined time interval in the time-frequency domain (time length). M(x; θ) ◎ Q(x) represents multiplying Q(x) by the T-F mask M(x; θ). ◎ denotes the Hadamard product. θ is a parameter of the DNN, and is usually learned so as to minimize the signal-to-distortion ratio (SDR) cost L_SDR represented by, for example, the following equation (2).
 L_SDR = -(clip_β[SDR(s, y)] + clip_β[SDR(n, m)]) / 2    (2)
where
Figure JPOXMLDOC01-appb-M000001
and
Figure JPOXMLDOC01-appb-M000002
is the L2 norm, m = x - y, clip_β[χ] = β·tanh(χ/β), and β > 0 is a clipping constant. For example, β = 20 (see, for example, Non-Patent Documents 1 to 4).
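For illustration, the following is a minimal PyTorch sketch of the masking pipeline of equation (1) and the clipped SDR cost of equation (2). The mask network interface (mask_net), the STFT parameters, and the exact form of SDR (taken here as 10·log10 of a power ratio) are assumptions for the sketch; the document does not fix them.

```python
import torch

def sdr(reference, estimate, eps=1e-8):
    # A standard signal-to-distortion ratio in dB:
    # 10 * log10(||reference||^2 / ||reference - estimate||^2).
    num = torch.sum(reference ** 2, dim=-1)
    den = torch.sum((reference - estimate) ** 2, dim=-1) + eps
    return 10.0 * torch.log10(num / den + eps)

def clip_beta(chi, beta=20.0):
    # Soft clipping of equation (2): clip_beta[chi] = beta * tanh(chi / beta).
    return beta * torch.tanh(chi / beta)

def l_sdr(s, y, n, m, beta=20.0):
    # Equation (2): L_SDR = -(clip_beta[SDR(s, y)] + clip_beta[SDR(n, m)]) / 2, with m = x - y.
    return -(clip_beta(sdr(s, y), beta) + clip_beta(sdr(n, m), beta)) / 2.0

def enhance(x, mask_net, n_fft=512, hop=128):
    # Equation (1): y = Q+( M(x; theta) (Hadamard) Q(x) ), with Q = STFT and Q+ = inverse STFT.
    window = torch.hann_window(n_fft, device=x.device)
    X = torch.stft(x, n_fft, hop_length=hop, window=window, return_complex=True)   # Q(x)
    mask = mask_net(X.abs())          # T-F mask M(x; theta), same shape as X, values in [0, 1]
    Y = mask * X                      # element-wise (Hadamard) product
    return torch.istft(Y, n_fft, hop_length=hop, window=window, length=x.shape[-1])  # Q+(...)
```

Because every operation above is differentiable in θ (the parameters of mask_net), the gradient of L_SDR can be obtained by ordinary backpropagation, which is the property the next paragraph relies on.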
The reason equation (2) is widely used as the cost function for DNN speech enhancement is that L_SDR is differentiable with respect to θ. In general, DNN training is performed by a gradient method using gradients obtained by the error backpropagation method.
Figure JPOXMLDOC01-appb-M000003
Here, ∂_θ is the differential operator with respect to θ. Since ∂_θ L_SDR can be calculated analytically, DNN training can be performed efficiently. λ is a positive constant.
It is known that functions whose gradients can be obtained analytically, such as SDR, do not necessarily match subjective sound quality evaluation. Therefore, when θ is learned so as to minimize L_SDR, the loss decreases but the sound quality may nevertheless deteriorate. To solve this, there is a method of using, as the cost function, an objective sound quality assessment score (OSQAS) that imitates subjective human sound quality evaluation, such as the perceptual evaluation of speech quality (PESQ) (Non-Patent Documents 1 to 3). Letting P(s, y) be the OSQAS computed from s and y, the cost function is as follows.
 L_P = -E[P(s, y)]_{x,y}    (4)
Here, E[p(x)]_x is the expected value of p(x) with respect to x. The problem with this cost function is that, for many OSQAS, the gradient ∂_θ L_P with respect to θ cannot be calculated analytically.
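To make the obstacle concrete: PESQ, a typical OSQAS, is computed by an external routine on raw waveforms, so no gradient with respect to θ flows through it. The sketch below uses the third-party pesq package (an assumption; any OSQAS implementation could be substituted) to estimate L_P of equation (4) by a mini-batch average.

```python
import numpy as np
from pesq import pesq  # third-party PESQ implementation (assumed to be installed)

def osqas_pesq(s, y, fs=16000):
    # P(s, y): PESQ score of the enhanced signal y against the clean reference s.
    # The score is produced outside any autograd graph, so its gradient with
    # respect to the mask parameters theta cannot be obtained by backpropagation.
    return pesq(fs, np.asarray(s, dtype=np.float64), np.asarray(y, dtype=np.float64), "wb")

def l_p(batch_s, batch_y, fs=16000):
    # Equation (4): L_P = -E[P(s, y)], estimated by averaging over a mini-batch.
    scores = [osqas_pesq(s, y, fs) for s, y in zip(batch_s, batch_y)]
    return -float(np.mean(scores))
```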
Therefore, in the prior art described in Non-Patent Document 3, a method of approximating P(s, y) with a differentiable function D_φ(s, y) having a parameter φ has been proposed.
Figure JPOXMLDOC01-appb-M000004
Here, if D_φ(s, y) is designed as a DNN, for example, D_φ(s, y) is differentiable with respect to y, and ∂_θ D_φ(s, y) can be calculated analytically. To achieve this, D_φ(s, y) is first learned so as to minimize the following cost function.
 L_D(GAN) = ε_φ(s) + ε_φ(y)    (6)
where ε_φ(·) is the squared error between the true OSQAS and the estimated OSQAS, ε_φ(·) = (P(s, ·) - D_φ(s, ·))^2. That is, in this prior art, D_φ(s, ·) is learned so as to minimize (i) the error ε_φ(y) for the current OSQAS and (ii) the error ε_φ(s) for the OSQAS under perfect speech enhancement. Next, the model M_θ that generates the time-frequency mask M(x; θ) from the observation signal x is learned so as to minimize the following cost function L_M(GAN).
 L_M(GAN) = (D_φ(s, y) - 1)^2    (7)
Here, OSQAS is normalized so that 0 ≤ P(s, y) ≤ 1, and it is assumed that P(s, s) = 1.
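A minimal sketch of the prior-art cost functions of equations (6) and (7). Here d_phi stands for the differentiable network D_φ (taking the clean reference and a second signal and returning a scalar in [0, 1]); its architecture and calling convention are assumptions.

```python
def eps_phi(d_phi, s, sig, p_true):
    # Squared error between the true OSQAS P(s, sig) and the estimate D_phi(s, sig).
    return (p_true - d_phi(s, sig)) ** 2

def l_d_gan(d_phi, s, y, p_ss, p_sy):
    # Equation (6): L_D(GAN) = eps_phi(s) + eps_phi(y).
    # Only the perfect-enhancement point (s, s) and the current estimate (s, y) are fitted.
    return eps_phi(d_phi, s, s, p_ss) + eps_phi(d_phi, s, y, p_sy)

def l_m_gan(d_phi, s, y):
    # Equation (7): L_M(GAN) = (D_phi(s, y) - 1)^2, using the normalization P(s, s) = 1.
    return (d_phi(s, y) - 1.0) ** 2
```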
The problem with the prior art described in Non-Patent Document 3 is the stability of learning. In order for learning to proceed so that the OSQAS of test data improves stably, the approximation of equation (5) needs to be highly accurate. However, with this prior art, OSQAS does not improve stably even as the number of training iterations increases. Therefore, there is still room for improvement in this prior art.
 The present invention has been made in view of these points, and its purpose is to approximate an objective index that imitates subjective human sound quality evaluation with a differentiable function with high accuracy, and to stabilize the learning of that differentiable function.
 A second approximation function is obtained by updating a first approximation function, which approximates an objective index imitating the subjective sound quality evaluation of its input, so as to minimize a first cost function based on the sum of: the error between the subjective sound quality evaluation of the target sound and the objective index imitating the subjective sound quality evaluation of the target sound, obtained by inputting a target speech signal representing the target sound into the first approximation function; the error between the subjective sound quality evaluation of the observed sound based on the target sound and the objective index imitating the subjective sound quality evaluation of the observed sound, obtained by inputting an observation signal representing the observed sound into the first approximation function; and the error between the subjective sound quality evaluation of the emphasized sound corresponding to the masked sound signal obtained by applying a first mask to the observation signal and the objective index imitating the subjective sound quality evaluation of the emphasized sound, obtained by inputting an emphasized speech signal representing the emphasized sound into the first approximation function.
 In the present invention, an objective index that imitates subjective human sound quality evaluation can be approximated by a differentiable function with high accuracy, and the learning of that differentiable function can be stabilized.
 FIG. 1 is a block diagram illustrating the functional configuration of the learning device of the embodiment. FIG. 2 is a diagram for explaining the learning method of the embodiment. FIG. 3 is a diagram for explaining the learning method of the embodiment. FIG. 4 is a block diagram for explaining the functional configuration of the speech enhancement device of the embodiment. FIG. 5 is a diagram for explaining the speech enhancement method of the embodiment. FIG. 6 is a diagram for intuitively illustrating the learning result of a differentiable function that approximates OSQAS. FIG. 7 is a diagram for explaining the learning results of the embodiment. FIG. 8 is a block diagram for explaining a hardware configuration.
Hereinafter, embodiments of the present invention will be described with reference to the drawings.
 [Principle]
 First, the principle will be explained. The present embodiment provides a method for improving the accuracy of the approximation in equation (5) and for stabilizing the learning of the differentiable function D_φ(s, ·). In this embodiment, a cost function L_D is used instead of the cost function L_D(GAN). FIG. 6(a) intuitively illustrates D_φ(s, ·) learned so as to minimize the cost function L_D(GAN), and FIG. 6(b) intuitively illustrates D_φ(s, ·) learned so as to minimize the cost function L_D. The solid line represents the true OSQAS, and the dotted and broken lines represent the learned D_φ(s, ·). The horizontal axis represents the amount of noise contained in the input observed sound, and the vertical axis represents the PESQ score.
In the non-patent literature, D_φ(s, ·) is learned so as to minimize the cost function L_D(GAN), that is, so as to minimize (i) the error ε_φ(y) with respect to the OSQAS P(s, y) before speech enhancement and (ii) the error ε_φ(s) with respect to the OSQAS P(s, s) under perfect speech enhancement. Here, (iii) the error with respect to the OSQAS when speech enhancement fails is not taken into consideration. Therefore, when D_φ(s, ·) is learned by the conventional technique, it can behave like either the dotted line or the broken line in FIG. 6(a). If D_φ(s, ·) is learned as shown by the broken line, learning does not proceed because the model M_θ is at a stationary point at the position of P(s, y), and in the worst case learning proceeds in a direction that degrades OSQAS. Therefore, in the present embodiment, D_φ(s, ·) is learned so as to minimize a cost function L_D that also takes viewpoint (iii) into account. Since it is difficult to obtain the OSQAS when speech enhancement fails, the OSQAS before speech enhancement, P(s, x), is used in its place. That is, D_φ(s, ·) is learned so as to minimize (i) the error ε_φ(y) with respect to the OSQAS P(s, y) before speech enhancement, (ii) the error ε_φ(s) with respect to the OSQAS P(s, s) under perfect speech enhancement, and (iii) the error ε_φ(x) with respect to the OSQAS P(s, x) when speech enhancement fails. In other words, as shown in FIG. 6(b), learning is performed so that D_φ(s, s), D_φ(s, y), and D_φ(s, x) approximate the three points P(s, s), P(s, y), and P(s, x), respectively. For example, with M as the mini-batch size, the cost value of the following cost function L_D is computed, and φ is learned so as to minimize it.
Figure JPOXMLDOC01-appb-M000005
Here, M is a positive integer, j = 1, ..., M, s^(j) is the j-th target speech signal, x^(j) is the j-th observation signal, and y^(j) is the emphasized speech signal. For the j-th noise signal n^(j), the observation signal x^(j) is a mixed signal x^(j) = s^(j) + n^(j) of the target speech signal s^(j) and the noise signal n^(j). s^(j), x^(j), y^(j), and n^(j) are each time-series signals of T samples in the time domain; the target speech signal s^(j) represents the target sound, the observation signal x^(j) represents the observed sound, the emphasized speech signal y^(j) represents the emphasized sound, and the noise signal n^(j) represents noise. That is, the first approximation function D_φ(s^(j), ·), which approximates an objective index imitating the subjective sound quality evaluation P(s^(j), ·) of its input, is updated so as to minimize the first cost function L_D based on the sum of: the error ε_φ(s^(j)) between the subjective sound quality evaluation P(s^(j), s^(j)) of the target sound and the objective index D_φ(s^(j), s^(j)) imitating it, obtained by inputting the target speech signal s^(j) representing the target sound into the first approximation function; the error ε_φ(x^(j)) between the subjective sound quality evaluation P(s^(j), x^(j)) of the observed sound based on the target sound and the objective index D_φ(s^(j), x^(j)) imitating it, obtained by inputting the observation signal x^(j) representing the observed sound; and the error ε_φ(y^(j)) between the subjective sound quality evaluation P(s^(j), y^(j)) of the emphasized sound corresponding to the masked sound signal obtained by applying the first mask M(x^(j); θ) to the observation signal x^(j) and the objective index D_φ(s^(j), y^(j)) imitating it, obtained by inputting the emphasized speech signal y^(j) representing the emphasized sound.
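A minimal sketch of the proposed mini-batch cost of equation (8), which additionally fits the pre-enhancement point (s^(j), x^(j)). The per-example terms follow equation (8a); the plain mean over the mini-batch and the d_phi calling convention are assumptions.

```python
def l_d(d_phi, batch_s, batch_x, batch_y, batch_p_ss, batch_p_sx, batch_p_sy):
    # Equation (8) via equation (8a): for each j,
    # L_D(j) = eps_phi(s^(j)) + eps_phi(x^(j)) + eps_phi(y^(j)), then average over the mini-batch.
    total = 0.0
    for s, x, y, p_ss, p_sx, p_sy in zip(batch_s, batch_x, batch_y,
                                         batch_p_ss, batch_p_sx, batch_p_sy):
        eps_s = (p_ss - d_phi(s, s)) ** 2   # error at the perfect-enhancement point P(s, s)
        eps_x = (p_sx - d_phi(s, x)) ** 2   # error at the pre-enhancement point P(s, x)
        eps_y = (p_sy - d_phi(s, y)) ** 2   # error at the current enhanced signal P(s, y)
        total = total + eps_s + eps_x + eps_y
    return total / len(batch_s)             # mini-batch average (convention assumed)
```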
Pre-training can also be used. Although SDR and OSQAS do not completely match, a higher SDR tends to yield a higher OSQAS. Since learning D_φ(s, ·) using the cost function L_D tends to be unstable, for example, the model M_θ is first trained using the cost function L_SDR of equation (2), and then only D_φ(s, ·) is trained using the cost function L_D of equation (8) without updating the model M_θ. After that, as a final adjustment, the parameters φ and θ may be updated using alternating optimization that alternates between the cost function L_P of equation (4a), in which P(s, y) of equation (4) is replaced with D_φ(s, y), and the cost function L_D of equation (8).
 L_P = -E[D_φ(s, y)]_{x,y}    (4a)
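Combining the pieces, the following is a rough training-schedule sketch of these three stages under the same assumptions as the earlier snippets: enhance, l_sdr, and l_d are the helpers sketched above, mask_net plays the role of M_θ, d_phi plays the role of D_φ, and loader and true_osqas (a routine returning the non-differentiable OSQAS P) are hypothetical. Optimizers, learning rates, iteration counts, and batch interfaces are illustrative only.

```python
import torch

def train(mask_net, d_phi, loader, n_pre1=1000, n_pre2=1000, n_alt=1000, lr=1e-4):
    opt_theta = torch.optim.Adam(mask_net.parameters(), lr=lr)   # updates theta
    opt_phi = torch.optim.Adam(d_phi.parameters(), lr=lr)        # updates phi

    # Stage 1: pre-train M_theta with the clipped SDR cost of equation (2).
    for _ in range(n_pre1):
        s, x, n = next(loader)
        y = enhance(x, mask_net)
        loss = l_sdr(s, y, n, x - y).mean()
        opt_theta.zero_grad(); loss.backward(); opt_theta.step()

    # Stage 2: pre-train D_phi with L_D of equation (8) while M_theta stays frozen.
    for _ in range(n_pre2):
        s, x, n = next(loader)
        with torch.no_grad():
            y = enhance(x, mask_net)            # theta is not updated in this stage
        loss = l_d(d_phi, s, x, y, true_osqas(s, s), true_osqas(s, x), true_osqas(s, y))
        opt_phi.zero_grad(); loss.backward(); opt_phi.step()

    # Stage 3: fine-tune by alternating L_P of equation (4a) (updates theta)
    # and L_D of equation (8) (updates phi).
    for _ in range(n_alt):
        s, x, n = next(loader)
        y = enhance(x, mask_net)
        loss_theta = -d_phi(s, y).mean()        # equation (4a): L_P = -E[D_phi(s, y)]
        opt_theta.zero_grad(); loss_theta.backward(); opt_theta.step()

        with torch.no_grad():
            y = enhance(x, mask_net)
        loss_phi = l_d(d_phi, s, x, y, true_osqas(s, s), true_osqas(s, x), true_osqas(s, y))
        opt_phi.zero_grad(); loss_phi.backward(); opt_phi.step()
```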
[First Embodiment]
 Next, the first embodiment of the present invention will be described.
 <Configuration>
 As illustrated in FIG. 1, the learning device 11 of the present embodiment includes storage units 111 and 112, a mask estimation application unit 113a, mask application units 113b and 113c, model application units 114a to 114c, approximation function application units 115a to 115c, gradient calculation units 116a to 116h, parameter update units 117a to 117d, a memory 118, and a control unit 119. The learning device 11 executes each process under the control of the control unit 119, and the data obtained in each process is stored in the memory 118, read out as needed, and used in other processes.
 As illustrated in FIG. 4, the speech enhancement device 12 of the present embodiment includes a model storage unit 120, an input unit 121, a frequency domain conversion unit 122, a mask estimation unit 123, a mask application unit 124, a time domain conversion unit 125, an output unit 126, a control unit 127, and a memory 128. The speech enhancement device 12 executes each process under the control of the control unit 127, and the data obtained in each process is stored in the memory 128, read out as needed, and used in other processes.
<Learning process>
 The learning process of the present embodiment is exemplified below.
 First, training data consisting of target speech signals s^(j) and corresponding observation signals x^(j) is prepared, where j = 1, ..., M and M is an integer of 1 or more. The target speech signals s^(1), ..., s^(M) are stored in the storage unit 111, and the observation signals x^(1), ..., x^(M) are stored in the storage unit 112. Under this premise, the following steps 1, 2, and 3 are executed.
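For illustration, one way to prepare such a pair (s^(j), x^(j)) is to mix a clean utterance with noise at a chosen signal-to-noise ratio; the SNR convention and scaling below are assumptions, not taken from the document.

```python
import numpy as np

def make_pair(clean, noise, snr_db):
    # Build one training pair: target speech s^(j) and observation x^(j) = s^(j) + n^(j),
    # with the noise scaled so the mixture has the requested SNR in dB.
    s = np.asarray(clean, dtype=np.float64)
    n = np.asarray(noise, dtype=np.float64)[: len(s)]
    target_noise_power = np.sum(s ** 2) / (10.0 ** (snr_db / 10.0))
    n = n * np.sqrt(target_noise_power / (np.sum(n ** 2) + 1e-12))
    return s, s + n
```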
≪Step 1: Pre-training of the model M_θ≫
 In step 1, the model M_θ is pre-trained. The process of step 1 will be described below with reference to FIG. 2.
 First, the control unit 119 sets the parameter θ of the model M_θ to an initial value (step S119aa).
 Next, the control unit 119 initializes i = 1 (step S119ab).
 Next, the model application unit 114a extracts the observation signal x^(i) from the storage unit 112, applies the model M_θ to the observation signal x^(i), and obtains and outputs the mask M(x^(i); θ) (step S114a).
The mask M(x^(i); θ) is input to the gradient calculation unit 116a. The gradient calculation unit 116a extracts the observation signal x^(i) from the storage unit 111 and obtains and outputs the gradient ∂_φ L_SDR^M(i), where the following equations (1a) and (2a) are satisfied and m^(i) = x^(i) - y^(i) (step S116a).
 y^(i) = Q+(M(x^(i); θ) ◎ Q(x^(i)))    (1a)
 L_SDR^M(i) = -(clip_β[SDR(s^(i), y^(i))] + clip_β[SDR(n^(i), m^(i))]) / 2    (2a)
 The control unit 119 determines whether i = N, where N is a positive integer less than or equal to M, for example N = 5 (step S119b). If i is not equal to N, i + 1 is set as the new i and the process returns to step S114a.
 On the other hand, if i = N, the gradients ∂_φ L_SDR^M(1), ..., ∂_φ L_SDR^M(N) are input to the gradient calculation unit 116b. The gradient calculation unit 116b uses the gradients ∂_φ L_SDR^M(1), ..., ∂_φ L_SDR^M(N) to obtain and output the gradient ∂_φ L_SDR^M, where ∂_φ L_SDR^M is a function value of ∂_φ L_SDR^M(1), ..., ∂_φ L_SDR^M(N), for example ∂_φ L_SDR^M = (∂_φ L_SDR^M(1) + ... + ∂_φ L_SDR^M(N)) / N (step S116b).
 The gradient ∂_φ L_SDR^M is input to the parameter update unit 117a. The parameter update unit 117a updates and outputs the parameter θ by a gradient method using the gradient ∂_φ L_SDR^M; that is, the parameter θ is updated and output so as to minimize L_SDR^M = (L_SDR^M(1) + ... + L_SDR^M(N)) / N (step S117a).
 The control unit 119 determines whether a convergence condition is satisfied. Examples of the convergence condition include that the processing of steps S114a, S116a, S119b, S116b, and S117a has been repeated a certain number of times, and that the amounts of change in θ and L_SDR^M are equal to or less than predetermined values. If the convergence condition is not satisfied, the process returns to step S119ab. If the convergence condition is satisfied, the parameter update unit 117a outputs the parameter θ and the process of step 1 ends (step S119c).
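A minimal sketch of the step 1 loop just described (steps S114a to S119c): per-example values L_SDR^M(i) are averaged over N examples, one gradient update of θ is made, and the loop repeats until a simple convergence check passes. It reuses the enhance and l_sdr helpers sketched earlier; N, the optimizer, the data format, and the convergence tolerance are illustrative assumptions.

```python
import torch

def pretrain_mask_model(mask_net, data, N=5, lr=1e-4, tol=1e-4, max_rounds=10000):
    # data: list of (s, x, n) tensors; only the first N examples are used per round.
    opt = torch.optim.SGD(mask_net.parameters(), lr=lr)
    prev = None
    for _ in range(max_rounds):
        losses = []
        for s, x, n in data[:N]:                      # i = 1, ..., N
            y = enhance(x, mask_net)                  # equation (1a)
            losses.append(l_sdr(s, y, n, x - y))      # L_SDR^M(i) of equation (2a)
        loss = torch.stack(losses).mean()             # L_SDR^M: average over the N examples
        opt.zero_grad(); loss.backward(); opt.step()  # parameter update of theta (step S117a)
        if prev is not None and abs(prev - loss.item()) <= tol:
            break                                     # convergence check on the change in L_SDR^M
        prev = loss.item()
    return mask_net
```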
≪Step 2: Pre-training of the approximation function D_φ(s, ·)≫
 In step 2, the approximation function D_φ(s, ·) is pre-trained. The process of step 2 will be described below with reference to FIG. 2.
 First, the control unit 119 sets the parameter φ of the approximation function D_φ(s, ·) to an initial value (step S119da).
 Next, the control unit 119 initializes j = 1 (step S119db).
 The parameter θ obtained in step 1 is input to the mask estimation application unit 113a. The mask estimation application unit 113a extracts the observation signal x^(j) from the storage unit 112 and applies the model M_θ to the observation signal x^(j) to obtain the mask M(x^(j); θ). Further, the mask estimation application unit 113a applies the mask M(x^(j); θ) to the observation signal x^(j) to obtain and output the emphasized speech signal y^(j) as in equation (1b) (step S113a).
 y^(j) = Q+(M(x^(j); θ) ◎ Q(x^(j)))    (1b)
 The observation signal x^(j) and the emphasized speech signal y^(j) are input to the approximation function application unit 115a. The approximation function application unit 115a further extracts the target speech signal s^(j) from the storage unit 111. The approximation function application unit 115a inputs s^(j), x^(j), and y^(j) into the approximation function D_φ(s, ·) and obtains and outputs D_φ(s^(j), s^(j)), D_φ(s^(j), x^(j)), and D_φ(s^(j), y^(j)) (step S115a).
 D_φ(s^(j), s^(j)), D_φ(s^(j), x^(j)), and D_φ(s^(j), y^(j)) are input to the gradient calculation unit 116c. The calculated values P(s^(j), s^(j)), P(s^(j), x^(j)), and P(s^(j), y^(j)) are also input to the gradient calculation unit 116c. The gradient calculation unit 116c then obtains the error ε_φ(s^(j)) between P(s^(j), s^(j)) and D_φ(s^(j), s^(j)), the error ε_φ(x^(j)) between P(s^(j), x^(j)) and D_φ(s^(j), x^(j)), and the error ε_φ(y^(j)) between P(s^(j), y^(j)) and D_φ(s^(j), y^(j)), and obtains and outputs the gradient ∂_φ L_D^(j) according to equation (8a) (step S116c).
 L_D^(j) = ε_φ(s^(j)) + ε_φ(x^(j)) + ε_φ(y^(j))    (8a)
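 In terms of the sketch above, the per-example cost of equation (8a) can be written as follows. The squared-error form of ε_φ is an assumption made for illustration; the embodiment only requires some error between the calculated score P and the approximation D_φ.

    def loss_D(D_phi, s, x, y, P_ss, P_sx, P_sy):
        # Equation (8a): L_D = eps_phi(s) + eps_phi(x) + eps_phi(y),
        # with each eps taken here as a squared error between P and D_phi.
        eps_s = (P_ss - D_phi(s, s)) ** 2
        eps_x = (P_sx - D_phi(s, x)) ** 2
        eps_y = (P_sy - D_phi(s, y)) ** 2
        return eps_s + eps_x + eps_y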
 The control unit 119 determines whether or not j = M; for example, M = 5 (step S119e). If j = M does not hold, j + 1 is set as the new j and the process returns to step S113a.
 On the other hand, if j = M, the gradients ∂_φ L_D^(1), …, ∂_φ L_D^(M) are input to the gradient calculation unit 116d. The gradient calculation unit 116d obtains and outputs the gradient ∂_φ L_D using ∂_φ L_D^(1), …, ∂_φ L_D^(M). Here, the gradient ∂_φ L_D is a function value of ∂_φ L_D^(1), …, ∂_φ L_D^(M); for example, ∂_φ L_D = (∂_φ L_D^(1) + … + ∂_φ L_D^(M)) / M (step S116d).
 The gradient ∂_φ L_D is input to the parameter update unit 117b. The parameter update unit 117b updates the parameter φ by a gradient method using the gradient ∂_φ L_D. That is, the parameter update unit 117b updates and outputs the parameter φ so as to minimize L_D of equation (8) (step S117b).
 The control unit 119 determines whether or not a convergence condition is satisfied. Examples of the convergence condition include that the processing of steps S115a, S116c, S119e, S116d, and S117b has been repeated a fixed number of times, or that the amount of change in φ or L_D is equal to or less than a predetermined value. If the convergence condition is not satisfied, the process returns to step S119db. If the convergence condition is satisfied, the parameter update unit 117b outputs the parameter φ, and the processing of step 2 ends (step S119f).
 Note that step 2 corresponds to an approximation function learning step of updating a first approximation function D_φ(s^(j), ·), which approximates an objective index imitating a subjective sound quality evaluation P(s^(j), ·) of its input, so as to minimize a first cost function L_D based on the sum of: the error ε_φ(s^(j)) between the subjective sound quality evaluation P(s^(j), s^(j)) of the target sound and the objective index D_φ(s^(j), s^(j)) imitating the subjective sound quality evaluation of the target sound, obtained by inputting the target speech signal s^(j) representing the target sound into the first approximation function D_φ(s^(j), ·); the error ε_φ(x^(j)) between the subjective sound quality evaluation P(s^(j), x^(j)) of the observed sound based on the target sound and the objective index D_φ(s^(j), x^(j)) imitating the subjective sound quality evaluation of the observed sound, obtained by inputting the observation signal x^(j) representing the observed sound into the first approximation function D_φ(s^(j), ·); and the error ε_φ(y^(j)) between the subjective sound quality evaluation P(s^(j), y^(j)) of the emphasized sound corresponding to the post-mask sound signal obtained by applying the first mask M(x^(j); θ) to the observation signal x^(j) and the objective index D_φ(s^(j), y^(j)) imitating the subjective sound quality evaluation of the emphasized sound, obtained by inputting the emphasized speech signal y^(j) representing the emphasized sound into the first approximation function D_φ(s^(j), ·); thereby obtaining a second approximation function D_φ(s^(j), ·). In this example, the first mask M(x^(j); θ) is obtained by applying the first model M_θ to the observation signal x^(j), and the approximation function learning step is a step of updating the first approximation function D_φ(s^(j), ·) to obtain the second approximation function D_φ(s^(j), ·) so as to minimize the first cost function L_D without updating the first model M_θ.
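 The whole of step 2 can therefore be sketched as the following loop, in which the mask model from step 1 is kept fixed and only φ is updated so that D_φ tracks the calculated scores P. The data iterator minibatch, the optimizer choice, and the loop bound are assumptions made for this sketch; loss_D and enhance are the helpers sketched above.

    opt_phi = torch.optim.Adam(D_phi.parameters(), lr=1e-4)  # assumed optimizer for phi

    for _ in range(num_pretrain_steps):                # until the convergence condition of step S119f
        opt_phi.zero_grad()
        total = 0.0
        for s, x, P_ss, P_sx, P_sy in minibatch(M=5):  # M examples (steps S113a to S119e)
            with torch.no_grad():                      # theta is not updated in step 2
                y = enhance(x, mask_net)
            total = total + loss_D(D_phi, s, x, y, P_ss, P_sx, P_sy)
        (total / 5).backward()                         # average over the M examples (step S116d)
        opt_phi.step()                                 # update phi only (step S117b)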
 << Step 3: Learning processing of the model M_θ and the approximation function D_φ(s, ·) >>
 In the final step 3, learning is performed in which the model M_θ and the approximation function D_φ(s, ·) are updated alternately. In step 3, for example, N = 5 and M = 10. The processing of step 3 is described below with reference to FIG. 3.
 First, the control unit 119 sets the parameters θ and φ obtained in steps 1 and 2 as the initial values (step S119ga).
 Next, the control unit 119 initializes i and j to i = 1 and j = 1 (step S119gb).
 Next, the model application unit 114b extracts the observation signal x^(i) from the storage unit 112, applies the model M_θ to the observation signal x^(i), and obtains and outputs the mask M(x^(i); θ) (step S114b).
 The mask M(x^(i); θ) is input to the mask application unit 113b. Q(x^(i)) is further input to the mask application unit 113b, and the mask application unit 113b obtains and outputs the emphasized speech signal y^(i) according to equation (1a) (step S113b).
 The emphasized speech signal y^(i) is input to the approximation function application unit 115b. The approximation function application unit 115b further extracts the target speech signal s^(i) from the storage unit 111. The approximation function application unit 115b inputs s^(i) and y^(i) into the approximation function D_φ(s, ·) to obtain and output D_φ(s^(i), y^(i)) (step S115b).
 D_φ(s^(i), y^(i)) is input to the gradient calculation unit 116e. The gradient calculation unit 116e obtains and outputs the gradient ∂_θ L_M^(i), where L_M^(i) satisfies the following equation (4b) (step S116e).
 L_M^(i) = -E[ D_φ(s^(i), y^(i)) ]_(x,y)    (4b)
 The control unit 119 determines whether or not i = N (step S119h). If i = N does not hold, i + 1 is set as the new i and the process returns to step S114b.
 On the other hand, if i = N, the gradients ∂_θ L_M^(1), …, ∂_θ L_M^(N) are input to the gradient calculation unit 116f. The gradient calculation unit 116f obtains and outputs the gradient ∂_θ L_M using ∂_θ L_M^(1), …, ∂_θ L_M^(N). Here, ∂_θ L_M is a function value of ∂_θ L_M^(1), …, ∂_θ L_M^(N); for example, ∂_θ L_M = (∂_θ L_M^(1) + … + ∂_θ L_M^(N)) / N (step S116f).
 The gradient ∂_θ L_M is input to the parameter update unit 117c. The parameter update unit 117c updates and outputs the parameter θ by a gradient method using the gradient ∂_θ L_M. That is, the parameter update unit 117c updates and outputs the parameter θ so as to minimize L_M = (L_M^(1) + … + L_M^(N)) / N. The updated parameter θ is input to the model application units 114b and 114c (step S117c).
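 Steps S114b to S117c amount to updating θ so that the learned approximator scores the enhanced output as highly as possible, i.e. minimizing equation (4b). A sketch of this half of step 3 is shown below, under the same assumptions as the sketches above (minibatch, mask_net, and D_phi are assumed names); the gradient is taken with respect to θ, which is what the update of θ requires.

    opt_theta = torch.optim.Adam(mask_net.parameters(), lr=1e-4)  # assumed optimizer for theta

    opt_theta.zero_grad()
    L_M = 0.0
    for s, x in minibatch(N=5):          # N examples (steps S114b to S119h)
        y = enhance(x, mask_net)         # gradients flow back into theta here
        L_M = L_M - D_phi(s, y)          # equation (4b): L_M = -E[ D_phi(s, y) ]
    (L_M / 5).backward()                 # average over the N examples (step S116f)
    opt_theta.step()                     # update theta only (step S117c); phi is left untouched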
 Next, the model application unit 114c extracts the observation signal x^(j) from the storage unit 112, applies the model M_θ to the observation signal x^(j), and obtains and outputs the mask M(x^(j); θ) (step S114c).
 The mask M(x^(j); θ) is input to the mask application unit 113c. Q(x^(j)) is further input to the mask application unit 113c, and the mask application unit 113c obtains and outputs the emphasized speech signal y^(j) according to the following equation (1a') (step S113c).
 y^(j) = Q+( M(x^(j); θ) ◎ Q(x^(j)) )    (1a')
 The emphasized speech signal y^(j) is input to the approximation function application unit 115c. The approximation function application unit 115c further extracts the target speech signal s^(j) from the storage unit 111 and extracts the observation signal x^(j) from the storage unit 112. The approximation function application unit 115c inputs s^(j), x^(j), and y^(j) into the approximation function D_φ(s, ·) to obtain and output D_φ(s^(j), s^(j)), D_φ(s^(j), x^(j)), and D_φ(s^(j), y^(j)) (step S115c).
 D_φ(s^(j), s^(j)), D_φ(s^(j), x^(j)), and D_φ(s^(j), y^(j)) are input to the gradient calculation unit 116g. The calculated values P(s^(j), s^(j)), P(s^(j), x^(j)), and P(s^(j), y^(j)) are also input to the gradient calculation unit 116g. The gradient calculation unit 116g then obtains the error ε_φ(s^(j)) between P(s^(j), s^(j)) and D_φ(s^(j), s^(j)), the error ε_φ(x^(j)) between P(s^(j), x^(j)) and D_φ(s^(j), x^(j)), and the error ε_φ(y^(j)) between P(s^(j), y^(j)) and D_φ(s^(j), y^(j)), and obtains and outputs the gradient ∂_φ L_D^(j) (equation (8a)) (step S116g).
 The control unit 119 determines whether or not j = M (step S119i). If j = M does not hold, j + 1 is set as the new j and the process returns to step S114c.
 On the other hand, if j = M, the gradients ∂_φ L_D^(1), …, ∂_φ L_D^(M) are input to the gradient calculation unit 116h. The gradient calculation unit 116h obtains and outputs the gradient ∂_φ L_D using ∂_φ L_D^(1), …, ∂_φ L_D^(M) (step S116h).
 The gradient ∂_φ L_D is input to the parameter update unit 117f. The parameter update unit 117f updates the parameter φ by a gradient method using the gradient ∂_φ L_D. That is, the parameter update unit 117f updates and outputs the parameter φ so as to minimize L_D of equation (8). The parameter φ is input to the approximation function application units 115b and 115c (step S117d).
 The control unit 119 determines whether or not a convergence condition is satisfied. Examples of the convergence condition include that the processing of steps S114b, S113b, S115b, S116e, S119h, S116f, S117c, S114c, S113c, S115c, S116g, S119i, S116h, and S117d has been repeated a fixed number of times, or that the amounts of change in θ, φ, L_M, and L_D are equal to or less than predetermined values. If the convergence condition is satisfied, the parameter update unit 117c outputs the parameter θ, the parameter update unit 117d outputs the parameter φ, and the processing of step 3 ends (step S119j).
 Note that the processing of steps S114c, S113c, S115c, S116g, S119i, S116h, and S117d corresponds to the approximation function learning step of updating the first approximation function D_φ(s^(j), ·), which approximates an objective index imitating a subjective sound quality evaluation P(s^(j), ·) of its input, so as to minimize the first cost function L_D based on the sum of: the error ε_φ(s^(j)) between the subjective sound quality evaluation P(s^(j), s^(j)) of the target sound and the objective index D_φ(s^(j), s^(j)) imitating the subjective sound quality evaluation of the target sound, obtained by inputting the target speech signal s^(j) representing the target sound into the first approximation function D_φ(s^(j), ·); the error ε_φ(x^(j)) between the subjective sound quality evaluation P(s^(j), x^(j)) of the observed sound based on the target sound and the objective index D_φ(s^(j), x^(j)) imitating the subjective sound quality evaluation of the observed sound, obtained by inputting the observation signal x^(j) representing the observed sound into the first approximation function D_φ(s^(j), ·); and the error ε_φ(y^(j)) between the subjective sound quality evaluation P(s^(j), y^(j)) of the emphasized sound corresponding to the post-mask sound signal obtained by applying the first mask M(x^(j); θ) to the observation signal x^(j) and the objective index D_φ(s^(j), y^(j)) imitating the subjective sound quality evaluation of the emphasized sound, obtained by inputting the emphasized speech signal y^(j) representing the emphasized sound into the first approximation function D_φ(s^(j), ·); thereby obtaining the second approximation function D_φ(s^(j), ·). In this example, the first mask M(x^(j); θ) is obtained by applying the first model M_θ to the observation signal x^(j), and the approximation function learning step is a step of updating the first approximation function D_φ(s^(j), ·) to obtain the second approximation function D_φ(s^(j), ·) so as to minimize the first cost function L_D without updating the first model M_θ. The processing of steps S114b, S113b, S115b, S116e, S119h, S116f, and S117c updates the first model M_θ to obtain a second model M_θ so as to minimize the second cost function L_M^(i) based on an expected value of the second objective index D_φ(s^(i), y^(i)) imitating the subjective sound quality evaluation of the emphasized sound, obtained by inputting the emphasized speech signal y^(i) representing the emphasized sound into the second approximation function D_φ(s^(i), ·).
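 Putting the two halves together, step 3 alternates the θ update of equation (4b) with the φ update of equation (8a) until the convergence condition of step S119j is met. In terms of the sketches above (wrapped here into two hypothetical helpers, update_theta and update_phi), the outer loop looks like this:

    for _ in range(num_outer_steps):           # until the convergence condition of step S119j
        update_theta(mask_net, D_phi, N=5)     # steps S114b to S117c: minimize L_M with respect to theta
        update_phi(mask_net, D_phi, M=10)      # steps S114c to S117d: minimize L_D with respect to phi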
 <Speech enhancement processing>
 The information specifying the model M_θ and the approximation function D_φ(s, ·) learned as described above is stored in the model storage unit 120 of the speech enhancement device 12 (FIG. 4). For example, the parameters θ and φ output in step S119j are stored in the model storage unit 120. Under this premise, the following speech enhancement processing is executed.
 As illustrated in FIG. 5, an observation signal x, which is a time-series acoustic signal in the time domain, is input to the input unit 121 of the speech enhancement device 12 (FIG. 4) (step S121).
 The observation signal x is input to the frequency domain conversion unit 122. The frequency domain conversion unit 122 obtains and outputs an observation signal X = Q(x), which is the time-frequency domain representation of the observation signal x, by a frequency domain conversion process Q such as the short-time Fourier transform (step S122).
 The observation signal x is input to the mask estimation unit 123. The mask estimation unit 123 applies the model M_θ to the observation signal x to estimate and output the T-F mask M(x; θ) (step S123).
 The observation signal X and the T-F mask M(x; θ) are input to the mask application unit 124. The mask application unit 124 applies (multiplies) the T-F mask M(x; θ) to the observation signal X in the time-frequency domain to obtain and output the post-mask speech signal M(x; θ) ◎ X (step S124).
 The post-mask speech signal M(x; θ) ◎ X is input to the time domain conversion unit 125. The time domain conversion unit 125 applies a time domain conversion process Q+ such as the inverse STFT to the post-mask speech signal M(x; θ) ◎ X to obtain and output the emphasized speech y in the time domain (equation (1)) (step S126).
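 At enhancement time only the learned mask model M_θ is used; the approximation function D_φ plays no role. A minimal sketch of steps S121 to S126, reusing the enhance helper from the sketch above (load_waveform and save_waveform are hypothetical I/O helpers, not part of the embodiment):

    x = load_waveform("observed.wav")     # observed signal x (step S121)
    with torch.no_grad():
        y = enhance(x, mask_net)          # steps S122 to S124 and the inverse STFT (step S126)
    save_waveform("enhanced.wav", y)      # emphasized speech y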
 [Validation of effectiveness]
 To verify the effectiveness of the present invention, experiments were conducted using a public speech enhancement data set (Non-Patent Document 4). FIG. 7 shows the results of performing the learning processing of step 3 on the model M_θ and the approximation function D_φ(s, ·) pre-trained in steps 1 and 2, using different random number seeds. Here, PESQ was used as the OSQAS. As can be seen from FIG. 7, PESQ improves steadily as learning progresses. Comparing the PESQ values after learning, the prior art of Non-Patent Document 3 achieved a PESQ of 2.86, whereas the method of the present embodiment achieved 2.93. From this, it can be said that the present method is effective for learning DNN speech enhancement using an OSQAS.
 [Hardware configuration]
 The learning device 11 and the speech enhancement device 12 in each embodiment are devices configured by, for example, a general-purpose or dedicated computer including a processor (hardware processor) such as a CPU (central processing unit) and memories such as a RAM (random-access memory) and a ROM (read-only memory) executing a predetermined program. This computer may include a single processor and memory, or a plurality of processors and memories. The program may be installed in the computer or may be recorded in advance in a ROM or the like. Some or all of the processing units may be configured not by an electronic circuit (circuitry) such as a CPU that realizes a functional configuration by reading a program, but by an electronic circuit that realizes the processing functions on its own. An electronic circuit constituting a single device may include a plurality of CPUs.
 FIG. 8 is a block diagram illustrating the hardware configuration of the learning device 11 and the speech enhancement device 12 in each embodiment. As illustrated in FIG. 8, the learning device 11 and the speech enhancement device 12 of this example include a CPU (Central Processing Unit) 10a, an output unit 10b, an output unit 10c, a RAM (Random Access Memory) 10d, a ROM (Read Only Memory) 10e, an auxiliary storage device 10f, and a bus 10g. The CPU 10a of this example includes a control unit 10aa, an arithmetic unit 10ab, and a register 10ac, and executes various arithmetic processes in accordance with various programs read into the register 10ac. The output unit 10b is an output terminal, a display, or the like from which data is output. The output unit 10c is a LAN card or the like controlled by the CPU 10a that has read a predetermined program. The RAM 10d is an SRAM (Static Random Access Memory), a DRAM (Dynamic Random Access Memory), or the like, and has a program area 10da in which a predetermined program is stored and a data area 10db in which various data are stored. The auxiliary storage device 10f is, for example, a hard disk, an MO (Magneto-Optical disc), a semiconductor memory, or the like, and has a program area 10fa in which a predetermined program is stored and a data area 10fb in which various data are stored. The bus 10g connects the CPU 10a, the output unit 10b, the output unit 10c, the RAM 10d, the ROM 10e, and the auxiliary storage device 10f so that information can be exchanged among them. In accordance with a read OS (Operating System) program, the CPU 10a writes the program stored in the program area 10fa of the auxiliary storage device 10f to the program area 10da of the RAM 10d. Similarly, the CPU 10a writes the various data stored in the data area 10fb of the auxiliary storage device 10f to the data area 10db of the RAM 10d. The addresses on the RAM 10d at which the program and data have been written are stored in the register 10ac of the CPU 10a. The control unit 10aa of the CPU 10a sequentially reads out these addresses stored in the register 10ac, reads the program and data from the areas on the RAM 10d indicated by the read addresses, causes the arithmetic unit 10ab to sequentially execute the operations indicated by the program, and stores the operation results in the register 10ac. With such a configuration, the functional configurations of the learning device 11 and the speech enhancement device 12 are realized.
 The above-described program can be recorded on a computer-readable recording medium. An example of a computer-readable recording medium is a non-transitory recording medium. Examples of such a recording medium include a magnetic recording device, an optical disc, a magneto-optical recording medium, and a semiconductor memory.
 This program is distributed, for example, by selling, transferring, or lending a portable recording medium such as a DVD or a CD-ROM on which the program is recorded. The program may also be distributed by storing it in a storage device of a server computer and transferring it from the server computer to another computer via a network. As described above, a computer that executes such a program first stores, for example, the program recorded on the portable recording medium or the program transferred from the server computer in its own storage device. When executing the processing, the computer reads the program stored in its own storage device and executes the processing in accordance with the read program. As another form of execution, the computer may read the program directly from the portable recording medium and execute the processing in accordance with the program, or each time the program is transferred from the server computer to the computer, the computer may sequentially execute the processing in accordance with the received program. The above-described processing may also be executed by a so-called ASP (Application Service Provider) type service that realizes the processing functions only through execution instructions and result acquisition, without transferring the program from the server computer to the computer. The program in this embodiment includes information that is used for processing by an electronic computer and is equivalent to a program (such as data that is not a direct command to the computer but has a property of defining the processing of the computer).
 In each embodiment, the present device is configured by executing a predetermined program on a computer, but at least a part of the processing contents may be realized by hardware.
 [Modifications]
 The present invention is not limited to the above-described embodiment. For example, the OSQAS is not limited to PESQ, and may be any value as long as it is an objective index imitating a human subjective sound quality evaluation.
 In step 3, the model M_θ was trained first, but the approximation function D_φ(s, ·) may instead be trained first in step 3. Although a DNN was used in the above-described embodiment, other models such as a probabilistic model may be used.
 The various kinds of processing described above are not only executed in time series in accordance with the description, but may also be executed in parallel or individually according to the processing capability of the device executing the processing or as needed. It goes without saying that other modifications can be made as appropriate without departing from the spirit of the present invention.
11 Learning device
12 Speech enhancement device

Claims (8)

  1.  A learning method comprising:
    an approximation function learning step of updating a first approximation function to obtain a second approximation function so as to minimize a first cost function based on a sum of
    an error between a subjective sound quality evaluation of a target sound and an objective index imitating the subjective sound quality evaluation of the target sound, the objective index being obtained by inputting a target speech signal representing the target sound into the first approximation function, which approximates an objective index imitating a subjective sound quality evaluation of an input,
    an error between a subjective sound quality evaluation of an observed sound based on the target sound and an objective index imitating the subjective sound quality evaluation of the observed sound, the objective index being obtained by inputting an observation signal representing the observed sound into the first approximation function, and
    an error between a subjective sound quality evaluation of an emphasized sound corresponding to a post-mask sound signal obtained by applying a first mask to the observation signal and an objective index imitating the subjective sound quality evaluation of the emphasized sound, the objective index being obtained by inputting an emphasized speech signal representing the emphasized sound into the first approximation function.
  2.  The learning method according to claim 1, wherein
    the first mask is obtained by applying a first model to the observation signal, and
    the approximation function learning step is a step of updating the first approximation function to obtain the second approximation function so as to minimize the first cost function without updating the first model.
  3.  The learning method according to claim 2, further comprising:
    a mask learning step of updating the first model to obtain a second model so as to minimize a second cost function based on an expected value of a second objective index imitating the subjective sound quality evaluation of the emphasized sound, the second objective index being obtained by inputting the emphasized speech signal representing the emphasized sound into the second approximation function.
  4.  A speech enhancement method comprising:
    a mask estimation step of estimating a mask by applying an observed sound to the second model of claim 3; and
    a mask application step of applying the mask to the observed sound to obtain a post-mask speech signal.
  5.  A learning device comprising:
    an approximation function learning unit that updates a first approximation function to obtain a second approximation function so as to minimize a first cost function based on a sum of
    an error between a subjective sound quality evaluation of a target sound and an objective index imitating the subjective sound quality evaluation of the target sound, the objective index being obtained by inputting a target speech signal representing the target sound into the first approximation function, which approximates an objective index imitating a subjective sound quality evaluation of an input,
    an error between a subjective sound quality evaluation of an observed sound based on the target sound and an objective index imitating the subjective sound quality evaluation of the observed sound, the objective index being obtained by inputting an observation signal representing the observed sound into the first approximation function, and
    an error between a subjective sound quality evaluation of an emphasized sound corresponding to a post-mask sound signal obtained by applying a first mask to the observation signal and an objective index imitating the subjective sound quality evaluation of the emphasized sound, the objective index being obtained by inputting an emphasized speech signal representing the emphasized sound into the first approximation function.
  6.  A speech enhancement device comprising:
    a mask estimation unit that estimates a mask by applying an observed sound to the second model of claim 3; and
    a mask application unit that applies the mask to the observed sound to obtain a post-mask speech signal.
  7.  A program for causing a computer to execute the learning method according to any one of claims 1 to 3.
  8.  A program for causing a computer to execute the speech enhancement method according to claim 4.
PCT/JP2020/002270 2020-01-23 2020-01-23 Learning device, speech emphasis device, methods therefor, and program WO2021149213A1 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
PCT/JP2020/002270 WO2021149213A1 (en) 2020-01-23 2020-01-23 Learning device, speech emphasis device, methods therefor, and program

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/JP2020/002270 WO2021149213A1 (en) 2020-01-23 2020-01-23 Learning device, speech emphasis device, methods therefor, and program

Publications (1)

Publication Number Publication Date
WO2021149213A1 true WO2021149213A1 (en) 2021-07-29

Family

ID=76993302

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/JP2020/002270 WO2021149213A1 (en) 2020-01-23 2020-01-23 Learning device, speech emphasis device, methods therefor, and program

Country Status (1)

Country Link
WO (1) WO2021149213A1 (en)

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JPH0689095A (en) * 1992-09-08 1994-03-29 Nippon Telegr & Teleph Corp <Ntt> Acoustic signal selector
JP2006313181A (en) * 2005-05-06 2006-11-16 Nissan Motor Co Ltd Voice input device and voice input method

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JPH0689095A (en) * 1992-09-08 1994-03-29 Nippon Telegr & Teleph Corp <Ntt> Acoustic signal selector
JP2006313181A (en) * 2005-05-06 2006-11-16 Nissan Motor Co Ltd Voice input device and voice input method

Similar Documents

Publication Publication Date Title
Magron et al. Model-based STFT phase recovery for audio source separation
JP6623376B2 (en) Sound source enhancement device, its method, and program
Pigoli et al. The statistical analysis of acoustic phonetic data: Exploring differences between spoken Romance languages
Hwang et al. LP-WaveNet: Linear prediction-based WaveNet speech synthesis
WO2020045313A1 (en) Mask estimation device, mask estimation method, and mask estimation program
Llombart et al. Progressive loss functions for speech enhancement with deep neural networks
JP6721165B2 (en) Input sound mask processing learning device, input data processing function learning device, input sound mask processing learning method, input data processing function learning method, program
Ackermann et al. Comparative evaluation of interpolation methods for the directivity of musical instruments
Ben Kheder et al. Robust speaker recognition using map estimation of additive noise in i-vectors space
WO2021149213A1 (en) Learning device, speech emphasis device, methods therefor, and program
Yang et al. Don’t separate, learn to remix: End-to-end neural remixing with joint optimization
JP4981579B2 (en) Error correction model learning method, apparatus, program, and recording medium recording the program
JP2007304445A (en) Repair-extraction method of frequency component, repair-extraction device of frequency component, repair-extraction program of frequency component, and recording medium which records repair-extraction program of frequecy component
GB2622654A (en) Patched multi-condition training for robust speech recognition
Gabrielli et al. A multi-stage algorithm for acoustic physical model parameters estimation
Lü et al. Feature compensation based on independent noise estimation for robust speech recognition
US11676619B2 (en) Noise spatial covariance matrix estimation apparatus, noise spatial covariance matrix estimation method, and program
US20200273480A1 (en) Sound source separating device, sound source separating method, and program
JP7264282B2 (en) Speech enhancement device, learning device, method thereof, and program
JP7156064B2 (en) Latent variable optimization device, filter coefficient optimization device, latent variable optimization method, filter coefficient optimization method, program
Moliner et al. Zero-shot blind audio bandwidth extension
Li et al. Robust Non‐negative matrix factorization with β‐divergence for speech separation
WO2021255925A1 (en) Target sound signal generation device, target sound signal generation method, and program
Sustek et al. Dealing with Unknowns in Continual Learning for End-to-end Automatic Speech Recognition.
Paniagua-Peñaranda et al. Assessing the robustness of recurrent neural networks to enhance the spectrum of reverberated speech

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 20915634

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

NENP Non-entry into the national phase

Ref country code: JP

122 Ep: pct application non-entry in european phase

Ref document number: 20915634

Country of ref document: EP

Kind code of ref document: A1