WO2021149213A1 - Learning device, speech emphasis device, methods therefor, and program - Google Patents

Learning device, speech emphasis device, methods therefor, and program

Info

Publication number
WO2021149213A1
Authority
WO
WIPO (PCT)
Prior art keywords
sound
quality evaluation
subjective
mask
function
Prior art date
Application number
PCT/JP2020/002270
Other languages
French (fr)
Japanese (ja)
Inventor
Yuma Koizumi (悠馬 小泉)
Original Assignee
Nippon Telegraph and Telephone Corporation (日本電信電話株式会社)
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Nippon Telegraph and Telephone Corporation (日本電信電話株式会社)
Priority to PCT/JP2020/002270
Publication of WO2021149213A1

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00: Machine learning
    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00: Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/02: Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L21/0208: Noise filtering
    • G10L21/0216: Noise filtering characterised by the method used for estimating noise

Definitions

  • the present invention relates to a speech enhancement technique.
  • T is an integer greater than or equal to 1.
  • the purpose of speech enhancement is to estimate s from x with high accuracy.
  • the observation signal X = Q(x) ∈ C^{F×K} is obtained by expressing the observation signal x in the time-frequency domain via a frequency-domain transform Q such as the short-time Fourier transform.
  • T, F, and K are positive integers: T represents the number of samples of the observation signal x belonging to a predetermined time interval (time length), F represents the number of discrete frequencies belonging to a predetermined band in the time-frequency domain (bandwidth), and K represents the number of discrete times belonging to a predetermined time interval in the time-frequency domain (time length).
  • M(x; θ) ◎ Q(x) represents multiplying Q(x) by the T-F mask M(x; θ).
  • ◎ denotes the Hadamard product.
  • θ is a parameter of the DNN, and is usually learned so as to minimize the signal-to-distortion ratio (SDR) cost L_SDR represented by, for example, equation (2).
  • the reason equation (2) is widely used as a cost function for DNN speech enhancement is that L_SDR is differentiable with respect to θ.
  • DNN training is performed by a gradient method using gradients obtained by the error backpropagation method.
  • ∂_θ is the differential operator with respect to θ. Since ∂_θ L_SDR can be calculated analytically, DNN training can be performed efficiently.
  • λ is a positive constant.
  • in Non-Patent Document 3, a method of approximating P(s, y) with a differentiable function D_φ(s, y) having a parameter φ has been proposed.
  • if D_φ(s, y) is designed as a DNN, for example,
  • D_φ(s, y) is differentiable with respect to y, and
  • ∂_θ D_φ(s, y) can be calculated analytically.
  • L_M(GAN) = (D_φ(s, y) - 1)^2    (7)
  • the problem with the prior art described in Non-Patent Document 3 is the stability of learning. In order for learning to proceed so that the OSQAS of test data improves stably, the approximation of equation (5) needs to be highly accurate. However, with this prior art, OSQAS does not improve stably even as the number of training iterations increases. Therefore, there is still room for improvement in this prior art.
  • the present invention has been made in view of these points, and its purpose is to approximate an objective index that imitates subjective human sound quality evaluation with a differentiable function with high accuracy, and to stabilize the learning of that differentiable function.
  • a second approximation function is obtained by updating a first approximation function, which approximates an objective index imitating the subjective sound quality evaluation of its input, so as to minimize a first cost function based on the sum of: the error between the subjective sound quality evaluation of the target sound and the objective index imitating the subjective sound quality evaluation of the target sound, obtained by inputting a target speech signal representing the target sound into the first approximation function; the error between the subjective sound quality evaluation of the observed sound based on the target sound and the objective index imitating the subjective sound quality evaluation of the observed sound, obtained by inputting an observation signal representing the observed sound into the first approximation function; and the error between the subjective sound quality evaluation of the emphasized sound corresponding to the masked sound signal obtained by applying a first mask to the observation signal and the objective index imitating the subjective sound quality evaluation of the emphasized sound, obtained by inputting an emphasized speech signal representing the emphasized sound into the first approximation function.
  • an objective index that imitates subjective human sound quality evaluation can thereby be approximated by a differentiable function with high accuracy, and the learning of that differentiable function can be stabilized.
  • FIG. 1 is a block diagram illustrating the functional configuration of the learning device of the embodiment.
  • FIG. 2 is a diagram for explaining the learning method of the embodiment.
  • FIG. 3 is a diagram for explaining the learning method of the embodiment.
  • FIG. 4 is a block diagram for explaining the functional configuration of the speech enhancement device of the embodiment.
  • FIG. 5 is a diagram for explaining the speech enhancement method of the embodiment. FIG. 6 is a diagram for intuitively illustrating the learning result of a differentiable function that approximates OSQAS. FIG. 7 is a diagram for explaining the learning results of the embodiment.
  • FIG. 8 is a block diagram for explaining a hardware configuration.
  • the present embodiment provides a method for improving the accuracy of the approximation in equation (5) and for stabilizing the learning of the differentiable function D_φ(s, ·).
  • in the present embodiment, a cost function L_D is used instead of the cost function L_D(GAN). FIG. 6(a) intuitively illustrates D_φ(s, ·) learned so as to minimize the cost function L_D(GAN), and FIG. 6(b) intuitively illustrates D_φ(s, ·) learned so as to minimize the cost function L_D.
  • the solid line represents the true OSQAS, and the dotted and broken lines represent the learned D_φ(s, ·).
  • the horizontal axis represents the amount of noise contained in the input observed sound, and the vertical axis represents the PESQ score.
  • in the prior art, D_φ(s, ·) is learned so as to minimize the GAN cost function L_D(GAN), that is, so as to minimize (i) the error ε_φ(y) with respect to the OSQAS P(s, y) before speech enhancement and (ii) the error ε_φ(s) with respect to the OSQAS P(s, s) under perfect speech enhancement.
  • here, (iii) the error with respect to the OSQAS when speech enhancement fails is not taken into consideration. Therefore, when D_φ(s, ·) is learned by the conventional technique, it can behave like either the dotted line or the broken line in FIG. 6(a).
  • for example, with M as the mini-batch size, the cost value of the cost function L_D of equation (8) is calculated, and φ is learned so as to minimize it.
  • M is a positive integer.
  • j = 1, ..., M.
  • s^(j) is the j-th target speech signal.
  • x^(j) is the j-th observation signal.
  • y^(j) is the emphasized speech signal.
  • s^(j), x^(j), y^(j), and n^(j) are each time-series signals of T samples in the time domain; the target speech signal s^(j) represents the target sound, the observation signal x^(j) represents the observed sound, the emphasized speech signal y^(j) represents the emphasized sound, and the noise signal n^(j) represents noise. That is, the first approximation function D_φ(s^(j), ·), which approximates an objective index imitating the subjective sound quality evaluation P(s^(j), ·) of its input, is updated so as to minimize the first cost function L_D based on the sum of the error ε_φ(s^(j)) between the subjective sound quality evaluation P(s^(j), s^(j)) of the target sound and the objective index D_φ(s^(j), s^(j)) obtained by inputting the target speech signal s^(j), the error ε_φ(x^(j)) between the subjective sound quality evaluation P(s^(j), x^(j)) of the observed sound and the objective index D_φ(s^(j), x^(j)) obtained by inputting the observation signal x^(j), and the error ε_φ(y^(j)) between the subjective sound quality evaluation P(s^(j), y^(j)) of the emphasized sound and the objective index D_φ(s^(j), y^(j)) obtained by inputting the emphasized speech signal y^(j).
  • the learning device 11 of the present embodiment includes storage units 111 and 112, a mask estimation application unit 113a, mask application units 113b and 113c, model application units 114a to 114c, approximation function application units 115a to 115c, gradient calculation units 116a to 116h, parameter update units 117a to 117d, a memory 118, and a control unit 119. The learning device 11 executes each process under the control of the control unit 119, and the data obtained in each process is stored in the memory 118, read out as needed, and used in other processes.
  • the speech enhancement device 12 of the present embodiment includes a model storage unit 120, an input unit 121, a frequency domain conversion unit 122, a mask estimation unit 123, a mask application unit 124, a time domain conversion unit 125, an output unit 126, a control unit 127, and a memory 128.
  • the speech enhancement device 12 executes each process under the control of the control unit 127, and the data obtained in each process is stored in the memory 128, read out as needed, and used in other processes.
  • training data consisting of target speech signals s^(j) and corresponding observation signals x^(j) is prepared.
  • j = 1, ..., M.
  • M is an integer of 1 or more.
  • the target speech signals s^(1), ..., s^(M) are stored in the storage unit 111, and the observation signals x^(1), ..., x^(M) are stored in the storage unit 112. Under this premise, the following steps 1, 2, and 3 are executed.
  • ≪Step 1: Pre-training of the model M_θ≫
  • in step 1, the model M_θ is pre-trained.
  • the control unit 119 sets the parameter θ of the model M_θ to an initial value (step S119aa).
  • the model application unit 114a extracts the observation signal x^(i) from the storage unit 112, applies the model M_θ to the observation signal x^(i), and obtains and outputs the mask M(x^(i); θ) (step S114a).
  • the mask M(x^(i); θ) is input to the gradient calculation unit 116a.
  • the gradient calculation unit 116a extracts the observation signal x^(i) from the storage unit 111 and obtains and outputs the gradient ∂_φ L_SDR^M(i).
  • the gradients ∂_φ L_SDR^M(1), ..., ∂_φ L_SDR^M(N) are input to the gradient calculation unit 116b.
  • the gradient calculation unit 116b uses the gradients ∂_φ L_SDR^M(1), ..., ∂_φ L_SDR^M(N) to obtain and output the gradient ∂_φ L_SDR^M.
  • the gradient ∂_φ L_SDR^M is input to the parameter update unit 117a.
  • the control unit 119 determines whether a convergence condition is satisfied.
  • examples of the convergence condition include that the processing of steps S114a, S116a, S119b, S116b, and S117a has been repeated a certain number of times, and that the amounts of change in θ and L_SDR^M are equal to or less than predetermined values. If the convergence condition is not satisfied, the process returns to step S119ab. If the convergence condition is satisfied, the parameter update unit 117a outputs the parameter θ and the process of step 1 ends (step S119c).
  • ≪Step 2: Pre-training of the approximation function D_φ(s, ·)≫
  • in step 2, the approximation function D_φ(s, ·) is pre-trained.
  • the process of step 2 will be described with reference to FIG. 2.
  • the control unit 119 sets the parameter φ of the approximation function D_φ(s, ·) to an initial value (step S119da).
  • the parameter θ obtained in step 1 is input to the mask estimation application unit 113a.
  • the mask estimation application unit 113a extracts the observation signal x^(j) from the storage unit 112 and applies the model M_θ to the observation signal x^(j) to obtain the mask M(x^(j); θ). Further, the mask estimation application unit 113a applies the mask M(x^(j); θ) to the observation signal x^(j) to obtain and output the emphasized speech signal y^(j) as in equation (1b) (step S113a).
  • y^(j) = Q+(M(x^(j); θ) ◎ Q(x^(j)))    (1b)
  • the observation signal x^(j) and the emphasized speech signal y^(j) are input to the approximation function application unit 115a.
  • the approximation function application unit 115a further extracts the target speech signal s^(j) from the storage unit 111.
  • the approximation function application unit 115a inputs s^(j), x^(j), and y^(j) into the approximation function D_φ(s, ·) and obtains and outputs D_φ(s^(j), s^(j)), D_φ(s^(j), x^(j)), and D_φ(s^(j), y^(j)) (step S115a).
  • D_φ(s^(j), s^(j)), D_φ(s^(j), x^(j)), and D_φ(s^(j), y^(j)) are input to the gradient calculation unit 116c. The separately computed values P(s^(j), s^(j)), P(s^(j), x^(j)), and P(s^(j), y^(j)) are also input to the gradient calculation unit 116c. Next, the gradient calculation unit 116c obtains the error ε_φ(s^(j)) between P(s^(j), s^(j)) and D_φ(s^(j), s^(j)), the error ε_φ(x^(j)) between P(s^(j), x^(j)) and D_φ(s^(j), x^(j)), and the error ε_φ(y^(j)) between P(s^(j), y^(j)) and D_φ(s^(j), y^(j)), and obtains and outputs the gradient ∂_φ L_D(j) according to equation (8a) (step S116c).
  • the gradients ∂_φ L_D(1), ..., ∂_φ L_D(M) are input to the gradient calculation unit 116d.
  • the gradient calculation unit 116d uses the gradients ∂_φ L_D(1), ..., ∂_φ L_D(M) to obtain and output the gradient ∂_φ L_D.
  • the gradient ∂_φ L_D is input to the parameter update unit 117b.
  • the parameter update unit 117b updates the parameter φ by a gradient method using the gradient ∂_φ L_D; that is, the parameter update unit 117b updates and outputs the parameter φ so as to minimize L_D of equation (8) (step S117b).
  • the control unit 119 determines whether a convergence condition is satisfied.
  • examples of the convergence condition include that the processing of steps S115a, S116c, S119e, S116d, and S117b has been repeated a certain number of times, and that the amounts of change in φ and L_D are equal to or less than predetermined values. If the convergence condition is not satisfied, the process returns to step S119db. If the convergence condition is satisfied, the parameter update unit 117b outputs the parameter φ and the process of step 2 ends (step S119f).
  • step 2 corresponds to an approximation function learning step in which the first approximation function D_φ(s^(j), ·), which approximates an objective index imitating the subjective sound quality evaluation P(s^(j), ·) of its input, is updated so as to minimize the first cost function L_D based on the sum of: the error ε_φ(s^(j)) between the subjective sound quality evaluation P(s^(j), s^(j)) of the target sound and the objective index D_φ(s^(j), s^(j)) imitating it, obtained by inputting the target speech signal s^(j) representing the target sound into the first approximation function; the error ε_φ(x^(j)) between the subjective sound quality evaluation P(s^(j), x^(j)) of the observed sound based on the target sound and the objective index D_φ(s^(j), x^(j)) imitating it, obtained by inputting the observation signal x^(j) representing the observed sound; and the error ε_φ(y^(j)) between the subjective sound quality evaluation P(s^(j), y^(j)) of the emphasized sound corresponding to the masked sound signal obtained by applying the first mask M(x^(j); θ) to the observation signal x^(j) and the objective index D_φ(s^(j), y^(j)) imitating it, obtained by inputting the emphasized speech signal y^(j) representing the emphasized sound. The second approximation function D_φ(s^(j), ·) is thereby obtained.
  • in this example, the first mask M(x^(j); θ) is obtained by applying the first model M_θ to the observation signal x^(j), and the approximation function learning step updates the first approximation function D_φ(s^(j), ·), without updating the first model M_θ, so as to minimize the first cost function L_D, thereby obtaining the second approximation function D_φ(s^(j), ·).
  • the control unit 119 sets the parameters θ and φ obtained in steps 1 and 2 as the initial values (step S119ga).
  • the model application unit 114b extracts the observation signal x^(i) from the storage unit 112, applies the model M_θ to the observation signal x^(i), and obtains and outputs the mask M(x^(i); θ) (step S114b).
  • the mask M(x^(i); θ) is input to the mask application unit 113b.
  • Q(x^(i)) is further input to the mask application unit 113b, and the mask application unit 113b obtains and outputs the emphasized speech signal y^(i) according to equation (1a) (step S113b).
  • the emphasized speech signal y^(i) is input to the approximation function application unit 115b.
  • the approximation function application unit 115b further extracts the target speech signal s^(i) from the storage unit 111.
  • the approximation function application unit 115b inputs s^(i) and y^(i) into the approximation function D_φ(s, ·) to obtain and output D_φ(s^(i), y^(i)) (step S115b).
  • D_φ(s^(i), y^(i)) is input to the gradient calculation unit 116e.
  • the gradient calculation unit 116e obtains and outputs the gradient ∂_φ L_M(i).
  • L_M(i) satisfies the following equation (4b) (step S116e).
  • L_M(i) = -E[D_φ(s^(i), y^(i))]_{x,y}    (4b)
  • the gradients ∂_φ L_M(1), ..., ∂_φ L_M(N) are input to the gradient calculation unit 116f.
  • the gradient calculation unit 116f uses the gradients ∂_φ L_M(1), ..., ∂_φ L_M(N) to obtain and output the gradient ∂_φ L_M.
  • the gradient ∂_φ L_M is input to the parameter update unit 117c.
  • the parameter update unit 117c updates the parameter θ by a gradient method using the gradient ∂_φ L_M, and the updated parameter θ is input to the model application units 114b and 114c (step S117c).
  • the model application unit 114c extracts the observation signal x^(j) from the storage unit 112, applies the model M_θ to the observation signal x^(j), and obtains and outputs the mask M(x^(j); θ) (step S114c).
  • the mask M(x^(j); θ) is input to the mask application unit 113c.
  • Q(x^(j)) is further input to the mask application unit 113c, and the mask application unit 113c obtains and outputs the emphasized speech signal y^(j) according to the following equation (1a') (step S113c).
  • y^(j) = Q+(M(x^(j); θ) ◎ Q(x^(j)))    (1a')
  • the emphasized speech signal y^(j) is input to the approximation function application unit 115c. Further, the approximation function application unit 115c extracts the target speech signal s^(j) from the storage unit 111 and extracts the observation signal x^(j) from the storage unit 112. The approximation function application unit 115c inputs s^(j), x^(j), and y^(j) into the approximation function D_φ(s, ·) and obtains and outputs D_φ(s^(j), s^(j)), D_φ(s^(j), x^(j)), and D_φ(s^(j), y^(j)) (step S115c).
  • D_φ(s^(j), s^(j)), D_φ(s^(j), x^(j)), and D_φ(s^(j), y^(j)) are input to the gradient calculation unit 116g. The separately computed values P(s^(j), s^(j)), P(s^(j), x^(j)), and P(s^(j), y^(j)) are also input to the gradient calculation unit 116g. Next, the gradient calculation unit 116g obtains the error ε_φ(s^(j)) between P(s^(j), s^(j)) and D_φ(s^(j), s^(j)), the error ε_φ(x^(j)) between P(s^(j), x^(j)) and D_φ(s^(j), x^(j)), and the error ε_φ(y^(j)) between P(s^(j), y^(j)) and D_φ(s^(j), y^(j)), and obtains and outputs the gradient ∂_φ L_D(j) according to equation (8a) (step S116g).
  • the gradients ∂_φ L_D(1), ..., ∂_φ L_D(M) are input to the gradient calculation unit 116h, which uses them to obtain and output the gradient ∂_φ L_D.
  • the gradient ∂_φ L_D is input to the parameter update unit 117d.
  • the parameter update unit 117d updates the parameter φ by a gradient method using the gradient ∂_φ L_D; that is, the parameter update unit 117d updates and outputs the parameter φ so as to minimize L_D of equation (8).
  • the updated parameter φ is input to the approximation function application units 115b and 115c (step S117d).
  • the control unit 119 determines whether a convergence condition is satisfied.
  • examples of the convergence condition include that the processing of steps S114b, S113b, S115b, S116e, S119h, S116f, S117c, S114c, S113c, S115c, S116g, S119e, S116h, and S117d has been repeated a certain number of times, and that the amounts of change in θ, φ, L_M, and L_D are equal to or less than predetermined values. If the convergence condition is satisfied, the parameter update unit 117c outputs the parameter θ, the parameter update unit 117d outputs the parameter φ, and the process of step 3 ends (step S119j).
  • the processing of steps S114c, S113c, S115c, S116g, S119e, S116h, and S117d corresponds to an approximation function learning step in which the first approximation function D_φ(s^(j), ·), which approximates an objective index imitating the subjective sound quality evaluation P(s^(j), ·) of its input, is updated, without updating the first model M_θ, so as to minimize the first cost function L_D, thereby obtaining the second approximation function. Here, the first mask M(x^(j); θ) is obtained by applying the first model M_θ to the observation signal x^(j).
  • in the processing of steps S114b, S113b, S115b, S116e, S119h, S116f, and S117c, the first model M_θ is updated to obtain the second model M_θ.
  • <Speech enhancement processing> Information identifying the model M_θ and the approximation function D_φ(s, ·) learned as described above is stored in the model storage unit 120 of the speech enhancement device 12 (FIG. 4). For example, the parameters θ and φ output in step S119j are stored in the model storage unit 120. Under this premise, the following speech enhancement processing is executed.
  • an observation signal x, which is a time-series acoustic signal in the time domain, is input to the input unit 121 of the speech enhancement device 12 (FIG. 4) (step S121).
  • the observation signal x is input to the frequency domain conversion unit 122.
  • the observation signal x is input to the mask estimation unit 123.
  • the mask estimation unit 123 applies the model M_θ to the observation signal x to estimate and output the T-F mask M(x; θ) (step S123).
  • the observation signal X and the T-F mask M(x; θ) are input to the mask application unit 124.
  • the mask application unit 124 applies (multiplies) the T-F mask M(x; θ) to the observation signal X in the time-frequency domain, and obtains and outputs the masked speech signal M(x; θ) ◎ X (step S124).
  • the speech signal M(x; θ) ◎ X is input to the time domain conversion unit 125.
  • the time domain conversion unit 125 applies a time-domain transform Q+ such as the inverse STFT to the masked speech signal M(x; θ) ◎ X to obtain and output the emphasized speech y in the time domain (equation (1)) (step S126). A minimal code sketch of this enhancement processing is given at the end of this Definitions section.
  • the learning device 11 and the speech enhancement device 12 in each embodiment are devices configured by, for example, a general-purpose or dedicated computer equipped with a processor (hardware processor) such as a CPU (central processing unit) and memory such as a RAM (random-access memory) and a ROM (read-only memory) executing a predetermined program.
  • This computer may have one processor and memory, or may have a plurality of processors and memory.
  • This program may be installed in a computer or may be recorded in a ROM or the like in advance.
  • some or all of the processing units may be configured using electronic circuitry that realizes the processing functions on its own, instead of electronic circuitry (circuitry) that realizes the functional configuration by reading a program as a CPU does.
  • the electronic circuit constituting one device may include a plurality of CPUs.
  • FIG. 8 is a block diagram illustrating the hardware configurations of the learning device 11 and the speech enhancement device 12 in each embodiment.
  • the learning device 11 and the speech enhancement device 12 of this example include a CPU (Central Processing Unit) 10a, an output unit 10b, an output unit 10c, a RAM (Random Access Memory) 10d, a ROM (Read Only Memory) 10e, an auxiliary storage device 10f, and a bus 10g.
  • the CPU 10a of this example has a control unit 10aa, a calculation unit 10ab, and a register 10ac, and executes various arithmetic processes according to various programs read into the register 10ac.
  • the output unit 10b is an output terminal, a display, or the like on which data is output.
  • the output unit 10c is a LAN card or the like controlled by the CPU 10a that has read a predetermined program.
  • the RAM 10d is a SRAM (Static Random Access Memory), a DRAM (Dynamic Random Access Memory), or the like, and has a program area 10da in which a predetermined program is stored and a data area 10db in which various data are stored.
  • the auxiliary storage device 10f is, for example, a hard disk, MO (Magneto-Optical disc), a semiconductor memory, or the like, and has a program area 10fa for storing a predetermined program and a data area 10fb for storing various data.
  • the bus 10g connects the CPU 10a, the output unit 10b, the output unit 10c, the RAM 10d, the ROM 10e, and the auxiliary storage device 10f so that information can be exchanged.
  • the CPU 10a writes the program stored in the program area 10fa of the auxiliary storage device 10f to the program area 10da of the RAM 10d according to the read OS (Operating System) program.
  • the CPU 10a writes various data stored in the data area 10fb of the auxiliary storage device 10f to the data area 10db of the RAM 10d. Then, the address on the RAM 10d in which this program or data is written is stored in the register 10ac of the CPU 10a.
  • the control unit 10aa of the CPU 10a sequentially reads out the addresses stored in the register 10ac, reads the program or data from the area on the RAM 10d indicated by the read address, and causes the calculation unit 10ab to sequentially execute the operations indicated by the program.
  • the calculation result is stored in the register 10ac.
  • the above program can be recorded on a computer-readable recording medium.
  • a computer-readable recording medium is a non-transitory recording medium. Examples of such a recording medium include a magnetic recording device, an optical disc, a magneto-optical recording medium, and a semiconductor memory.
  • the distribution of this program is carried out, for example, by selling, transferring, renting, etc., a portable recording medium such as a DVD or CD-ROM on which the program is recorded.
  • the program may be stored in the storage device of the server computer, and the program may be distributed by transferring the program from the server computer to another computer via a network.
  • the computer that executes such a program first temporarily stores, for example, the program recorded on the portable recording medium or the program transferred from the server computer in its own storage device. Then, when the process is executed, the computer reads the program stored in its own storage device and executes the process according to the read program.
  • the computer may read the program directly from the portable recording medium and execute processing according to the program; furthermore, each time a program is transferred from the server computer to this computer, processing according to the received program may be executed sequentially.
  • the above processing may also be executed by a so-called ASP (Application Service Provider) type service that realizes the processing functions only through execution instructions and result acquisition, without transferring the program from the server computer to this computer.
  • the program in this embodiment includes information that is used for processing by a computer and is equivalent to a program (data that is not a direct command to the computer but has the property of defining the processing of the computer, etc.).
  • the present device is configured by executing a predetermined program on a computer, but at least a part of these processing contents may be realized by hardware.
  • OSQAS is not limited to PESQ, and may be any value as long as it is an objective index that imitates the subjective evaluation of human sound quality.
  • in step 3 described above, the model M_θ was trained first, but in step 3 the approximation function D_φ(s, ·) may be trained first.
  • although a DNN is used in the above-described embodiment, other models such as a probabilistic model may be used.
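The following is a minimal, self-contained sketch of the speech enhancement processing of steps S121 to S126 (input, frequency-domain transform, mask estimation with the trained model M_θ, mask application, and time-domain reconstruction). It is an illustration only: the STFT parameters, the mask-network interface, and the use of the soundfile package for file I/O are assumptions, not part of this publication.

```python
import soundfile as sf   # assumed available for wav file I/O
import torch

def enhance_file(in_path, out_path, mask_net, n_fft=512, hop=128):
    audio, fs = sf.read(in_path)                          # observation signal x (step S121)
    x = torch.tensor(audio, dtype=torch.float32)
    window = torch.hann_window(n_fft)
    X = torch.stft(x, n_fft, hop_length=hop, window=window, return_complex=True)  # X = Q(x) (step S122)
    with torch.no_grad():
        mask = mask_net(X.abs())                          # T-F mask M(x; theta) (step S123)
    Y = mask * X                                          # masked signal M(x; theta) * X (step S124)
    y = torch.istft(Y, n_fft, hop_length=hop, window=window, length=x.shape[-1])  # time domain (steps S125, S126)
    sf.write(out_path, y.numpy(), fs)
    return y
```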

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Software Systems (AREA)
  • Physics & Mathematics (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Evolutionary Computation (AREA)
  • Human Computer Interaction (AREA)
  • Signal Processing (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Artificial Intelligence (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Data Mining & Analysis (AREA)
  • Health & Medical Sciences (AREA)
  • Medical Informatics (AREA)
  • Quality & Reliability (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Computational Linguistics (AREA)
  • Soundproofing, Sound Blocking, And Sound Damping (AREA)

Abstract

In this invention, a second approximation function is obtained by updating a first approximation function that approximates an objective index simulating the subjective sound-quality evaluation of an input so as to minimize a first cost function that is based on the sum of: an error between the subjective sound-quality evaluation of a target sound and an objective index that simulates the subjective sound-quality evaluation of the target sound and is obtained by inputting a target speech signal representing the target sound into a first approximation function; an error between the subjective sound-quality evaluation of an observation sound based on the target sound and an objective index that simulates the subjective sound-quality evaluation of the observation sound and is obtained by inputting an observation signal representing the observation sound into the first approximation function; and an error between the subjective sound-quality evaluation of an emphatic sound that corresponds to a masked sound signal obtained by applying a first mask to the observation signal and an objective index that simulates the subjective sound-quality evaluation of the emphatic sound and is obtained by inputting an emphatic speech signal representing the emphatic sound into the first approximation function.

Description

Learning device, speech enhancement device, methods therefor, and program
 The present invention relates to a speech enhancement technique.
 It is assumed that the observation signal x ∈ R^T in the time domain of T samples is a mixed signal x = s + n of the target speech signal s and the noise signal n. T is an integer greater than or equal to 1. The purpose of speech enhancement is to estimate s from x with high accuracy. As illustrated in equation (1), in DNN speech enhancement, an observation signal X = Q(x) ∈ C^{F×K} is obtained by expressing the observation signal x in the time-frequency domain via a frequency-domain transform Q such as the short-time Fourier transform; X is multiplied by a time-frequency (T-F) mask M(x; θ) estimated using a DNN to obtain the masked speech signal M(x; θ) ◎ Q(x); and a time-domain transform Q+ such as the inverse STFT is applied to the masked speech signal M(x; θ) ◎ Q(x) to obtain the emphasized speech signal y.
 y = Q+(M(x; θ) ◎ Q(x))    (1)
Here, R represents the set of all real numbers, and C represents the set of all complex numbers. T, F, and K are positive integers: T represents the number of samples of the observation signal x belonging to a predetermined time interval (time length), F represents the number of discrete frequencies belonging to a predetermined band in the time-frequency domain (bandwidth), and K represents the number of discrete times belonging to a predetermined time interval in the time-frequency domain (time length). M(x; θ) ◎ Q(x) represents multiplying Q(x) by the T-F mask M(x; θ). ◎ denotes the Hadamard product. θ is a parameter of the DNN, and is usually learned so as to minimize the signal-to-distortion ratio (SDR) cost L_SDR represented by, for example, the following equation (2).
 L_SDR = -(clip_β[SDR(s, y)] + clip_β[SDR(n, m)]) / 2    (2)
where
Figure JPOXMLDOC01-appb-M000001
and
Figure JPOXMLDOC01-appb-M000002
is the L2 norm, m = x - y, clip_β[χ] = β·tanh(χ/β), and β > 0 is a clipping constant. For example, β = 20 (see, for example, Non-Patent Documents 1 to 4).
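For illustration, the following is a minimal PyTorch sketch of the masking pipeline of equation (1) and the clipped SDR cost of equation (2). The mask network interface (mask_net), the STFT parameters, and the exact form of SDR (taken here as 10·log10 of a power ratio) are assumptions for the sketch; the document does not fix them.

```python
import torch

def sdr(reference, estimate, eps=1e-8):
    # A standard signal-to-distortion ratio in dB:
    # 10 * log10(||reference||^2 / ||reference - estimate||^2).
    num = torch.sum(reference ** 2, dim=-1)
    den = torch.sum((reference - estimate) ** 2, dim=-1) + eps
    return 10.0 * torch.log10(num / den + eps)

def clip_beta(chi, beta=20.0):
    # Soft clipping of equation (2): clip_beta[chi] = beta * tanh(chi / beta).
    return beta * torch.tanh(chi / beta)

def l_sdr(s, y, n, m, beta=20.0):
    # Equation (2): L_SDR = -(clip_beta[SDR(s, y)] + clip_beta[SDR(n, m)]) / 2, with m = x - y.
    return -(clip_beta(sdr(s, y), beta) + clip_beta(sdr(n, m), beta)) / 2.0

def enhance(x, mask_net, n_fft=512, hop=128):
    # Equation (1): y = Q+( M(x; theta) (Hadamard) Q(x) ), with Q = STFT and Q+ = inverse STFT.
    window = torch.hann_window(n_fft, device=x.device)
    X = torch.stft(x, n_fft, hop_length=hop, window=window, return_complex=True)   # Q(x)
    mask = mask_net(X.abs())          # T-F mask M(x; theta), same shape as X, values in [0, 1]
    Y = mask * X                      # element-wise (Hadamard) product
    return torch.istft(Y, n_fft, hop_length=hop, window=window, length=x.shape[-1])  # Q+(...)
```

Because every operation above is differentiable in θ (the parameters of mask_net), the gradient of L_SDR can be obtained by ordinary backpropagation, which is the property the next paragraph relies on.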
The reason equation (2) is widely used as the cost function for DNN speech enhancement is that L_SDR is differentiable with respect to θ. In general, DNN training is performed by a gradient method using gradients obtained by the error backpropagation method.
Figure JPOXMLDOC01-appb-M000003
Here, ∂_θ is the differential operator with respect to θ. Since ∂_θ L_SDR can be calculated analytically, DNN training can be performed efficiently. λ is a positive constant.
It is known that functions whose gradients can be obtained analytically, such as SDR, do not necessarily match subjective sound quality evaluation. Therefore, when θ is learned so as to minimize L_SDR, the loss decreases but the sound quality may nevertheless deteriorate. To solve this, there is a method of using, as the cost function, an objective sound quality assessment score (OSQAS) that imitates subjective human sound quality evaluation, such as the perceptual evaluation of speech quality (PESQ) (Non-Patent Documents 1 to 3). Letting P(s, y) be the OSQAS computed from s and y, the cost function is as follows.
 L_P = -E[P(s, y)]_{x,y}    (4)
Here, E[p(x)]_x is the expected value of p(x) with respect to x. The problem with this cost function is that, for many OSQAS, the gradient ∂_θ L_P with respect to θ cannot be calculated analytically.
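To make the obstacle concrete: PESQ, a typical OSQAS, is computed by an external routine on raw waveforms, so no gradient with respect to θ flows through it. The sketch below uses the third-party pesq package (an assumption; any OSQAS implementation could be substituted) to estimate L_P of equation (4) by a mini-batch average.

```python
import numpy as np
from pesq import pesq  # third-party PESQ implementation (assumed to be installed)

def osqas_pesq(s, y, fs=16000):
    # P(s, y): PESQ score of the enhanced signal y against the clean reference s.
    # The score is produced outside any autograd graph, so its gradient with
    # respect to the mask parameters theta cannot be obtained by backpropagation.
    return pesq(fs, np.asarray(s, dtype=np.float64), np.asarray(y, dtype=np.float64), "wb")

def l_p(batch_s, batch_y, fs=16000):
    # Equation (4): L_P = -E[P(s, y)], estimated by averaging over a mini-batch.
    scores = [osqas_pesq(s, y, fs) for s, y in zip(batch_s, batch_y)]
    return -float(np.mean(scores))
```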
Therefore, in the prior art described in Non-Patent Document 3, a method of approximating P(s, y) with a differentiable function D_φ(s, y) having a parameter φ has been proposed.
Figure JPOXMLDOC01-appb-M000004
Here, if D_φ(s, y) is designed as a DNN, for example, D_φ(s, y) is differentiable with respect to y, and ∂_θ D_φ(s, y) can be calculated analytically. To achieve this, D_φ(s, y) is first learned so as to minimize the following cost function.
 L_D(GAN) = ε_φ(s) + ε_φ(y)    (6)
where ε_φ(·) is the squared error between the true OSQAS and the estimated OSQAS, ε_φ(·) = (P(s, ·) - D_φ(s, ·))^2. That is, in this prior art, D_φ(s, ·) is learned so as to minimize (i) the error ε_φ(y) for the current OSQAS and (ii) the error ε_φ(s) for the OSQAS under perfect speech enhancement. Next, the model M_θ that generates the time-frequency mask M(x; θ) from the observation signal x is learned so as to minimize the following cost function L_M(GAN).
 L_M(GAN) = (D_φ(s, y) - 1)^2    (7)
Here, OSQAS is normalized so that 0 ≤ P(s, y) ≤ 1, and it is assumed that P(s, s) = 1.
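A minimal sketch of the prior-art cost functions of equations (6) and (7). Here d_phi stands for the differentiable network D_φ (taking the clean reference and a second signal and returning a scalar in [0, 1]); its architecture and calling convention are assumptions.

```python
def eps_phi(d_phi, s, sig, p_true):
    # Squared error between the true OSQAS P(s, sig) and the estimate D_phi(s, sig).
    return (p_true - d_phi(s, sig)) ** 2

def l_d_gan(d_phi, s, y, p_ss, p_sy):
    # Equation (6): L_D(GAN) = eps_phi(s) + eps_phi(y).
    # Only the perfect-enhancement point (s, s) and the current estimate (s, y) are fitted.
    return eps_phi(d_phi, s, s, p_ss) + eps_phi(d_phi, s, y, p_sy)

def l_m_gan(d_phi, s, y):
    # Equation (7): L_M(GAN) = (D_phi(s, y) - 1)^2, using the normalization P(s, s) = 1.
    return (d_phi(s, y) - 1.0) ** 2
```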
The problem with the prior art described in Non-Patent Document 3 is the stability of learning. In order for learning to proceed so that the OSQAS of test data improves stably, the approximation of equation (5) needs to be highly accurate. However, with this prior art, OSQAS does not improve stably even as the number of training iterations increases. Therefore, there is still room for improvement in this prior art.
 The present invention has been made in view of these points, and its purpose is to approximate an objective index that imitates subjective human sound quality evaluation with a differentiable function with high accuracy, and to stabilize the learning of that differentiable function.
 A second approximation function is obtained by updating a first approximation function, which approximates an objective index imitating the subjective sound quality evaluation of its input, so as to minimize a first cost function based on the sum of: the error between the subjective sound quality evaluation of the target sound and the objective index imitating the subjective sound quality evaluation of the target sound, obtained by inputting a target speech signal representing the target sound into the first approximation function; the error between the subjective sound quality evaluation of the observed sound based on the target sound and the objective index imitating the subjective sound quality evaluation of the observed sound, obtained by inputting an observation signal representing the observed sound into the first approximation function; and the error between the subjective sound quality evaluation of the emphasized sound corresponding to the masked sound signal obtained by applying a first mask to the observation signal and the objective index imitating the subjective sound quality evaluation of the emphasized sound, obtained by inputting an emphasized speech signal representing the emphasized sound into the first approximation function.
 In the present invention, an objective index that imitates subjective human sound quality evaluation can be approximated by a differentiable function with high accuracy, and the learning of that differentiable function can be stabilized.
 FIG. 1 is a block diagram illustrating the functional configuration of the learning device of the embodiment. FIG. 2 is a diagram for explaining the learning method of the embodiment. FIG. 3 is a diagram for explaining the learning method of the embodiment. FIG. 4 is a block diagram for explaining the functional configuration of the speech enhancement device of the embodiment. FIG. 5 is a diagram for explaining the speech enhancement method of the embodiment. FIG. 6 is a diagram for intuitively illustrating the learning result of a differentiable function that approximates OSQAS. FIG. 7 is a diagram for explaining the learning results of the embodiment. FIG. 8 is a block diagram for explaining a hardware configuration.
Hereinafter, embodiments of the present invention will be described with reference to the drawings.
 [Principle]
 First, the principle will be explained. The present embodiment provides a method for improving the accuracy of the approximation in equation (5) and for stabilizing the learning of the differentiable function D_φ(s, ·). In this embodiment, a cost function L_D is used instead of the cost function L_D(GAN). FIG. 6(a) intuitively illustrates D_φ(s, ·) learned so as to minimize the cost function L_D(GAN), and FIG. 6(b) intuitively illustrates D_φ(s, ·) learned so as to minimize the cost function L_D. The solid line represents the true OSQAS, and the dotted and broken lines represent the learned D_φ(s, ·). The horizontal axis represents the amount of noise contained in the input observed sound, and the vertical axis represents the PESQ score.
In the non-patent literature, D_φ(s, ·) is learned so as to minimize the cost function L_D(GAN), that is, so as to minimize (i) the error ε_φ(y) with respect to the OSQAS P(s, y) before speech enhancement and (ii) the error ε_φ(s) with respect to the OSQAS P(s, s) under perfect speech enhancement. Here, (iii) the error with respect to the OSQAS when speech enhancement fails is not taken into consideration. Therefore, when D_φ(s, ·) is learned by the conventional technique, it can behave like either the dotted line or the broken line in FIG. 6(a). If D_φ(s, ·) is learned as shown by the broken line, learning does not proceed because the model M_θ is at a stationary point at the position of P(s, y), and in the worst case learning proceeds in a direction that degrades OSQAS. Therefore, in the present embodiment, D_φ(s, ·) is learned so as to minimize a cost function L_D that also takes viewpoint (iii) into account. Since it is difficult to obtain the OSQAS when speech enhancement fails, the OSQAS before speech enhancement, P(s, x), is used in its place. That is, D_φ(s, ·) is learned so as to minimize (i) the error ε_φ(y) with respect to the OSQAS P(s, y) before speech enhancement, (ii) the error ε_φ(s) with respect to the OSQAS P(s, s) under perfect speech enhancement, and (iii) the error ε_φ(x) with respect to the OSQAS P(s, x) when speech enhancement fails. In other words, as shown in FIG. 6(b), learning is performed so that D_φ(s, s), D_φ(s, y), and D_φ(s, x) approximate the three points P(s, s), P(s, y), and P(s, x), respectively. For example, with M as the mini-batch size, the cost value of the following cost function L_D is computed, and φ is learned so as to minimize it.
Figure JPOXMLDOC01-appb-M000005
Here, M is a positive integer, j = 1, ..., M, s^(j) is the j-th target speech signal, x^(j) is the j-th observation signal, and y^(j) is the emphasized speech signal. For the j-th noise signal n^(j), the observation signal x^(j) is a mixed signal x^(j) = s^(j) + n^(j) of the target speech signal s^(j) and the noise signal n^(j). s^(j), x^(j), y^(j), and n^(j) are each time-series signals of T samples in the time domain; the target speech signal s^(j) represents the target sound, the observation signal x^(j) represents the observed sound, the emphasized speech signal y^(j) represents the emphasized sound, and the noise signal n^(j) represents noise. That is, the first approximation function D_φ(s^(j), ·), which approximates an objective index imitating the subjective sound quality evaluation P(s^(j), ·) of its input, is updated so as to minimize the first cost function L_D based on the sum of: the error ε_φ(s^(j)) between the subjective sound quality evaluation P(s^(j), s^(j)) of the target sound and the objective index D_φ(s^(j), s^(j)) imitating it, obtained by inputting the target speech signal s^(j) representing the target sound into the first approximation function; the error ε_φ(x^(j)) between the subjective sound quality evaluation P(s^(j), x^(j)) of the observed sound based on the target sound and the objective index D_φ(s^(j), x^(j)) imitating it, obtained by inputting the observation signal x^(j) representing the observed sound; and the error ε_φ(y^(j)) between the subjective sound quality evaluation P(s^(j), y^(j)) of the emphasized sound corresponding to the masked sound signal obtained by applying the first mask M(x^(j); θ) to the observation signal x^(j) and the objective index D_φ(s^(j), y^(j)) imitating it, obtained by inputting the emphasized speech signal y^(j) representing the emphasized sound.
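A minimal sketch of the proposed mini-batch cost of equation (8), which additionally fits the pre-enhancement point (s^(j), x^(j)). The per-example terms follow equation (8a); the plain mean over the mini-batch and the d_phi calling convention are assumptions.

```python
def l_d(d_phi, batch_s, batch_x, batch_y, batch_p_ss, batch_p_sx, batch_p_sy):
    # Equation (8) via equation (8a): for each j,
    # L_D(j) = eps_phi(s^(j)) + eps_phi(x^(j)) + eps_phi(y^(j)), then average over the mini-batch.
    total = 0.0
    for s, x, y, p_ss, p_sx, p_sy in zip(batch_s, batch_x, batch_y,
                                         batch_p_ss, batch_p_sx, batch_p_sy):
        eps_s = (p_ss - d_phi(s, s)) ** 2   # error at the perfect-enhancement point P(s, s)
        eps_x = (p_sx - d_phi(s, x)) ** 2   # error at the pre-enhancement point P(s, x)
        eps_y = (p_sy - d_phi(s, y)) ** 2   # error at the current enhanced signal P(s, y)
        total = total + eps_s + eps_x + eps_y
    return total / len(batch_s)             # mini-batch average (convention assumed)
```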
Pre-training can also be used. Although SDR and OSQAS do not completely match, a higher SDR tends to yield a higher OSQAS. Since learning D_φ(s, ·) using the cost function L_D tends to be unstable, for example, the model M_θ is first trained using the cost function L_SDR of equation (2), and then only D_φ(s, ·) is trained using the cost function L_D of equation (8) without updating the model M_θ. After that, as a final adjustment, the parameters φ and θ may be updated using alternating optimization that alternates between the cost function L_P of equation (4a), in which P(s, y) of equation (4) is replaced with D_φ(s, y), and the cost function L_D of equation (8).
 L_P = -E[D_φ(s, y)]_{x,y}    (4a)
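Combining the pieces, the following is a rough training-schedule sketch of these three stages under the same assumptions as the earlier snippets: enhance, l_sdr, and l_d are the helpers sketched above, mask_net plays the role of M_θ, d_phi plays the role of D_φ, and loader and true_osqas (a routine returning the non-differentiable OSQAS P) are hypothetical. Optimizers, learning rates, iteration counts, and batch interfaces are illustrative only.

```python
import torch

def train(mask_net, d_phi, loader, n_pre1=1000, n_pre2=1000, n_alt=1000, lr=1e-4):
    opt_theta = torch.optim.Adam(mask_net.parameters(), lr=lr)   # updates theta
    opt_phi = torch.optim.Adam(d_phi.parameters(), lr=lr)        # updates phi

    # Stage 1: pre-train M_theta with the clipped SDR cost of equation (2).
    for _ in range(n_pre1):
        s, x, n = next(loader)
        y = enhance(x, mask_net)
        loss = l_sdr(s, y, n, x - y).mean()
        opt_theta.zero_grad(); loss.backward(); opt_theta.step()

    # Stage 2: pre-train D_phi with L_D of equation (8) while M_theta stays frozen.
    for _ in range(n_pre2):
        s, x, n = next(loader)
        with torch.no_grad():
            y = enhance(x, mask_net)            # theta is not updated in this stage
        loss = l_d(d_phi, s, x, y, true_osqas(s, s), true_osqas(s, x), true_osqas(s, y))
        opt_phi.zero_grad(); loss.backward(); opt_phi.step()

    # Stage 3: fine-tune by alternating L_P of equation (4a) (updates theta)
    # and L_D of equation (8) (updates phi).
    for _ in range(n_alt):
        s, x, n = next(loader)
        y = enhance(x, mask_net)
        loss_theta = -d_phi(s, y).mean()        # equation (4a): L_P = -E[D_phi(s, y)]
        opt_theta.zero_grad(); loss_theta.backward(); opt_theta.step()

        with torch.no_grad():
            y = enhance(x, mask_net)
        loss_phi = l_d(d_phi, s, x, y, true_osqas(s, s), true_osqas(s, x), true_osqas(s, y))
        opt_phi.zero_grad(); loss_phi.backward(); opt_phi.step()
```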
[First Embodiment]
 Next, the first embodiment of the present invention will be described.
 <Configuration>
 As illustrated in FIG. 1, the learning device 11 of the present embodiment includes storage units 111 and 112, a mask estimation application unit 113a, mask application units 113b and 113c, model application units 114a to 114c, approximation function application units 115a to 115c, gradient calculation units 116a to 116h, parameter update units 117a to 117d, a memory 118, and a control unit 119. The learning device 11 executes each process under the control of the control unit 119, and the data obtained in each process is stored in the memory 118, read out as needed, and used in other processes.
 As illustrated in FIG. 4, the speech enhancement device 12 of the present embodiment includes a model storage unit 120, an input unit 121, a frequency domain conversion unit 122, a mask estimation unit 123, a mask application unit 124, a time domain conversion unit 125, an output unit 126, a control unit 127, and a memory 128. The speech enhancement device 12 executes each process under the control of the control unit 127, and the data obtained in each process is stored in the memory 128, read out as needed, and used in other processes.
<Learning process>
 The learning process of the present embodiment is exemplified below.
 First, training data consisting of target speech signals s^(j) and corresponding observation signals x^(j) is prepared, where j = 1, ..., M and M is an integer of 1 or more. The target speech signals s^(1), ..., s^(M) are stored in the storage unit 111, and the observation signals x^(1), ..., x^(M) are stored in the storage unit 112. Under this premise, the following steps 1, 2, and 3 are executed.
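For illustration, one way to prepare such a pair (s^(j), x^(j)) is to mix a clean utterance with noise at a chosen signal-to-noise ratio; the SNR convention and scaling below are assumptions, not taken from the document.

```python
import numpy as np

def make_pair(clean, noise, snr_db):
    # Build one training pair: target speech s^(j) and observation x^(j) = s^(j) + n^(j),
    # with the noise scaled so the mixture has the requested SNR in dB.
    s = np.asarray(clean, dtype=np.float64)
    n = np.asarray(noise, dtype=np.float64)[: len(s)]
    target_noise_power = np.sum(s ** 2) / (10.0 ** (snr_db / 10.0))
    n = n * np.sqrt(target_noise_power / (np.sum(n ** 2) + 1e-12))
    return s, s + n
```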
≪Step 1: Pre-training of the model M_θ≫
 In step 1, the model M_θ is pre-trained. The process of step 1 will be described below with reference to FIG. 2.
 First, the control unit 119 sets the parameter θ of the model M_θ to an initial value (step S119aa).
 Next, the control unit 119 initializes i = 1 (step S119ab).
 Next, the model application unit 114a extracts the observation signal x^(i) from the storage unit 112, applies the model M_θ to the observation signal x^(i), and obtains and outputs the mask M(x^(i); θ) (step S114a).
The mask M(x^(i); θ) is input to the gradient calculation unit 116a. The gradient calculation unit 116a extracts the observation signal x^(i) from the storage unit 111 and obtains and outputs the gradient ∂_φ L_SDR^M(i), where the following equations (1a) and (2a) are satisfied and m^(i) = x^(i) - y^(i) (step S116a).
 y^(i) = Q+(M(x^(i); θ) ◎ Q(x^(i)))    (1a)
 L_SDR^M(i) = -(clip_β[SDR(s^(i), y^(i))] + clip_β[SDR(n^(i), m^(i))]) / 2    (2a)
 The control unit 119 determines whether i = N, where N is a positive integer less than or equal to M, for example N = 5 (step S119b). If i is not equal to N, i + 1 is set as the new i and the process returns to step S114a.
 On the other hand, if i = N, the gradients ∂_φ L_SDR^M(1), ..., ∂_φ L_SDR^M(N) are input to the gradient calculation unit 116b. The gradient calculation unit 116b uses the gradients ∂_φ L_SDR^M(1), ..., ∂_φ L_SDR^M(N) to obtain and output the gradient ∂_φ L_SDR^M, where ∂_φ L_SDR^M is a function value of ∂_φ L_SDR^M(1), ..., ∂_φ L_SDR^M(N), for example ∂_φ L_SDR^M = (∂_φ L_SDR^M(1) + ... + ∂_φ L_SDR^M(N)) / N (step S116b).
 The gradient ∂_φ L_SDR^M is input to the parameter update unit 117a. The parameter update unit 117a updates and outputs the parameter θ by a gradient method using the gradient ∂_φ L_SDR^M; that is, the parameter θ is updated and output so as to minimize L_SDR^M = (L_SDR^M(1) + ... + L_SDR^M(N)) / N (step S117a).
 The control unit 119 determines whether a convergence condition is satisfied. Examples of the convergence condition include that the processing of steps S114a, S116a, S119b, S116b, and S117a has been repeated a certain number of times, and that the amounts of change in θ and L_SDR^M are equal to or less than predetermined values. If the convergence condition is not satisfied, the process returns to step S119ab. If the convergence condition is satisfied, the parameter update unit 117a outputs the parameter θ and the process of step 1 ends (step S119c).
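A minimal sketch of the step 1 loop just described (steps S114a to S119c): per-example values L_SDR^M(i) are averaged over N examples, one gradient update of θ is made, and the loop repeats until a simple convergence check passes. It reuses the enhance and l_sdr helpers sketched earlier; N, the optimizer, the data format, and the convergence tolerance are illustrative assumptions.

```python
import torch

def pretrain_mask_model(mask_net, data, N=5, lr=1e-4, tol=1e-4, max_rounds=10000):
    # data: list of (s, x, n) tensors; only the first N examples are used per round.
    opt = torch.optim.SGD(mask_net.parameters(), lr=lr)
    prev = None
    for _ in range(max_rounds):
        losses = []
        for s, x, n in data[:N]:                      # i = 1, ..., N
            y = enhance(x, mask_net)                  # equation (1a)
            losses.append(l_sdr(s, y, n, x - y))      # L_SDR^M(i) of equation (2a)
        loss = torch.stack(losses).mean()             # L_SDR^M: average over the N examples
        opt.zero_grad(); loss.backward(); opt.step()  # parameter update of theta (step S117a)
        if prev is not None and abs(prev - loss.item()) <= tol:
            break                                     # convergence check on the change in L_SDR^M
        prev = loss.item()
    return mask_net
```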
≪Step 2: Pre-training of the approximation function D_φ(s, ·)≫
 In step 2, the approximation function D_φ(s, ·) is pre-trained. The process of step 2 will be described below with reference to FIG. 2.
 First, the control unit 119 sets the parameter φ of the approximation function D_φ(s, ·) to an initial value (step S119da).
 Next, the control unit 119 initializes j = 1 (step S119db).
 The parameter θ obtained in step 1 is input to the mask estimation application unit 113a. The mask estimation application unit 113a extracts the observation signal x^(j) from the storage unit 112 and applies the model M_θ to the observation signal x^(j) to obtain the mask M(x^(j); θ). Further, the mask estimation application unit 113a applies the mask M(x^(j); θ) to the observation signal x^(j) to obtain and output the emphasized speech signal y^(j) as in equation (1b) (step S113a).
 y^(j) = Q+(M(x^(j); θ) ◎ Q(x^(j)))    (1b)
 The observation signal x^(j) and the emphasized speech signal y^(j) are input to the approximation function application unit 115a. The approximation function application unit 115a further extracts the target speech signal s^(j) from the storage unit 111. The approximation function application unit 115a inputs s^(j), x^(j), and y^(j) into the approximation function D_φ(s, ·) and obtains and outputs D_φ(s^(j), s^(j)), D_φ(s^(j), x^(j)), and D_φ(s^(j), y^(j)) (step S115a).
 D_φ(s^(j), s^(j)), D_φ(s^(j), x^(j)), and D_φ(s^(j), y^(j)) are input to the gradient calculation unit 116c. The calculated values P(s^(j), s^(j)), P(s^(j), x^(j)), and P(s^(j), y^(j)) are also input to the gradient calculation unit 116c. The gradient calculation unit 116c then obtains the error ε_φ(s^(j)) between P(s^(j), s^(j)) and D_φ(s^(j), s^(j)), the error ε_φ(x^(j)) between P(s^(j), x^(j)) and D_φ(s^(j), x^(j)), and the error ε_φ(y^(j)) between P(s^(j), y^(j)) and D_φ(s^(j), y^(j)), and obtains and outputs the gradient ∂_φ L_D^(j) according to equation (8a) (step S116c).
 L_D^(j) = ε_φ(s^(j)) + ε_φ(x^(j)) + ε_φ(y^(j))    (8a)
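 In terms of the sketch above, the per-example cost of equation (8a) can be written as follows. The squared-error form of ε_φ is an assumption made for illustration; the embodiment only requires some error between the calculated score P and the approximation D_φ.

    def loss_D(D_phi, s, x, y, P_ss, P_sx, P_sy):
        # Equation (8a): L_D = eps_phi(s) + eps_phi(x) + eps_phi(y),
        # with each eps taken here as a squared error between P and D_phi.
        eps_s = (P_ss - D_phi(s, s)) ** 2
        eps_x = (P_sx - D_phi(s, x)) ** 2
        eps_y = (P_sy - D_phi(s, y)) ** 2
        return eps_s + eps_x + eps_y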
 The control unit 119 determines whether or not j = M; for example, M = 5 (step S119e). If j = M does not hold, j + 1 is set as the new j and the process returns to step S113a.
 On the other hand, if j = M, the gradients ∂_φ L_D^(1), …, ∂_φ L_D^(M) are input to the gradient calculation unit 116d. The gradient calculation unit 116d obtains and outputs the gradient ∂_φ L_D using ∂_φ L_D^(1), …, ∂_φ L_D^(M). Here, the gradient ∂_φ L_D is a function value of ∂_φ L_D^(1), …, ∂_φ L_D^(M); for example, ∂_φ L_D = (∂_φ L_D^(1) + … + ∂_φ L_D^(M)) / M (step S116d).
 The gradient ∂_φ L_D is input to the parameter update unit 117b. The parameter update unit 117b updates the parameter φ by a gradient method using the gradient ∂_φ L_D. That is, the parameter update unit 117b updates and outputs the parameter φ so as to minimize L_D of equation (8) (step S117b).
 The control unit 119 determines whether or not a convergence condition is satisfied. Examples of the convergence condition include that the processing of steps S115a, S116c, S119e, S116d, and S117b has been repeated a fixed number of times, or that the amount of change in φ or L_D is equal to or less than a predetermined value. If the convergence condition is not satisfied, the process returns to step S119db. If the convergence condition is satisfied, the parameter update unit 117b outputs the parameter φ, and the processing of step 2 ends (step S119f).
 Note that step 2 corresponds to an approximation function learning step of updating a first approximation function D_φ(s^(j), ·), which approximates an objective index imitating a subjective sound quality evaluation P(s^(j), ·) of its input, so as to minimize a first cost function L_D based on the sum of: the error ε_φ(s^(j)) between the subjective sound quality evaluation P(s^(j), s^(j)) of the target sound and the objective index D_φ(s^(j), s^(j)) imitating the subjective sound quality evaluation of the target sound, obtained by inputting the target speech signal s^(j) representing the target sound into the first approximation function D_φ(s^(j), ·); the error ε_φ(x^(j)) between the subjective sound quality evaluation P(s^(j), x^(j)) of the observed sound based on the target sound and the objective index D_φ(s^(j), x^(j)) imitating the subjective sound quality evaluation of the observed sound, obtained by inputting the observation signal x^(j) representing the observed sound into the first approximation function D_φ(s^(j), ·); and the error ε_φ(y^(j)) between the subjective sound quality evaluation P(s^(j), y^(j)) of the emphasized sound corresponding to the post-mask sound signal obtained by applying the first mask M(x^(j); θ) to the observation signal x^(j) and the objective index D_φ(s^(j), y^(j)) imitating the subjective sound quality evaluation of the emphasized sound, obtained by inputting the emphasized speech signal y^(j) representing the emphasized sound into the first approximation function D_φ(s^(j), ·); thereby obtaining a second approximation function D_φ(s^(j), ·). In this example, the first mask M(x^(j); θ) is obtained by applying the first model M_θ to the observation signal x^(j), and the approximation function learning step is a step of updating the first approximation function D_φ(s^(j), ·) to obtain the second approximation function D_φ(s^(j), ·) so as to minimize the first cost function L_D without updating the first model M_θ.
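 The whole of step 2 can therefore be sketched as the following loop, in which the mask model from step 1 is kept fixed and only φ is updated so that D_φ tracks the calculated scores P. The data iterator minibatch, the optimizer choice, and the loop bound are assumptions made for this sketch; loss_D and enhance are the helpers sketched above.

    opt_phi = torch.optim.Adam(D_phi.parameters(), lr=1e-4)  # assumed optimizer for phi

    for _ in range(num_pretrain_steps):                # until the convergence condition of step S119f
        opt_phi.zero_grad()
        total = 0.0
        for s, x, P_ss, P_sx, P_sy in minibatch(M=5):  # M examples (steps S113a to S119e)
            with torch.no_grad():                      # theta is not updated in step 2
                y = enhance(x, mask_net)
            total = total + loss_D(D_phi, s, x, y, P_ss, P_sx, P_sy)
        (total / 5).backward()                         # average over the M examples (step S116d)
        opt_phi.step()                                 # update phi only (step S117b)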
 << Step 3: Learning processing of the model M_θ and the approximation function D_φ(s, ·) >>
 In the final step 3, learning is performed in which the model M_θ and the approximation function D_φ(s, ·) are updated alternately. In step 3, for example, N = 5 and M = 10. The processing of step 3 is described below with reference to FIG. 3.
 First, the control unit 119 sets the parameters θ and φ obtained in steps 1 and 2 as the initial values (step S119ga).
 Next, the control unit 119 initializes i and j to i = 1 and j = 1 (step S119gb).
 Next, the model application unit 114b extracts the observation signal x^(i) from the storage unit 112, applies the model M_θ to the observation signal x^(i), and obtains and outputs the mask M(x^(i); θ) (step S114b).
 The mask M(x^(i); θ) is input to the mask application unit 113b. Q(x^(i)) is further input to the mask application unit 113b, and the mask application unit 113b obtains and outputs the emphasized speech signal y^(i) according to equation (1a) (step S113b).
 The emphasized speech signal y^(i) is input to the approximation function application unit 115b. The approximation function application unit 115b further extracts the target speech signal s^(i) from the storage unit 111. The approximation function application unit 115b inputs s^(i) and y^(i) into the approximation function D_φ(s, ·) to obtain and output D_φ(s^(i), y^(i)) (step S115b).
 D_φ(s^(i), y^(i)) is input to the gradient calculation unit 116e. The gradient calculation unit 116e obtains and outputs the gradient ∂_θ L_M^(i), where L_M^(i) satisfies the following equation (4b) (step S116e).
 L_M^(i) = -E[ D_φ(s^(i), y^(i)) ]_(x,y)    (4b)
 The control unit 119 determines whether or not i = N (step S119h). If i = N does not hold, i + 1 is set as the new i and the process returns to step S114b.
 On the other hand, if i = N, the gradients ∂_θ L_M^(1), …, ∂_θ L_M^(N) are input to the gradient calculation unit 116f. The gradient calculation unit 116f obtains and outputs the gradient ∂_θ L_M using ∂_θ L_M^(1), …, ∂_θ L_M^(N). Here, ∂_θ L_M is a function value of ∂_θ L_M^(1), …, ∂_θ L_M^(N); for example, ∂_θ L_M = (∂_θ L_M^(1) + … + ∂_θ L_M^(N)) / N (step S116f).
 The gradient ∂_θ L_M is input to the parameter update unit 117c. The parameter update unit 117c updates and outputs the parameter θ by a gradient method using the gradient ∂_θ L_M. That is, the parameter update unit 117c updates and outputs the parameter θ so as to minimize L_M = (L_M^(1) + … + L_M^(N)) / N. The updated parameter θ is input to the model application units 114b and 114c (step S117c).
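 Steps S114b to S117c amount to updating θ so that the learned approximator scores the enhanced output as highly as possible, i.e. minimizing equation (4b). A sketch of this half of step 3 is shown below, under the same assumptions as the sketches above (minibatch, mask_net, and D_phi are assumed names); the gradient is taken with respect to θ, which is what the update of θ requires.

    opt_theta = torch.optim.Adam(mask_net.parameters(), lr=1e-4)  # assumed optimizer for theta

    opt_theta.zero_grad()
    L_M = 0.0
    for s, x in minibatch(N=5):          # N examples (steps S114b to S119h)
        y = enhance(x, mask_net)         # gradients flow back into theta here
        L_M = L_M - D_phi(s, y)          # equation (4b): L_M = -E[ D_phi(s, y) ]
    (L_M / 5).backward()                 # average over the N examples (step S116f)
    opt_theta.step()                     # update theta only (step S117c); phi is left untouched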
 Next, the model application unit 114c extracts the observation signal x^(j) from the storage unit 112, applies the model M_θ to the observation signal x^(j), and obtains and outputs the mask M(x^(j); θ) (step S114c).
 The mask M(x^(j); θ) is input to the mask application unit 113c. Q(x^(j)) is further input to the mask application unit 113c, and the mask application unit 113c obtains and outputs the emphasized speech signal y^(j) according to the following equation (1a') (step S113c).
 y^(j) = Q+( M(x^(j); θ) ◎ Q(x^(j)) )    (1a')
 The emphasized speech signal y^(j) is input to the approximation function application unit 115c. The approximation function application unit 115c further extracts the target speech signal s^(j) from the storage unit 111 and extracts the observation signal x^(j) from the storage unit 112. The approximation function application unit 115c inputs s^(j), x^(j), and y^(j) into the approximation function D_φ(s, ·) to obtain and output D_φ(s^(j), s^(j)), D_φ(s^(j), x^(j)), and D_φ(s^(j), y^(j)) (step S115c).
 D_φ(s^(j), s^(j)), D_φ(s^(j), x^(j)), and D_φ(s^(j), y^(j)) are input to the gradient calculation unit 116g. The calculated values P(s^(j), s^(j)), P(s^(j), x^(j)), and P(s^(j), y^(j)) are also input to the gradient calculation unit 116g. The gradient calculation unit 116g then obtains the error ε_φ(s^(j)) between P(s^(j), s^(j)) and D_φ(s^(j), s^(j)), the error ε_φ(x^(j)) between P(s^(j), x^(j)) and D_φ(s^(j), x^(j)), and the error ε_φ(y^(j)) between P(s^(j), y^(j)) and D_φ(s^(j), y^(j)), and obtains and outputs the gradient ∂_φ L_D^(j) (equation (8a)) (step S116g).
 The control unit 119 determines whether or not j = M (step S119i). If j = M does not hold, j + 1 is set as the new j and the process returns to step S114c.
 On the other hand, if j = M, the gradients ∂_φ L_D^(1), …, ∂_φ L_D^(M) are input to the gradient calculation unit 116h. The gradient calculation unit 116h obtains and outputs the gradient ∂_φ L_D using ∂_φ L_D^(1), …, ∂_φ L_D^(M) (step S116h).
 The gradient ∂_φ L_D is input to the parameter update unit 117f. The parameter update unit 117f updates the parameter φ by a gradient method using the gradient ∂_φ L_D. That is, the parameter update unit 117f updates and outputs the parameter φ so as to minimize L_D of equation (8). The parameter φ is input to the approximation function application units 115b and 115c (step S117d).
 The control unit 119 determines whether or not a convergence condition is satisfied. Examples of the convergence condition include that the processing of steps S114b, S113b, S115b, S116e, S119h, S116f, S117c, S114c, S113c, S115c, S116g, S119i, S116h, and S117d has been repeated a fixed number of times, or that the amounts of change in θ, φ, L_M, and L_D are equal to or less than predetermined values. If the convergence condition is satisfied, the parameter update unit 117c outputs the parameter θ, the parameter update unit 117d outputs the parameter φ, and the processing of step 3 ends (step S119j).
 Note that the processing of steps S114c, S113c, S115c, S116g, S119i, S116h, and S117d corresponds to the approximation function learning step of updating the first approximation function D_φ(s^(j), ·), which approximates an objective index imitating a subjective sound quality evaluation P(s^(j), ·) of its input, so as to minimize the first cost function L_D based on the sum of: the error ε_φ(s^(j)) between the subjective sound quality evaluation P(s^(j), s^(j)) of the target sound and the objective index D_φ(s^(j), s^(j)) imitating the subjective sound quality evaluation of the target sound, obtained by inputting the target speech signal s^(j) representing the target sound into the first approximation function D_φ(s^(j), ·); the error ε_φ(x^(j)) between the subjective sound quality evaluation P(s^(j), x^(j)) of the observed sound based on the target sound and the objective index D_φ(s^(j), x^(j)) imitating the subjective sound quality evaluation of the observed sound, obtained by inputting the observation signal x^(j) representing the observed sound into the first approximation function D_φ(s^(j), ·); and the error ε_φ(y^(j)) between the subjective sound quality evaluation P(s^(j), y^(j)) of the emphasized sound corresponding to the post-mask sound signal obtained by applying the first mask M(x^(j); θ) to the observation signal x^(j) and the objective index D_φ(s^(j), y^(j)) imitating the subjective sound quality evaluation of the emphasized sound, obtained by inputting the emphasized speech signal y^(j) representing the emphasized sound into the first approximation function D_φ(s^(j), ·); thereby obtaining the second approximation function D_φ(s^(j), ·). In this example, the first mask M(x^(j); θ) is obtained by applying the first model M_θ to the observation signal x^(j), and the approximation function learning step is a step of updating the first approximation function D_φ(s^(j), ·) to obtain the second approximation function D_φ(s^(j), ·) so as to minimize the first cost function L_D without updating the first model M_θ. The processing of steps S114b, S113b, S115b, S116e, S119h, S116f, and S117c updates the first model M_θ to obtain a second model M_θ so as to minimize the second cost function L_M^(i) based on an expected value of the second objective index D_φ(s^(i), y^(i)) imitating the subjective sound quality evaluation of the emphasized sound, obtained by inputting the emphasized speech signal y^(i) representing the emphasized sound into the second approximation function D_φ(s^(i), ·).
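 Putting the two halves together, step 3 alternates the θ update of equation (4b) with the φ update of equation (8a) until the convergence condition of step S119j is met. In terms of the sketches above (wrapped here into two hypothetical helpers, update_theta and update_phi), the outer loop looks like this:

    for _ in range(num_outer_steps):           # until the convergence condition of step S119j
        update_theta(mask_net, D_phi, N=5)     # steps S114b to S117c: minimize L_M with respect to theta
        update_phi(mask_net, D_phi, M=10)      # steps S114c to S117d: minimize L_D with respect to phi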
 <Speech enhancement processing>
 The information specifying the model M_θ and the approximation function D_φ(s, ·) learned as described above is stored in the model storage unit 120 of the speech enhancement device 12 (FIG. 4). For example, the parameters θ and φ output in step S119j are stored in the model storage unit 120. Under this premise, the following speech enhancement processing is executed.
 As illustrated in FIG. 5, an observation signal x, which is a time-series acoustic signal in the time domain, is input to the input unit 121 of the speech enhancement device 12 (FIG. 4) (step S121).
 The observation signal x is input to the frequency domain conversion unit 122. The frequency domain conversion unit 122 obtains and outputs an observation signal X = Q(x), which is the time-frequency domain representation of the observation signal x, by a frequency domain conversion process Q such as the short-time Fourier transform (step S122).
 The observation signal x is input to the mask estimation unit 123. The mask estimation unit 123 applies the model M_θ to the observation signal x to estimate and output the T-F mask M(x; θ) (step S123).
 The observation signal X and the T-F mask M(x; θ) are input to the mask application unit 124. The mask application unit 124 applies (multiplies) the T-F mask M(x; θ) to the observation signal X in the time-frequency domain to obtain and output the post-mask speech signal M(x; θ) ◎ X (step S124).
 The post-mask speech signal M(x; θ) ◎ X is input to the time domain conversion unit 125. The time domain conversion unit 125 applies a time domain conversion process Q+ such as the inverse STFT to the post-mask speech signal M(x; θ) ◎ X to obtain and output the emphasized speech y in the time domain (equation (1)) (step S126).
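 At enhancement time only the learned mask model M_θ is used; the approximation function D_φ plays no role. A minimal sketch of steps S121 to S126, reusing the enhance helper from the sketch above (load_waveform and save_waveform are hypothetical I/O helpers, not part of the embodiment):

    x = load_waveform("observed.wav")     # observed signal x (step S121)
    with torch.no_grad():
        y = enhance(x, mask_net)          # steps S122 to S124 and the inverse STFT (step S126)
    save_waveform("enhanced.wav", y)      # emphasized speech y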
 [Validation of effectiveness]
 To verify the effectiveness of the present invention, experiments were conducted using a public speech enhancement data set (Non-Patent Document 4). FIG. 7 shows the results of performing the learning processing of step 3 on the model M_θ and the approximation function D_φ(s, ·) pre-trained in steps 1 and 2, using different random number seeds. Here, PESQ was used as the OSQAS. As can be seen from FIG. 7, PESQ improves steadily as learning progresses. Comparing the PESQ values after learning, the prior art of Non-Patent Document 3 achieved a PESQ of 2.86, whereas the method of the present embodiment achieved 2.93. From this, it can be said that the present method is effective for learning DNN speech enhancement using an OSQAS.
 [Hardware configuration]
 The learning device 11 and the speech enhancement device 12 in each embodiment are devices configured by, for example, a general-purpose or dedicated computer including a processor (hardware processor) such as a CPU (central processing unit) and memories such as a RAM (random-access memory) and a ROM (read-only memory) executing a predetermined program. This computer may include a single processor and memory, or a plurality of processors and memories. The program may be installed in the computer or may be recorded in advance in a ROM or the like. Some or all of the processing units may be configured not by an electronic circuit (circuitry) such as a CPU that realizes a functional configuration by reading a program, but by an electronic circuit that realizes the processing functions on its own. An electronic circuit constituting a single device may include a plurality of CPUs.
 FIG. 8 is a block diagram illustrating the hardware configuration of the learning device 11 and the speech enhancement device 12 in each embodiment. As illustrated in FIG. 8, the learning device 11 and the speech enhancement device 12 of this example include a CPU (Central Processing Unit) 10a, an output unit 10b, an output unit 10c, a RAM (Random Access Memory) 10d, a ROM (Read Only Memory) 10e, an auxiliary storage device 10f, and a bus 10g. The CPU 10a of this example includes a control unit 10aa, an arithmetic unit 10ab, and a register 10ac, and executes various arithmetic processes in accordance with various programs read into the register 10ac. The output unit 10b is an output terminal, a display, or the like from which data is output. The output unit 10c is a LAN card or the like controlled by the CPU 10a that has read a predetermined program. The RAM 10d is an SRAM (Static Random Access Memory), a DRAM (Dynamic Random Access Memory), or the like, and has a program area 10da in which a predetermined program is stored and a data area 10db in which various data are stored. The auxiliary storage device 10f is, for example, a hard disk, an MO (Magneto-Optical disc), a semiconductor memory, or the like, and has a program area 10fa in which a predetermined program is stored and a data area 10fb in which various data are stored. The bus 10g connects the CPU 10a, the output unit 10b, the output unit 10c, the RAM 10d, the ROM 10e, and the auxiliary storage device 10f so that information can be exchanged among them. In accordance with a read OS (Operating System) program, the CPU 10a writes the program stored in the program area 10fa of the auxiliary storage device 10f to the program area 10da of the RAM 10d. Similarly, the CPU 10a writes the various data stored in the data area 10fb of the auxiliary storage device 10f to the data area 10db of the RAM 10d. The addresses on the RAM 10d at which the program and data have been written are stored in the register 10ac of the CPU 10a. The control unit 10aa of the CPU 10a sequentially reads out these addresses stored in the register 10ac, reads the program and data from the areas on the RAM 10d indicated by the read addresses, causes the arithmetic unit 10ab to sequentially execute the operations indicated by the program, and stores the operation results in the register 10ac. With such a configuration, the functional configurations of the learning device 11 and the speech enhancement device 12 are realized.
 The above-described program can be recorded on a computer-readable recording medium. An example of a computer-readable recording medium is a non-transitory recording medium. Examples of such a recording medium include a magnetic recording device, an optical disc, a magneto-optical recording medium, and a semiconductor memory.
 This program is distributed, for example, by selling, transferring, or lending a portable recording medium such as a DVD or a CD-ROM on which the program is recorded. The program may also be distributed by storing it in a storage device of a server computer and transferring it from the server computer to another computer via a network. As described above, a computer that executes such a program first stores, for example, the program recorded on the portable recording medium or the program transferred from the server computer in its own storage device. When executing the processing, the computer reads the program stored in its own storage device and executes the processing in accordance with the read program. As another form of execution, the computer may read the program directly from the portable recording medium and execute the processing in accordance with the program, or each time the program is transferred from the server computer to the computer, the computer may sequentially execute the processing in accordance with the received program. The above-described processing may also be executed by a so-called ASP (Application Service Provider) type service that realizes the processing functions only through execution instructions and result acquisition, without transferring the program from the server computer to the computer. The program in this embodiment includes information that is used for processing by an electronic computer and is equivalent to a program (such as data that is not a direct command to the computer but has a property of defining the processing of the computer).
 In each embodiment, the present device is configured by executing a predetermined program on a computer, but at least a part of the processing contents may be realized by hardware.
 [Modifications]
 The present invention is not limited to the above-described embodiment. For example, the OSQAS is not limited to PESQ, and may be any value as long as it is an objective index imitating a human subjective sound quality evaluation.
 In step 3, the model M_θ was trained first, but the approximation function D_φ(s, ·) may instead be trained first in step 3. Although a DNN was used in the above-described embodiment, other models such as a probabilistic model may be used.
 The various kinds of processing described above are not only executed in time series in accordance with the description, but may also be executed in parallel or individually according to the processing capability of the device executing the processing or as needed. It goes without saying that other modifications can be made as appropriate without departing from the spirit of the present invention.
11 Learning device
12 Speech enhancement device

Claims (8)

  1.  A learning method comprising:
    an approximation function learning step of updating a first approximation function to obtain a second approximation function so as to minimize a first cost function based on a sum of
    an error between a subjective sound quality evaluation of a target sound and an objective index imitating the subjective sound quality evaluation of the target sound, the objective index being obtained by inputting a target speech signal representing the target sound into the first approximation function, which approximates an objective index imitating a subjective sound quality evaluation of an input,
    an error between a subjective sound quality evaluation of an observed sound based on the target sound and an objective index imitating the subjective sound quality evaluation of the observed sound, the objective index being obtained by inputting an observation signal representing the observed sound into the first approximation function, and
    an error between a subjective sound quality evaluation of an emphasized sound corresponding to a post-mask sound signal obtained by applying a first mask to the observation signal and an objective index imitating the subjective sound quality evaluation of the emphasized sound, the objective index being obtained by inputting an emphasized speech signal representing the emphasized sound into the first approximation function.
  2.  The learning method according to claim 1, wherein
    the first mask is obtained by applying a first model to the observation signal, and
    the approximation function learning step is a step of updating the first approximation function to obtain the second approximation function so as to minimize the first cost function without updating the first model.
  3.  The learning method according to claim 2, further comprising:
    a mask learning step of updating the first model to obtain a second model so as to minimize a second cost function based on an expected value of a second objective index imitating the subjective sound quality evaluation of the emphasized sound, the second objective index being obtained by inputting the emphasized speech signal representing the emphasized sound into the second approximation function.
  4.  A speech enhancement method comprising:
    a mask estimation step of estimating a mask by applying an observed sound to the second model of claim 3; and
    a mask application step of applying the mask to the observed sound to obtain a post-mask speech signal.
  5.  A learning device comprising:
    an approximation function learning unit that updates a first approximation function to obtain a second approximation function so as to minimize a first cost function based on a sum of
    an error between a subjective sound quality evaluation of a target sound and an objective index imitating the subjective sound quality evaluation of the target sound, the objective index being obtained by inputting a target speech signal representing the target sound into the first approximation function, which approximates an objective index imitating a subjective sound quality evaluation of an input,
    an error between a subjective sound quality evaluation of an observed sound based on the target sound and an objective index imitating the subjective sound quality evaluation of the observed sound, the objective index being obtained by inputting an observation signal representing the observed sound into the first approximation function, and
    an error between a subjective sound quality evaluation of an emphasized sound corresponding to a post-mask sound signal obtained by applying a first mask to the observation signal and an objective index imitating the subjective sound quality evaluation of the emphasized sound, the objective index being obtained by inputting an emphasized speech signal representing the emphasized sound into the first approximation function.
  6.  A speech enhancement device comprising:
    a mask estimation unit that estimates a mask by applying an observed sound to the second model of claim 3; and
    a mask application unit that applies the mask to the observed sound to obtain a post-mask speech signal.
  7.  A program for causing a computer to execute the learning method according to any one of claims 1 to 3.
  8.  A program for causing a computer to execute the speech enhancement method according to claim 4.
PCT/JP2020/002270 2020-01-23 2020-01-23 Learning device, speech emphasis device, methods therefor, and program WO2021149213A1 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
PCT/JP2020/002270 WO2021149213A1 (en) 2020-01-23 2020-01-23 Learning device, speech emphasis device, methods therefor, and program

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/JP2020/002270 WO2021149213A1 (en) 2020-01-23 2020-01-23 Learning device, speech emphasis device, methods therefor, and program

Publications (1)

Publication Number Publication Date
WO2021149213A1 true WO2021149213A1 (en) 2021-07-29

Family

ID=76993302

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/JP2020/002270 WO2021149213A1 (en) 2020-01-23 2020-01-23 Learning device, speech emphasis device, methods therefor, and program

Country Status (1)

Country Link
WO (1) WO2021149213A1 (en)

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JPH0689095A (en) * 1992-09-08 1994-03-29 Nippon Telegr & Teleph Corp <Ntt> Acoustic signal selector
JP2006313181A (en) * 2005-05-06 2006-11-16 Nissan Motor Co Ltd Voice input device and voice input method

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JPH0689095A (en) * 1992-09-08 1994-03-29 Nippon Telegr & Teleph Corp <Ntt> Acoustic signal selector
JP2006313181A (en) * 2005-05-06 2006-11-16 Nissan Motor Co Ltd Voice input device and voice input method

Similar Documents

Publication Publication Date Title
Magron et al. Model-based STFT phase recovery for audio source separation
JP6623376B2 (en) Sound source enhancement device, its method, and program
Pigoli et al. The statistical analysis of acoustic phonetic data: Exploring differences between spoken Romance languages
Hwang et al. LP-WaveNet: Linear prediction-based WaveNet speech synthesis
WO2020045313A1 (en) Mask estimation device, mask estimation method, and mask estimation program
Llombart et al. Progressive loss functions for speech enhancement with deep neural networks
JP6721165B2 (en) Input sound mask processing learning device, input data processing function learning device, input sound mask processing learning method, input data processing function learning method, program
Ackermann et al. Comparative evaluation of interpolation methods for the directivity of musical instruments
Ben Kheder et al. Robust speaker recognition using map estimation of additive noise in i-vectors space
WO2021149213A1 (en) Learning device, speech emphasis device, methods therefor, and program
Yang et al. Don’t separate, learn to remix: End-to-end neural remixing with joint optimization
JP4981579B2 (en) Error correction model learning method, apparatus, program, and recording medium recording the program
JP2007304445A (en) Repair-extraction method of frequency component, repair-extraction device of frequency component, repair-extraction program of frequency component, and recording medium which records repair-extraction program of frequecy component
GB2622654A (en) Patched multi-condition training for robust speech recognition
Gabrielli et al. A multi-stage algorithm for acoustic physical model parameters estimation
Lü et al. Feature compensation based on independent noise estimation for robust speech recognition
US11676619B2 (en) Noise spatial covariance matrix estimation apparatus, noise spatial covariance matrix estimation method, and program
US20200273480A1 (en) Sound source separating device, sound source separating method, and program
JP7264282B2 (en) Speech enhancement device, learning device, method thereof, and program
JP7156064B2 (en) Latent variable optimization device, filter coefficient optimization device, latent variable optimization method, filter coefficient optimization method, program
Moliner et al. Zero-shot blind audio bandwidth extension
Li et al. Robust Non‐negative matrix factorization with β‐divergence for speech separation
WO2021255925A1 (en) Target sound signal generation device, target sound signal generation method, and program
Sustek et al. Dealing with Unknowns in Continual Learning for End-to-end Automatic Speech Recognition.
Paniagua-Peñaranda et al. Assessing the robustness of recurrent neural networks to enhance the spectrum of reverberated speech

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 20915634

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

NENP Non-entry into the national phase

Ref country code: JP

122 Ep: pct application non-entry in european phase

Ref document number: 20915634

Country of ref document: EP

Kind code of ref document: A1