WO2017204226A1 - System and method for target sound signal restoration

Info

Publication number: WO2017204226A1
Application number: PCT/JP2017/019259
Authority: WO
Inventors: 宜昭坂東; 和佳吉井; 克寿糸山; 博奥乃
Original assignee: 国立大学法人京都大学
Priority date: 2016-05-23
Filing date: 2017-05-23
Publication date: 2017-11-30
Also published as: JP6886720B2; JPWO2017204226A1

Abstract

Provided are a system and a method for voice signal restoration that are able to restore, with a high degree of accuracy, a target sound signal from a noise-containing sound signal without using prior information. A common sparse component estimation unit 7 that receives, as input, amplitude spectrograms of M channels is provided for estimating a common sparse component containing a sparse time-frequency component which is highly likely to be commonly included in the amplitude spectrograms of the largest number of channels among the M channels. A phase restoration unit 9 restores a phase of the common sparse component so as to obtain a target sound complex spectrogram. A target sound signal conversion unit 13 converts the target sound complex spectrogram into a target sound signal, which is a temporal signal.

Description

Target sound signal restoration system and method

The present invention relates to a target sound signal restoration system and method for restoring a target sound signal disturbed by low rank noise from observed multi-channel sound signals.

Non-Patent Document 1 discloses a technique capable of separating only a target sound signal from a multi-channel sound signal obtained by observing a mixed sound of noise and the target sound signal.

Non-Patent Document 2 discloses a technique for restoring audio signals included in acoustic signals of a plurality of channels collected by a plurality of microphones. Specifically, a noise suppression method in which RPCA (Robust Principal Component Analysis) is applied to a microphone array for speech enhancement of an acoustic signal recorded by a flexible cable-like robot equipped with a microphone array is disclosed. In this technology, first, RPCA is applied to each channel, and the median is extracted and integrated in order to extract the common component of each channel.
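
The integration step described for Non-Patent Document 2 can be illustrated with a minimal sketch. It assumes that the per-channel sparse components have already been obtained by RPCA (not shown) and stacked into an array of shape (M, F, T). The function name `median_integrate` and the array layout are illustrative assumptions and are not taken from the cited document.

```python
import numpy as np

def median_integrate(sparse_per_channel):
    """Element-wise median across channels: (M, F, T) -> (F, T)."""
    return np.median(np.asarray(sparse_per_channel, dtype=float), axis=0)

# Toy example: 3 channels, 2 frequency bins, 2 time frames.
S_m = np.array([
    [[1.0, 0.0], [2.0, 0.0]],
    [[1.2, 0.0], [1.8, 0.1]],
    [[0.0, 0.0], [0.0, 0.0]],   # occluded microphone that recorded almost nothing
])
S_med = median_integrate(S_m)
```

With one occluded channel out of three, the median still follows the two good channels. If a majority of the channels are occluded, the median follows the occluded channels instead.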

In the technique described in Non-Patent Document 3, speech enhancement is performed using the low rank of the noise and the sparsity of the speech, without prior information on the noise.

Furthermore, in the technique described in Patent Document 1 (Japanese Patent No. 5752324), impulsive (sudden) noise is removed from a single channel acoustic signal. This technology has a high performance for removing impulsive noise, but on the other hand, the performance deteriorates for unexpected non-stationary noise (low rank noise).

In the technique described in Patent Document 2 (Japanese Patent Laid-Open No. 2009-116275), noise is suppressed from a single-channel acoustic signal based on the minimum mean square error (MMSE) method. Since MMSE assumes that the noise is stationary, its performance degrades when suppressing non-stationary noise.

Patent Document 3 (Japanese Patent Laid-Open No. 2014-503849) discloses a technique in which a handset microphone is placed near a noise source and voice enhancement is performed by actively using the information from that microphone. This technique requires that the position of the noise source be identified and that the handset microphone be placed near the noise source.

Patent Document 4 (Japanese Patent Laid-Open No. 2015-095897) discloses a technique for separating a background image and a moving object image by extracting a low-rank component and a sparse component from a video signal. Although this method can be applied to speech enhancement, its performance degrades greatly when some microphones cannot record the sound sufficiently because of obstacles.

In the technique described in Patent Document 5 (Japanese Patent Laid-Open No. 2014-058399), each sound source signal is separated and extracted from a multi-channel acoustic signal obtained by observing a mixed sound of an arbitrary number of sound source signals. In the present technology, it is assumed that the position of each microphone and the sound source position are fixed, and the performance deteriorates when they move.

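Patent Document 1: Japanese Patent No. 5752324
Patent Document 2: Japanese Patent Laid-Open No. 2009-116275
Patent Document 3: Japanese Patent Laid-Open No. 2014-503849
Patent Document 4: Japanese Patent Laid-Open No. 2015-095897
Patent Document 5: Japanese Patent Laid-Open No. 2014-058399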

However, in the technique shown in Non-Patent Document 1, the timbre information of the noise must be recorded in advance in order to suppress the noise and enhance the speech, so the technique is difficult to use when the noise changes depending on the environment in which the system is used.

In the technique disclosed in Non-Patent Document 2, the target speech is enhanced from the multi-channel acoustic signal by using the low rank of the noise and the sparseness of the speech. In this technique, the low-rank component and the sparse component are separated by applying robust principal component analysis to the amplitude spectrogram of each channel, and the speech is then enhanced by selecting, at each time-frequency point, the median of the sparse components across the microphones. Because this technique integrates the signals of all microphones with the median, its performance deteriorates greatly when some microphones cannot record the sound sufficiently because of an obstacle or the like.

The conventional technique shown in Non-Patent Document 3 is specialized for the analysis of real-valued matrices and is not suited to the analysis of non-negative matrices such as the amplitude spectrograms of acoustic signals. It has also been difficult to realize the multi-channel extension and the reliability estimation function needed for acoustic signal processing.

Conventionally, there has not been proposed a technique capable of robustly emphasizing voice even when voice is collected by a multi-channel microphone and even if some microphones cannot record the voice at a sufficiently high volume due to obstacles or the like. For example, a flexible cable-like rescue robot that enters a narrow gap in rubble and searches for a victim has a problem that it is difficult to hear the voice of the victim due to its own operational noise.

An object of the present invention is to provide an audio signal restoration system and method that can restore a target acoustic signal with high accuracy from an acoustic signal including noise without using prior information.

The present invention is directed to a target acoustic signal restoration system that restores a target acoustic signal included in an M-channel acoustic signal collected by M microphones (M is an integer of 2 or more). The system comprises a time-frequency analysis unit, an amplitude component extraction unit, a common sparse component estimation unit, a phase restoration unit, and a target acoustic signal conversion unit. The time-frequency analysis unit obtains an M-channel complex spectrogram by performing time-frequency analysis on the M-channel acoustic signal collected by the M microphones. The amplitude component extraction unit extracts an M-channel amplitude spectrogram from the M-channel complex spectrogram. The common sparse component estimation unit receives the M-channel amplitude spectrogram as input and estimates a common sparse component including a sparse frequency component that is likely to be commonly included in the amplitude spectrograms of the most channels among the M-channel amplitude spectrograms. As a representative example of the “amplitude spectrograms of the most channels among the M-channel amplitude spectrograms”, if two of the M microphones hardly collect the acoustic signal, the two-channel amplitude spectrograms obtained from the acoustic signals of those two microphones are not included in the “amplitude spectrograms of the most channels”. The (M-2)-channel amplitude spectrograms obtained from the acoustic signals collected by the remaining M-2 microphones become the “amplitude spectrograms of the most channels”. The phase restoration unit restores the phase of the common sparse component to obtain a target acoustic complex spectrogram. The phase may be estimated from the M-channel amplitude spectrogram, the common sparse component, or the like, and the method for obtaining the phase is arbitrary. The target acoustic signal conversion unit converts the target acoustic complex spectrogram into a target acoustic signal that is a time signal.

In the present invention, the common sparse component estimation unit estimates the common sparse component on the assumption that the acoustic signal of each channel is decomposed into a low rank component and a common sparse component shared between the channels. Noise is suppressed by restoring the phase of the common sparse component and converting the restored target acoustic complex spectrogram into the target acoustic signal. The model allows the content ratio with which the common sparse component is contained in each channel to differ from channel to channel (this content ratio may be referred to as “volume” in the present specification). By estimating this content ratio (volume), robust target acoustic signal enhancement is realized even when there is a microphone that cannot sufficiently record the target acoustic signal.

Specifically, in the present invention, speech enhancement is performed using the low rank of the noise and the sparseness of the target acoustic signal, without prior information on the noise. Because the present invention estimates the common sparse component including the sparse frequency component that is likely to be commonly included in the amplitude spectrograms of the most channels among the M-channel amplitude spectrograms, the target acoustic signal can be restored with minimal influence from the low rank components. As a result, the restoration accuracy can be made higher than before.

The present invention is specialized for the analysis of amplitude spectrograms, which are non-negative real-valued matrices, and does not depend on the usage environment of the system, such as the placement of the microphones. Even when information on the noise cannot be obtained in advance, target acoustic signal enhancement that operates robustly can be realized. Furthermore, even if some microphones cannot record the sound at a sufficiently high volume because of obstacles or the like, the target acoustic signal can be robustly enhanced. For example, in a flexible cable-shaped rescue robot that enters narrow gaps in rubble to search for victims, where the victim's voice is difficult to hear because of the robot's own operating noise, the victim's voice can be enhanced.

The common sparse component estimation unit obtains the sum of the residuals containing the M sparse components, which are obtained by removing the low rank components estimated in the i-th iteration from each of the M-channel amplitude spectrograms, divides this sum by the sum of the content ratios (volumes) with which the M sparse components are contained in the common sparse component, and takes the result as the estimate of the common sparse component for the i-th iteration. This content ratio gradually converges in the process of iterative estimation using an iterative estimation method such as the variational Bayes EM method or the sequential variational Bayes EM method.

When the iterative estimation method is used, the common sparse component estimation unit includes: a low rank component ratio calculation unit that calculates the ratio of the low rank components included in the M-channel amplitude spectrograms; a low rank component calculation unit that calculates the M low rank components included in the M-channel amplitude spectrograms based on the ratio of the low rank components; a sparse component ratio calculation unit that calculates the ratio of the sparse components included in the M-channel amplitude spectrograms; a residual component calculation unit that calculates, based on the ratio of the sparse components, the residual components containing the M sparse components included in the M-channel amplitude spectrograms; a volume calculation unit that calculates, based on the residual components containing the M sparse components and on the common sparse component, the content ratio with which the M sparse components are contained in the common sparse component, as the volumes of the M sparse components; and a common sparse component calculation unit that calculates the common sparse component by dividing the sum of the residual components containing the M sparse components by the sum of the volumes of the M sparse components. The common sparse component is estimated by iterating the calculations in the low rank component ratio calculation unit, the low rank component calculation unit, the sparse component ratio calculation unit, the residual component calculation unit, the volume calculation unit, and the common sparse component calculation unit. With this configuration, the common sparse component can be estimated with high accuracy by repeating the iterative calculation a predetermined number of times.

The common sparse component estimator can be configured by, for example, a Bayesian estimator that Bayes estimates common sparse components by a variational Bayes EM method or a sequential variational Bayes EM method. If a Bayesian estimator is used, a common sparse component can be easily estimated.

The present invention can also be specified as a target acoustic signal restoration method for restoring, using a computer, a target acoustic signal included in an M-channel acoustic signal collected by M microphones (M is an integer of 2 or more). In this method, the computer performs: a time-frequency analysis step of obtaining an M-channel complex spectrogram by performing time-frequency analysis on the M-channel acoustic signal collected by the M microphones; an amplitude component extraction step of extracting an M-channel amplitude spectrogram from the M-channel complex spectrogram; a common sparse component estimation step of receiving the M-channel amplitude spectrogram as input and estimating a common sparse component including a sparse time-frequency component that is likely to be commonly included in the amplitude spectrograms of the most channels among the M-channel amplitude spectrograms; a phase restoration step of restoring the phase of the common sparse component to obtain a target acoustic complex spectrogram; and a target acoustic signal conversion step of converting the target acoustic complex spectrogram into a target acoustic signal that is a time signal.

In the common sparse component estimation step, the sum of the residuals containing the M sparse components, which are obtained by removing the low rank components estimated in the i-th iteration from each of the M-channel amplitude spectrograms, is divided by the sum of the content ratios with which the M sparse components are contained in the common sparse component, and the result is taken as the estimate of the common sparse component for the i-th iteration.

The common sparse component estimation step includes: a low rank component ratio calculation step of calculating the ratio of the low rank components included in the M-channel amplitude spectrograms; a low rank component calculation step of calculating the M low rank components included in the M-channel amplitude spectrograms based on the ratio of the low rank components; a sparse component ratio calculation step of calculating the ratio of the sparse components included in the M-channel amplitude spectrograms; a residual component calculation step of calculating, based on the ratio of the sparse components, the residual components containing the M sparse components included in the M-channel amplitude spectrograms; a volume calculation step of calculating, based on the residual components containing the M sparse components and on the common sparse component, the content ratio with which the M sparse components are contained in the common sparse component, as the volumes of the M sparse components; and a common sparse component calculation step of calculating the common sparse component by dividing the sum of the residual components containing the M sparse components by the sum of the volumes of the M sparse components. The common sparse component is estimated by iterating the calculations in the low rank component ratio calculation step, the low rank component calculation step, the sparse component ratio calculation step, the residual component calculation step, the volume calculation step, and the common sparse component calculation step.

The present invention can also be specified as a computer program, stored in a computer-readable storage means, for realizing with a computer a target acoustic signal restoration method for restoring a target acoustic signal included in an M-channel acoustic signal collected by M microphones (M is an integer of 2 or more). This computer program causes the computer to realize: a time-frequency analysis step of obtaining an M-channel complex spectrogram by performing time-frequency analysis on the M-channel acoustic signal collected by the M microphones; an amplitude component extraction step of extracting an M-channel amplitude spectrogram from the M-channel complex spectrogram; a common sparse component estimation step of receiving the M-channel amplitude spectrogram as input and estimating a common sparse component including a sparse time-frequency component that is likely to be commonly included in the amplitude spectrograms of the most channels among the M-channel amplitude spectrograms; a phase restoration step of restoring the phase of the common sparse component to obtain a target acoustic complex spectrogram; and a target acoustic signal conversion step of converting the target acoustic complex spectrogram into a target acoustic signal that is a time signal.

A block diagram showing the configuration of an example of an embodiment of the target acoustic signal restoration system of the present invention.
A flowchart showing the algorithm of the computer program used when the common sparse component estimation unit of FIG. 1 is realized by the iterative estimation method using a computer.
A photograph of a flexible cable-shaped robot equipped with a microphone array.
A diagram used to explain arrangement conditions 1 and 2 of the robot and the speaker (sound source) that reproduces the sound.
(A) and (B) are diagrams showing the speech enhancement performance, as signal-to-distortion ratio (SDR), under the arrangement conditions 1 and 2 and the SNR conditions.
A diagram in which 4 of the 8 channels of the multi-channel amplitude spectrogram are extracted.
A diagram in which the low rank components Lm of 4 of the 8 channels are extracted.
A diagram showing the volume gm in the common sparse component for 4 of the 8 channels.
A diagram showing the common sparse component.
A diagram showing the enhancement result with MNMF IV.
A diagram showing the enhancement result with Med-RPCA IV.
A diagram showing the enhancement result with RPCA IV.
A diagram showing the result of restoring the phase of the common sparse component to obtain the target acoustic complex spectrogram and converting this target acoustic complex spectrogram into the target acoustic signal, which is a time signal.

Hereinafter, embodiments of the present invention will be described in detail with reference to the drawings.

(Configuration of the embodiment)
FIG. 1 is a block diagram showing the configuration of an example of an embodiment of the target acoustic signal restoration system of the present invention, realized by using a computer or a plurality of processors and a plurality of memories. In the present embodiment, specifically, a speech signal is extracted as the target acoustic signal from a multi-channel acoustic signal obtained by observing speech disturbed by low rank noise. Even when each microphone moves and some of the microphones cannot observe the speech at a sufficiently large volume because of obstacles, the speech signal can be extracted robustly. The target acoustic signal restoration system 1 according to the present embodiment restores a target acoustic signal included in an M-channel acoustic signal collected by M microphones (M is an integer of 2 or more).

The target acoustic signal restoration system 1 according to the present embodiment is realized by a computer, or by using one or more processors and one or more memories, and comprises a time-frequency analysis unit 3, an amplitude component extraction unit 5, a common sparse component estimation unit 7, a phase restoration unit 9, a phase component extraction unit 11, and a target acoustic signal conversion unit 13. The time-frequency analysis unit 3 performs time-frequency analysis on the M-channel acoustic signal collected by M microphones (M is an integer of 2 or more) provided, for example, on the flexible cable-shaped robot shown in Non-Patent Document 2 “Speech enhancement for flexible cable-shaped robot based on motion noise suppression using robust principal component analysis”, and obtains an M-channel complex spectrogram. The amplitude component extraction unit 5 extracts an M-channel amplitude spectrogram from the M-channel complex spectrogram. The common sparse component estimation unit 7 receives the M-channel amplitude spectrogram as input and estimates a common sparse component including a sparse frequency component that is likely to be commonly included in the amplitude spectrograms of the most channels among the M-channel amplitude spectrograms. The phase restoration unit 9 restores the phase of the common sparse component to obtain a target acoustic complex spectrogram. In the present embodiment, the phase information is extracted from the output of the time-frequency analysis unit 3 by the phase component extraction unit 11. The phase may instead be estimated from the M-channel amplitude spectrogram, the common sparse component, or the like, and the method for obtaining the phase is arbitrary. Therefore, the present invention is not limited to a configuration provided with the phase component extraction unit 11. The target acoustic signal conversion unit 13 converts the target acoustic complex spectrogram into a target acoustic signal that is a time signal.
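
The signal flow through units 3 to 13 can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation of the embodiment: the Hann-windowed STFT with weighted overlap-add inversion, the frame sizes, the function names, and the choice of taking the phase from a single reference channel (`ref_channel`) are all assumptions, and the estimator of unit 7 is passed in as a placeholder callable.

```python
import numpy as np

def stft(x, n_fft=512, hop=128):
    """Hann-windowed STFT of a 1-D signal; returns a complex (F, T) array."""
    win = np.hanning(n_fft)
    n_frames = 1 + (len(x) - n_fft) // hop
    frames = np.stack([x[i * hop:i * hop + n_fft] * win for i in range(n_frames)], axis=1)
    return np.fft.rfft(frames, axis=0)

def istft(Z, n_fft=512, hop=128):
    """Weighted overlap-add inverse of stft()."""
    win = np.hanning(n_fft)
    frames = np.fft.irfft(Z, n=n_fft, axis=0)
    out = np.zeros(n_fft + hop * (Z.shape[1] - 1))
    norm = np.zeros_like(out)
    for i in range(Z.shape[1]):
        out[i * hop:i * hop + n_fft] += frames[:, i] * win
        norm[i * hop:i * hop + n_fft] += win ** 2
    return out / np.maximum(norm, 1e-12)

def restore_target_signal(x_multi, estimate_common_sparse, ref_channel=0):
    """x_multi: (M, N) time signals. Returns the restored 1-D time signal."""
    Z = np.stack([stft(x) for x in x_multi])   # unit 3: (M, F, T) complex spectrograms
    Y = np.abs(Z)                              # unit 5: amplitude spectrograms
    S = estimate_common_sparse(Y)              # unit 7: (F, T) common sparse component
    phase = np.angle(Z[ref_channel])           # unit 11: phase of one channel (assumption)
    return istft(S * np.exp(1j * phase))       # units 9 and 13
```

For example, `restore_target_signal(x, lambda Y: Y[0])` uses the first channel's amplitude as a stand-in estimate and reproduces that channel away from the signal edges, which is a convenient check that the analysis and synthesis stages are consistent.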

In this embodiment, the common sparse component estimation unit 7 uses an iterative estimation method. Therefore, the common sparse component estimation unit 7 includes a low rank component ratio calculation unit 71, a low rank component calculation unit 72, a sparse component ratio calculation unit 73, a residual component calculation unit 74, a volume calculation unit 75, and a common sparse component calculation unit 76. The low rank component ratio calculation unit 71 calculates the ratio of the low rank components included in the M-channel amplitude spectrograms. The low rank component calculation unit 72 calculates the M low rank components included in the M-channel amplitude spectrograms based on the ratio of the low rank components. The sparse component ratio calculation unit 73 calculates the ratio of the sparse components included in the M-channel amplitude spectrograms. The residual component calculation unit 74 then calculates, based on the ratio of the sparse components, the residual components containing the M sparse components included in the M-channel amplitude spectrograms. The volume calculation unit 75 calculates, based on the residual components containing the M sparse components and on the common sparse component, the content ratio with which the M sparse components are contained in the common sparse component, as the volumes of the M sparse components. This content ratio is obtained in the process of iterative estimation when an iterative estimation method such as the variational Bayes EM method or the sequential variational Bayes EM method is used. The common sparse component calculation unit 76 divides the sum of the residuals containing the M sparse components, which are obtained by removing the low rank components estimated in the i-th iteration from each of the M-channel amplitude spectrograms, by the sum of the content ratios with which the M sparse components are contained in the common sparse component, and takes the result as the common sparse component for the i-th iteration.

The common sparse component is then estimated by iterating the calculations in the low rank component ratio calculation unit 71, the low rank component calculation unit 72, the sparse component ratio calculation unit 73, the residual component calculation unit 74, the volume calculation unit 75, and the common sparse component calculation unit 76.
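
The loop over units 71 to 76 can be sketched with point estimates. Note that this is a simplification and not the variational Bayes EM procedure of the embodiment: it omits all prior distributions and uses EM-style multiplicative updates for a KL-divergence objective, with the low rank component of each channel written as a product of a basis matrix and an activation matrix. The function name, the number of bases `K`, the initialization, and the small constant `eps` are assumptions made for illustration.

```python
import numpy as np

def estimate_common_sparse(Y, K=4, n_iter=200, seed=0, eps=1e-12):
    """Y: (M, F, T) non-negative amplitudes. Returns S (F, T), g (M, T), L (M, F, T)."""
    rng = np.random.default_rng(seed)
    M, F, T = Y.shape
    W = rng.random((M, F, K)) + eps     # bases of the low rank components
    H = rng.random((M, K, T)) + eps     # activations of the low rank components
    g = np.ones((M, T))                 # volume (content ratio) per channel and frame
    S = Y.mean(axis=0) + eps            # common sparse component

    for _ in range(n_iter):
        # Units 71 and 72: ratio of the low rank part, then the low rank factors.
        X = W @ H + g[:, None, :] * S + eps
        W *= ((Y / X) @ H.transpose(0, 2, 1)) / (H.sum(axis=2)[:, None, :] + eps)
        X = W @ H + g[:, None, :] * S + eps
        H *= (W.transpose(0, 2, 1) @ (Y / X)) / (W.sum(axis=1)[:, :, None] + eps)
        # Unit 73: ratio of the sparse part in each channel.
        X = W @ H + g[:, None, :] * S + eps
        ratio_s = g[:, None, :] * S / X
        # Unit 74: residual component containing the sparse component.
        resid = Y * ratio_s
        # Unit 75: volume of the sparse component in each channel and frame.
        g = resid.sum(axis=1) / (S.sum(axis=0) + eps)
        # Unit 76: sum of the residuals divided by the sum of the volumes.
        S = resid.sum(axis=0) / (g.sum(axis=0) + eps)
    return S, g, W @ H
```

Because no priors are used here, the overall scale shared between `g` and `S` is not pinned down, and nothing enforces sparsity of `S`; in the embodiment these roles are played by the prior distributions described below.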

In the present embodiment, the common sparse component estimation unit 7 estimates the common sparse component on the assumption that “the acoustic signal of each channel is decomposed into a low rank component and a common sparse component shared between the channels”. Noise is suppressed by restoring the phase of the common sparse component with the phase restoration unit 9 and converting the restored target acoustic complex spectrogram with the target acoustic signal conversion unit 13. In the present embodiment, the model allows the content ratio (volume) with which the common sparse component is contained in each channel to differ from channel to channel. As a result, robust speech enhancement is realized even when there is a microphone that cannot sufficiently record the acoustic signal (speech or the like).

Specifically, in the present invention, the target acoustic signal (speech or the like) is enhanced using the low rank of the noise and the sparsity of the target acoustic signal (speech or the like), without prior information on the noise. Because the present invention estimates the common sparse component including the sparse frequency component that is likely to be commonly included in the amplitude spectrograms of the most channels among the M-channel amplitude spectrograms, the target acoustic signal can be restored with minimal influence from the low rank components. As a result, the restoration accuracy can be made higher than before.

This embodiment is specialized for the analysis of non-negative real-valued matrices, and can realize robust target acoustic signal enhancement using the low rank of the noise and the sparseness of the target acoustic signal such as speech, without the timbre information of the noise signal having to be recorded in advance. With this feature, enhancement of the target acoustic signal (speech or the like) that operates robustly can be realized even in environments where there are many obstacles around the microphone array, or when information on the noise cannot be obtained in advance. Furthermore, even when some microphones cannot record the sound at a sufficiently high volume because of obstacles or the like, the target acoustic signal (speech or the like) can be robustly enhanced.

(Description of generation model)
A generation model employed in the present embodiment will be described. In the following description, the target acoustic signal may be referred to as an audio signal for convenience. The speech enhancement problem handled by this generation model is defined below.

Input: M-channel amplitude spectrograms $Y_m \in \mathbb{R}_+^{F \times T}$ ($m = 1, \dots, M$)

Output: amplitude spectrogram of the target speech $S \in \mathbb{R}_+^{F \times T}$

Here, $\mathbb{R}_+$ represents the non-negative real numbers, and F and T represent the number of frequency bins and the number of time frames, respectively. In the following description, the amplitude spectrogram of a channel is called an acoustic signal, and f and t are the frequency bin and time frame indices.

In this embodiment, the relationship between the speech signal $S = [s_1, \dots, s_T]$ and its observation at each microphone $Y'_m = [y'_{m1}, \dots, y'_{mT}]$ is assumed to be a frequency-independent, time-varying linear transformation, $y'_{mt} = g_{mt} s_t$.

Here, $g_{mt}$ represents the coefficient of this linear transformation (the volume of the speech at each microphone, that is, the content ratio of the sparse component at each microphone). Using this speech transmission model, the microphone observation $y_{mt}$ is assumed to decompose into a low rank noise component and the scaled speech component, $y_{mt} \approx l_{mt} + g_{mt} s_t$.

Here, $L_m = [l_{m1}, \dots, l_{mT}]$ and $S = [s_1, \dots, s_T]$ represent the low rank noise mixed into each microphone and the sparse component representing the speech, respectively. The low rank component is further decomposed into K basis vectors $W_m = [w_{m1}, \dots, w_{mK}]$ (basis matrix) and their volumes $H_m = [h_{m1}, \dots, h_{mT}]$ (activation matrix).
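
To make the shapes in this model concrete, the following sketch draws toy data in which each channel is the sum of a low rank term and a common sparse term scaled by a frequency-independent, time-varying volume. The function name, the uniform random draws, and the sparsity level are illustrative assumptions and do not come from the embodiment.

```python
import numpy as np

def synthesize_observations(M=4, F=64, T=100, K=3, sparsity=0.05, seed=0):
    """Draw toy amplitude spectrograms Y_m = W_m H_m + g_m * S."""
    rng = np.random.default_rng(seed)
    W = rng.random((M, F, K))            # K basis vectors per channel
    H = rng.random((M, K, T))            # volume of each basis at each frame
    L = W @ H                            # low rank noise: rank at most K per channel
    # Common sparse component: most time-frequency points are exactly zero.
    S = rng.random((F, T)) * (rng.random((F, T)) < sparsity)
    g = rng.random((M, T))               # volume per channel and frame
    Y = L + g[:, None, :] * S            # broadcast g over the frequency axis
    return Y, L, S, g

Y, L, S, g = synthesize_observations()
```

Data generated this way can serve as a sanity check for an estimator, since the true `S` and `g` are known.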

Hereinafter, a Bayesian generation model for modeling the low rank likelihood of the low rank components of each channel and the sparseness of the common sparse components will be described.

(Likelihood model)
In this model, the approximation error of the input amplitude spectrogram is evaluated using the Kullback-Leibler (KL) divergence. In a Bayesian generation model, minimizing the KL divergence corresponds to maximizing a Poisson likelihood. In this model, the likelihood model is defined as follows.

Here, $P(x \mid \lambda)$ represents a Poisson distribution with parameter $\lambda$. In equation (4), $Y_m$ is the m-th amplitude spectrogram of the M-channel amplitude spectrograms, $H_m$ is the volume of each basis at each time, $W_m$ is the K bases, $S$ is the common sparse component, and $G_m$ is the content ratio of the sparse component at the microphone. $s_{ft}$, $g_{mt}$, $w_{mfk}$, and $h_{mkt}$ represent the elements of $S$, $G_m$, $W_m$, and $H_m$, respectively, and $y_{mft}$ denotes an element of the observed amplitude spectrogram $Y_m$.
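
Equation (4) itself is not reproduced in this text. A form consistent with the definitions above would be $y_{mft} \sim P\bigl(y_{mft} \mid \sum_k w_{mfk} h_{mkt} + g_{mt} s_{ft}\bigr)$; this is an assumption. The stated correspondence between the KL divergence and the Poisson likelihood can be checked numerically: the Poisson log-likelihood and the generalized KL divergence sum to a quantity that does not depend on the model value. The sketch below uses the gamma function to extend the Poisson log-likelihood to non-integer amplitudes; the function names are illustrative.

```python
import math

def poisson_log_likelihood(y, x):
    """log P(y | x), with y! generalized to real y >= 0 via the gamma function."""
    return y * math.log(x) - x - math.lgamma(y + 1.0)

def generalized_kl(y, x):
    """Generalized KL divergence D(y || x) for positive scalars."""
    return y * math.log(y / x) - y + x

y = 3.0
sums = [poisson_log_likelihood(y, x) + generalized_kl(y, x) for x in (0.5, 2.0, 3.0, 7.5)]
# Every entry of `sums` equals y*log(y) - y - lgamma(y + 1), independent of x,
# so minimizing D(y || x) over x is the same as maximizing log P(y | x).
```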

(Prior distribution of low rank components)
The prior distribution of the low-rank component basis matrix and activation matrix is formulated using the Gamma distribution, which is the conjugate prior distribution of the Poisson distribution.

Here, $G(x \mid \alpha, \beta)$ represents a Gamma distribution having a shape parameter $\alpha$ and a rate parameter $\beta$. The shape and rate parameters of these prior distributions are the hyperparameters of the basis and the activation, respectively. In this model, by setting the shape parameter to 1 or less, the basis matrix and the activation matrix can be constrained to be sparse, whereby the low rank component L is constrained to be a low rank matrix.
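
The effect of the shape parameter can be seen by sampling. The sketch below compares how much of the probability mass of a Gamma distribution lies close to zero for a shape parameter below 1 and for one above 1. The specific values (0.1 and 2.0), the rate of 1, and the threshold are arbitrary illustrative choices, not values from the embodiment.

```python
import numpy as np

rng = np.random.default_rng(0)

def fraction_near_zero(shape, rate=1.0, n=100_000, threshold=0.01):
    """Fraction of Gamma(shape, rate) samples that fall below `threshold`."""
    # NumPy parameterizes the Gamma distribution by scale, which is 1 / rate.
    samples = rng.gamma(shape, 1.0 / rate, size=n)
    return float(np.mean(samples < threshold))

sparse_frac = fraction_near_zero(shape=0.1)   # shape < 1: density diverges at zero
dense_frac = fraction_near_zero(shape=2.0)    # shape > 1: density vanishes at zero
```

With a shape parameter well below 1, a large share of the samples is nearly zero, which is the sense in which such a prior favors sparse bases and activations.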

(A prior distribution of sparse components)
In the generation model of Bayesian RPCA, which is one of the conventional methods, a Gaussian prior distribution is placed on the sparse component, and sparseness is expressed by placing a Jeffreys hyperprior distribution on its precision parameter. In this embodiment, in order to express the amplitude spectrogram, which is a non-negative matrix, a Gamma distribution is used as the prior distribution of the sparse component, and the sparse component is modeled by placing a Jeffreys hyperprior distribution on the rate parameter of the Gamma distribution, which corresponds to the precision parameter of the Gaussian distribution.

Here, the shape parameter $\alpha^s$ represents the hyperparameter of the Gamma distribution. In the proposed model, the sparseness of the sparse component is adjusted by the value of this hyperparameter.
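
The displayed equations for this prior are not reproduced in this text. One form consistent with the description above is sketched here as an assumption: a Gamma prior on each element of the sparse component with its own rate parameter, and the Jeffreys prior on that rate, which for a rate (inverse scale) parameter is proportional to its reciprocal. The symbol $\alpha^s$ for the shape hyperparameter is likewise assumed by analogy with the other hyperparameters.

```latex
\begin{align}
  s_{ft} \mid \beta^{s}_{ft} &\sim G\bigl(s_{ft} \mid \alpha^{s},\, \beta^{s}_{ft}\bigr), \\
  p\bigl(\beta^{s}_{ft}\bigr) &\propto \bigl(\beta^{s}_{ft}\bigr)^{-1}.
\end{align}
```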

(Prior distribution of volume variable)
A Gamma distribution, which is the conjugate prior distribution of the Poisson distribution, is placed on the volume (content ratio) $g_{mt}$ of the sparse component at each microphone.

Here, $\alpha^g$ is a hyperparameter for adjusting the variation in the volume of the sparse component at each microphone.
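
The displayed equation for this prior is not reproduced in this text. Since a single hyperparameter is said to control the variation of the volume, one plausible form, given here purely as an assumption, ties the shape and the rate together so that the mean is fixed at 1 and only the variance depends on $\alpha^g$:

```latex
\begin{align}
  g_{mt} &\sim G\bigl(g_{mt} \mid \alpha^{g},\, \alpha^{g}\bigr), &
  \mathbb{E}[g_{mt}] &= \frac{\alpha^{g}}{\alpha^{g}} = 1, &
  \operatorname{Var}[g_{mt}] &= \frac{\alpha^{g}}{(\alpha^{g})^{2}} = \frac{1}{\alpha^{g}}.
\end{align}
```

Under this assumed form, a large $\alpha^g$ keeps the volumes of all microphones close to each other, while a small $\alpha^g$ lets an occluded microphone take a volume near zero.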

(Bayesian inference by variational Bayes EM method)
Since it is difficult to analytically derive the posterior distribution of this model when the input multichannel amplitude spectrogram is obtained, approximate inference is performed using the variational Bayes EM method. Below

Represents a set of all parameters, and q (•) is a variational posterior distribution. In variational approximation, the posterior distribution is decomposed and approximated as follows, and inference is performed by minimizing the KL pseudorange with the true posterior distribution.
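
The factorized form is not reproduced in this text. A fully factorized (mean-field) approximation over the variables named in this description would read as follows; the exact grouping of factors used in the embodiment is an assumption.

```latex
\begin{equation}
  q(\Theta) =
  \prod_{m,f,k} q(w_{mfk})
  \prod_{m,k,t} q(h_{mkt})
  \prod_{m,t} q(g_{mt})
  \prod_{f,t} q(s_{ft})\, q(\beta^{s}_{ft}).
\end{equation}
```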

Since the model of this embodiment is built from the conjugate exponential family, each update rule can be easily derived by using Jensen's inequality and the method of Lagrange multipliers. With ⟨·⟩ denoting the expectation of a random variable, each posterior distribution is obtained by iteratively applying the following update rules with the other parameters fixed.

Here, s′_mft represents the residual that remains after the low rank component is removed,

and

represent the ratio of the low rank component and the ratio of the sparse component, respectively.
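The role of these ratios can be sketched for a single channel as follows: each observed amplitude is divided between the K low rank terms and the sparse term in proportion to their current sizes, and the share attributed to the sparse term gives the residual. This is a simplified sketch that uses point values of the parameters, whereas the update rules referred to above are stated in terms of expectations under the variational posterior. The function name `split_ratios`, the array shapes, and the small constant `eps` are assumptions.

```python
import numpy as np


def split_ratios(W, H, g, S, eps=1e-12):
    """Split ratios for one channel (simplified, point-estimate version).

    W: (F, K) bases, H: (K, T) activations, g: (T,) volume, S: (F, T) sparse.
    Returns lam_low with shape (F, T, K) and lam_sparse with shape (F, T).
    """
    low = W[:, None, :] * H.T[None, :, :]   # (F, T, K): w_fk * h_kt
    sparse = g[None, :] * S                 # (F, T):    g_t * s_ft
    total = low.sum(axis=2) + sparse + eps  # model amplitude at each (f, t)
    return low / total[:, :, None], sparse / total


rng = np.random.default_rng(0)
F, T, K = 4, 5, 2
W = rng.random((F, K)) + 0.1
H = rng.random((K, T)) + 0.1
g = rng.random(T) + 0.1
S = rng.random((F, T)) + 0.1
Y = rng.random((F, T)) + 0.1

lam_low, lam_sparse = split_ratios(W, H, g, S)
residual = Y * lam_sparse  # part of the observation attributed to the sparse term
```

By construction the ratios at each time-frequency bin sum to one, so the residual never exceeds the observed amplitude.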

FIG. 2 is a flowchart showing the algorithm of a computer program used when the common sparse component estimation unit 7 of FIG. 1 is realized by an iterative estimation method using a computer. In FIG. 2, the corresponding expression is attached to each step in which one of the above expressions (11) to (18) is used. The iterative estimation terminates either after 200 iterations or when each of the estimated values Ym, Hm, Wm, S, β^s, and g_m is compared with its value from the previous iteration and the two are approximately equal. In FIG. 2, Ym is the m-th amplitude spectrogram of the M-channel amplitude spectrograms, Hm is the volume of each basis at each time, Wm is the K bases, S is the common sparse component, β^s is a coefficient for Bayesian estimation, and g_m is the content ratio of the sparse component.
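The termination rule described for FIG. 2 can be sketched as a generic loop that stops after a fixed number of iterations or when every tracked value barely changes. The relative tolerance of 1e-4, the function name, and the toy update (a Newton step for the square root of 2, standing in for the real parameter updates) are assumptions; the text only states the limit of 200 iterations and an approximate-equality check.

```python
def iterate_until_converged(update, state, max_iters=200, rel_tol=1e-4):
    """Apply `update` repeatedly; stop after `max_iters` iterations or when
    every tracked value changes by at most `rel_tol` relative to its size."""
    for i in range(1, max_iters + 1):
        new_state = update(state)
        converged = all(
            abs(new - old) <= rel_tol * max(abs(old), 1e-12)
            for new, old in zip(new_state, state)
        )
        state = new_state
        if converged:
            return state, i
    return state, max_iters


# Toy update standing in for the real updates of H, W, S, beta and g.
final_state, n_iters = iterate_until_converged(
    lambda s: (0.5 * (s[0] + 2.0 / s[0]),), (1.0,)
)
print(final_state, n_iters)
```

In a real implementation the state would hold summary statistics of the estimated quantities, and the comparison would typically be done on array norms rather than on scalars.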

The common sparse component calculation unit 76 calculates, according to Equation (13), the sum of the residuals, each including a sparse component, that are obtained by removing the low rank components of the i-th iteration from the M-channel amplitude spectrograms. When the mean of the updated variational posterior distribution is calculated, the expression before the "," in Equation (13) is used. The operation of dividing the sum of the residuals by the sum of the content ratios (volumes) with which the M sparse components are included in the common sparse component is carried out by calculating the mean of the variational posterior distribution obtained by Equation (13). The result is taken as the estimate of the common sparse component at the i-th iteration. The sum of the content ratios [Σ(g_mt) in Equation (13)] gradually converges in the course of the iterative estimation when an iterative estimation method such as the variational Bayes EM method is used. β^s_ft in Equation (13) is a coefficient for Bayesian estimation.
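The division described above can be sketched with arrays as follows. The sketch sums the per-channel residuals and divides by the summed volumes at each time frame. The optional `alpha_s` and `beta_s` terms are an assumption about how prior-related coefficients might enter the numerator and denominator; the text only states that β^s_ft appears in Equation (13) as a coefficient for Bayesian estimation. Array shapes and names are likewise assumed.

```python
import numpy as np


def common_sparse_component(residuals, volumes, alpha_s=0.0, beta_s=0.0):
    """Estimate the common sparse component S with shape (F, T).

    residuals: (M, F, T) residuals after removing each low rank component.
    volumes:   (M, T) content ratios g_mt of the sparse component.
    """
    numerator = alpha_s + residuals.sum(axis=0)          # (F, T)
    denominator = beta_s + volumes.sum(axis=0)[None, :]  # (1, T), broadcast over F
    return numerator / denominator


rng = np.random.default_rng(0)
M, F, T = 8, 3, 4
S_true = rng.random((F, T)) + 0.1
g = rng.random((M, T)) + 0.1
# Noise-free toy case: every residual is the common component scaled by g_mt.
residuals = g[:, None, :] * S_true[None, :, :]
S_est = common_sparse_component(residuals, g)
```

In this noise-free toy case the estimate reproduces the true common component exactly, because a channel with a small volume contributes proportionally little to both the numerator and the denominator.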

In the case of FIG. 2, when the termination condition is satisfied, the common sparse component S is given to the phase restoration unit 9, and the phase restoration unit 9 calculates the target acoustic spectrogram s′_ft (output) by Equation (19).

In the above equation, s_ft represents an element of S, and y′_mft represents an element of the observed complex spectrogram.
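One common way to restore a phase for an estimated amplitude spectrogram is to reuse the phase of an observed complex spectrogram. The sketch below does this for a single reference channel. It is an assumption about the general form of such a step, not a reproduction of Equation (19); in particular, the choice of reference channel, the function name, and the guard constant `eps` are hypothetical.

```python
import numpy as np


def attach_observed_phase(S, Y_complex, eps=1e-12):
    """Combine amplitude S (F, T) with the phase of Y_complex (F, T)."""
    # Unit-modulus phase factor y / |y|, guarded against division by zero.
    phase = Y_complex / np.maximum(np.abs(Y_complex), eps)
    return S * phase


S = np.array([[1.0, 2.0], [0.5, 0.0]])
Y = np.array([[1j, -2.0], [3.0 + 3.0j, 1.0]])
restored = attach_observed_phase(S, Y)
```

The result has the amplitude of S and the phase angle of the observation at every time-frequency bin where the observation is nonzero.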

The phase restoration unit 9 may also restore the target acoustic spectrogram s′_ft (output) using, in addition to the common sparse component S, the volume Hm of each basis at each time, the K bases Wm, and the content ratio g_m of the sparse component. In this case, the phase restoration unit 9 calculates the target acoustic spectrogram s′_ft using Equation (20).

In the above equation, s_ft, g_mt, w_mfk, and h_mkt represent elements of S, g_m, Wm, and Hm, respectively, and y′_mft represents an element of the observed complex spectrogram.
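A restoration that uses S, g_m, Wm, and Hm together with the observed complex spectrogram can be pictured as a soft mask applied to one channel: the observation is scaled by the fraction of the modeled amplitude that is attributed to the sparse term. The Wiener-like ratio below is an assumption about the general shape of such a formula and is not a reproduction of Equation (20); the function name, shapes, and `eps` are hypothetical.

```python
import numpy as np


def masked_restoration(S, g, W, H, Y_complex, eps=1e-12):
    """Scale one channel's complex spectrogram by the sparse-term fraction.

    S: (F, T), g: (T,), W: (F, K), H: (K, T), Y_complex: (F, T).
    """
    sparse = g[None, :] * S            # g_t * s_ft
    low = W @ H                        # sum_k w_fk * h_kt
    mask = sparse / (sparse + low + eps)
    return mask * Y_complex


rng = np.random.default_rng(1)
F, T, K = 3, 4, 2
S = rng.random((F, T)) + 0.1
g = np.ones(T)
W = rng.random((F, K)) + 0.1
Y = rng.standard_normal((F, T)) + 1j * rng.standard_normal((F, T))

out_noisy = masked_restoration(S, g, W, rng.random((K, T)) + 0.1, Y)
out_clean = masked_restoration(S, g, W, np.zeros((K, T)), Y)  # no low rank part
```

When the low rank part is zero the mask is effectively one and the observation passes through unchanged; otherwise the output amplitude is always smaller than the observed amplitude.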

[Evaluation experiment]
The speech enhancement performance of the present embodiment was evaluated using the operation noise of a flexible cable-like robot with a drive mechanism and a microphone array.

(Flexible cable robot used)
FIG. 3 shows a photograph of the flexible cable-like robot equipped with a microphone array. The main body consists of a corrugated tube with a diameter of 38 mm and a total length of 3 m. Eight microphones (M = 8) were mounted on the robot surface at 40 cm intervals, with the mounting direction rotated by 90 degrees from one microphone to the next. The distance between the microphones at both ends is 2.8 m. The microphones are distinguished by the index m, numbered in order from the operator's hand side (m = 1, ..., M). Like the Tube-type Active Scope Camera of Namari et al. [J. Fukuda, et al. Remote vertical exploration by active scope camera into collapsed buildings. In IEEE/RSJ IROS, pp. 1882-1888, 2014.], this robot moves forward by means of a drive mechanism using cilia and vibration motors. Seven vibration motors are mounted in series inside the robot at intervals of 40 cm.

(Experimental settings)
(Recording conditions)
Speech and operation noise were recorded separately using the flexible cable-like robot and then mixed at SNRs from -20 dB to +5 dB in 5 dB steps to evaluate the speech enhancement performance. As shown in FIG. 4, two arrangements of the robot and the loudspeaker (sound source) that reproduces the speech were defined as Condition 1 and Condition 2.

Condition 1: The robot is placed in free space and the sound source is placed in front of the robot. The room reverberation time (RT60) was 750 ms.

Condition 2: The robot is placed in a door gap and the sound source is placed in front of the robot. Four microphones are blocked from the sound source by the door. The reverberation time (RT60) was 990 ms.

The robot was driven, and 60 seconds of operation noise were recorded while the robot was shaken left and right by hand and by the vibration motors. In order to reduce noise, the target sound was created by convolving a 60-second recording with the impulse response measured while the robot was stationary. A total of 4 types (240 msec) of male voice and 2 female voices were used for recording. These recordings were performed with 8-channel synchronization, 24-bit quantization, and 16 kHz sampling.
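Mixing separately recorded speech and noise at a prescribed SNR amounts to scaling the noise so that the ratio of average powers matches the target. The sketch below shows this for a single channel together with the SNR grid from -20 dB to +5 dB in 5 dB steps. How the SNR was defined across the eight channels is not stated in the text, so the single-channel definition, the function name, and the synthetic signals are assumptions.

```python
import math


def mix_at_snr(speech, noise, snr_db):
    """Scale `noise` so that speech power / noise power equals snr_db, then add."""
    p_speech = sum(x * x for x in speech) / len(speech)
    p_noise = sum(x * x for x in noise) / len(noise)
    gain = math.sqrt(p_speech / (p_noise * 10 ** (snr_db / 10)))
    return [s + gain * n for s, n in zip(speech, noise)], gain


snr_grid = list(range(-20, 6, 5))  # -20, -15, -10, -5, 0, +5 dB

# Synthetic stand-ins for the recorded speech and operation noise.
speech = [math.sin(0.1 * n) for n in range(1000)]
noise = [math.cos(0.37 * n) for n in range(1000)]
mixture, gain = mix_at_snr(speech, noise, snr_db=-10)
```

At -10 dB the scaled noise carries ten times the power of the speech, which corresponds to the heavily noise-dominated end of the evaluated range.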

(Comparison method)
In the experiment, an example of this embodiment was compared with two comparative examples: multichannel non-negative matrix factorization (MNMF) [D. Kitamura, et al. Efficient multichannel nonnegative matrix factorization exploiting rank-1 spatial model. In IEEE ICASSP, pp. 276-280, 2015.] and robust principal component analysis (RPCA) [C. Sun, et al. Noise reduction based on robust principal component analysis. JCIS, Vol. 10, No. 10, pp. 4403-4410, 2014.2]. For RPCA, the result of the microphone at the tip was used. In addition, the median of the RPCA results of all microphones (Med-RPCA) was included in the comparison. The example of this embodiment is an offline implementation of the conventional method [Yoshiaki Bando et al. Speech enhancement for flexible cable-like robots based on motion noise suppression using robust principal component analysis. In RSJ2015].
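The integration step of Med-RPCA, taking the element-wise median of the per-channel results, can be sketched as follows. The toy data show how a median tolerates one blocked channel but loses half of the amplitude when four of eight channels are blocked. This is only an illustration of the median operation with made-up values and an assumed function name; it does not run RPCA and is not a reproduction of the experiment.

```python
import numpy as np


def median_integration(per_channel):
    """Element-wise median over channels: (M, F, T) -> (F, T)."""
    return np.median(per_channel, axis=0)


speech = np.array([[1.0, 0.0], [0.0, 2.0]])
stack = np.repeat(speech[None, :, :], 8, axis=0)  # 8 identical channel results

one_blocked = stack.copy()
one_blocked[0] = 0.0      # one microphone receives no speech
four_blocked = stack.copy()
four_blocked[:4] = 0.0    # four microphones receive no speech

median_one = median_integration(one_blocked)
median_four = median_integration(four_blocked)
```

With an even number of channels, NumPy averages the two middle values, so blocking exactly half of the channels halves the recovered amplitude in this toy case.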

(Evaluation measure)
The signal-to-distortion ratio (SDR), which represents the overall separation accuracy, was used as the evaluation measure.
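For intuition, a simplified signal-to-distortion ratio can be computed as the energy of the reference signal divided by the energy of the difference between reference and estimate, expressed in decibels. Standard SDR evaluation toolkits additionally decompose the error into several components before forming the ratio, so the sketch below is an approximation under that simplifying assumption and may not match the values plotted in FIG. 5. The function name and toy signals are hypothetical.

```python
import math


def simple_sdr_db(reference, estimate):
    """10 * log10(||reference||^2 / ||reference - estimate||^2)."""
    signal = sum(r * r for r in reference)
    distortion = sum((r - e) ** 2 for r, e in zip(reference, estimate))
    # A perfect estimate gives zero distortion, for which the ratio is undefined.
    return 10 * math.log10(signal / distortion)


reference = [1.0, 0.0, -1.0, 0.0]
estimate = [0.9, 0.0, -0.9, 0.0]   # 10 % amplitude error on the nonzero samples
sdr = simple_sdr_db(reference, estimate)
```

Here the distortion energy is one hundredth of the signal energy, which corresponds to an SDR of about 20 dB.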

(Experimental result)
FIG. 5 shows the speech enhancement performance, expressed as the signal-to-distortion ratio (SDR), for each SNR under Condition 1 and Condition 2. The higher the SDR, the better the speech enhancement performance, and the higher the SNR, the more speech the input contains. Under Condition 1 when the SNR is 0 dB or less, and under Condition 2 when the SNR is between -15 dB and 0 dB, the example of the present embodiment (the proposed method) shows the highest performance. In contrast, Med-RPCA, the comparative example with the second highest performance, degrades greatly under Condition 2, where some microphones are hidden. RPCA, which has the third highest performance under Condition 2, has lower performance than Med-RPCA and the proposed method under Condition 2. Compared with these, the example of the present embodiment (the proposed method) shows high performance under both conditions, which indicates that its dependence on the environment is low.

FIGS. 6 to 13 show, as waveforms, the speech enhancement results of the example of the present embodiment (the proposed method) and the enhancement results of the conventional methods. From these waveforms, it can be seen that the example of the present embodiment suppresses noise the most and emphasizes the speech. FIG. 6 is an excerpt of 4 of the 8 channels of the multichannel amplitude spectrogram Ym (m = 1, ..., 8). FIG. 7 is an excerpt of the low rank components Lm of 4 of the 8 channels. FIG. 8 shows the volume g_m of the common sparse component for 4 of the 8 channels. FIG. 9 shows the common sparse component. FIG. 10 shows the enhancement result with MNMF, FIG. 11 shows the enhancement result with Med-RPCA, and FIG. 12 shows the enhancement result with RPCA. FIG. 13 shows the result of restoring the phase of the common sparse component to obtain a target acoustic complex spectrogram and converting the target acoustic complex spectrogram into a target acoustic signal, which is a time signal.

In the present invention, speech enhancement is performed without prior information about the noise by exploiting the low rank property of the noise and the sparsity of the target acoustic signal. To this end, a common sparse component is estimated that includes sparse time-frequency components likely to be commonly included in the amplitude spectrograms of the most channels among the M-channel amplitude spectrograms. Noise is then suppressed by restoring the phase of the common sparse component and converting the resulting target acoustic complex spectrogram into the target acoustic signal. As a result, the target acoustic signal is restored with as little influence from the low rank component as possible, and the restoration accuracy can be made higher than before.

DESCRIPTION OF SYMBOLS
1 Target acoustic signal restoration system
3 Time-frequency analysis unit
5 Amplitude component extraction unit
7 Common sparse component estimation unit
9 Phase restoration unit
11 Phase component extraction unit
13 Target acoustic signal conversion unit
71 Low rank component ratio calculation unit
72 Low rank component calculation unit
73 Sparse component ratio calculation unit
74 Residual component calculation unit
75 Volume calculation unit
76 Common sparse component calculation unit

Claims

A target acoustic signal restoration system for restoring a target acoustic signal included in M-channel acoustic signals collected by M microphones (M is an integer of 2 or more), the system comprising:
a time-frequency analysis unit that obtains M-channel complex spectrograms by performing time-frequency analysis of the M-channel acoustic signals collected by the M microphones;
an amplitude component extraction unit that extracts M-channel amplitude spectrograms from the M-channel complex spectrograms;
a common sparse component estimation unit that receives the M-channel amplitude spectrograms as input and estimates a common sparse component including sparse time-frequency components that are likely to be commonly included in the amplitude spectrograms of the most channels among the M-channel amplitude spectrograms;
a phase restoration unit that restores the phase of the common sparse component to obtain a target acoustic complex spectrogram; and
a target acoustic signal conversion unit that converts the target acoustic complex spectrogram into the target acoustic signal, which is a time signal.
The target acoustic signal restoration system according to claim 1, wherein the common sparse component estimation unit estimates, as the common sparse component of the i-th iteration (i is a positive integer of 2 or more), the result obtained by dividing the sum of the residuals, each including a sparse component, that are obtained by removing the low rank components of the i-th iteration from the M-channel amplitude spectrograms, by the sum of the content ratios with which the M sparse components are included in the common sparse component.
The target acoustic signal restoration system according to claim 2, wherein the common sparse component estimation unit comprises:
a low rank component ratio calculation unit that calculates a ratio of low rank components included in the M-channel amplitude spectrograms;
a low rank component calculation unit that calculates M low rank components included in the M-channel amplitude spectrograms based on the ratio of the low rank components;
a sparse component ratio calculation unit that calculates a ratio of sparse components included in the M-channel amplitude spectrograms;
a residual component calculation unit that calculates residual components including M sparse components included in the M-channel amplitude spectrograms based on the ratio of the sparse components;
a volume calculation unit that calculates, based on the residual components including the M sparse components and on the common sparse component, the content ratios with which the M sparse components are included in the common sparse component as volumes of the M sparse components; and
a common sparse component calculation unit that calculates the common sparse component by dividing the sum of the residual components including the M sparse components by the sum of the volumes of the M sparse components,
and wherein the common sparse component is estimated by performing iterative calculations in the low rank component ratio calculation unit, the low rank component calculation unit, the sparse component ratio calculation unit, the residual component calculation unit, the volume calculation unit, and the common sparse component calculation unit.
The target acoustic signal restoration system according to claim 1, wherein the common sparse component estimation unit includes a Bayes estimator that performs Bayesian estimation of the common sparse component by a variational Bayes EM method or a sequential variational Bayes EM method.
A target acoustic signal restoration method for restoring, using a computer, a target acoustic signal included in M-channel acoustic signals collected by M microphones (M is an integer of 2 or more), the method comprising:
a time-frequency analysis step of obtaining M-channel complex spectrograms by performing time-frequency analysis of the M-channel acoustic signals collected by the M microphones;
an amplitude component extraction step of extracting M-channel amplitude spectrograms from the M-channel complex spectrograms;
a common sparse component estimation step of receiving the M-channel amplitude spectrograms as input and estimating a common sparse component including sparse time-frequency components that are likely to be commonly included in the amplitude spectrograms of the most channels among the M-channel amplitude spectrograms;
a phase restoration step of restoring the phase of the common sparse component to obtain a target acoustic complex spectrogram; and
a target acoustic signal conversion step of converting the target acoustic complex spectrogram into the target acoustic signal, which is a time signal.
The target acoustic signal restoration method according to claim 5, wherein, in the common sparse component estimation step, the result obtained by dividing the sum of the residuals, each including a sparse component, that are obtained by removing the low rank components of the i-th iteration from the M-channel amplitude spectrograms, by the sum of the content ratios with which the M sparse components are included in the common sparse component, is estimated as the common sparse component of the i-th iteration.
The target acoustic signal restoration method according to claim 6, wherein the common sparse component estimation step includes:
a low rank component ratio calculation step of calculating a ratio of low rank components included in the M-channel amplitude spectrograms;
a low rank component calculation step of calculating M low rank components included in the M-channel amplitude spectrograms based on the ratio of the low rank components;
a sparse component ratio calculation step of calculating a ratio of sparse components included in the M-channel amplitude spectrograms;
a residual component calculation step of calculating residual components including M sparse components included in the M-channel amplitude spectrograms based on the ratio of the sparse components;
a volume calculation step of calculating, based on the residual components including the M sparse components and on the common sparse component, the content ratios with which the M sparse components are included in the common sparse component as volumes of the M sparse components; and
a common sparse component calculation step of calculating the common sparse component by dividing the sum of the residual components including the M sparse components by the sum of the volumes of the M sparse components,
and wherein the common sparse component is estimated by performing iterative calculations in the low rank component ratio calculation step, the low rank component calculation step, the sparse component ratio calculation step, the residual component calculation step, the volume calculation step, and the common sparse component calculation step.
A computer program for target acoustic signal restoration, stored in computer-readable storage means, for realizing, using a computer, a target acoustic signal restoration method for restoring a target acoustic signal included in M-channel acoustic signals collected by M microphones (M is an integer of 2 or more), the computer program causing the computer to execute:
a time-frequency analysis step of obtaining M-channel complex spectrograms by performing time-frequency analysis of the M-channel acoustic signals collected by the M microphones;
an amplitude component extraction step of extracting M-channel amplitude spectrograms from the M-channel complex spectrograms;
a common sparse component estimation step of receiving the M-channel amplitude spectrograms as input and estimating a common sparse component including sparse time-frequency components that are likely to be commonly included in the amplitude spectrograms of the most channels among the M-channel amplitude spectrograms;
a phase restoration step of restoring the phase of the common sparse component to obtain a target acoustic complex spectrogram; and
a target acoustic signal conversion step of converting the target acoustic complex spectrogram into the target acoustic signal, which is a time signal.