WO2017204226A1 - System and method for target sound signal restoration - Google Patents

System and method for target sound signal restoration

Info

Publication number
WO2017204226A1
Authority
WO
WIPO (PCT)
Prior art keywords
component
sparse
channel
components
spectrogram
Prior art date
Application number
PCT/JP2017/019259
Other languages
French (fr)
Japanese (ja)
Inventor
Yoshiaki Bando
Kazuyoshi Yoshii
Katsutoshi Itoyama
Hiroshi Okuno
Original Assignee
Kyoto University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Kyoto University
Priority to JP2018519566A (granted as JP6886720B2)
Publication of WO2017204226A1

Classifications

    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00 - Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/02 - Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L21/0208 - Noise filtering
    • G10L21/0264 - Noise filtering characterised by the type of parameter measurement, e.g. correlation techniques, zero crossing techniques or predictive techniques

Definitions

  • the present invention relates to a target sound signal restoration system and method for restoring a target sound signal disturbed by low rank noise from observed multi-channel sound signals.
  • Non-Patent Document 1 discloses a technique capable of separating only a target sound signal from a multi-channel sound signal obtained by observing a mixed sound of noise and the target sound signal.
  • Non-Patent Document 2 discloses a technique for restoring speech signals included in multi-channel acoustic signals collected by a plurality of microphones. Specifically, a noise suppression method is disclosed in which RPCA (Robust Principal Component Analysis) is applied to a microphone array for speech enhancement of acoustic signals recorded by a flexible cable-like robot equipped with the array. In this technique, RPCA is first applied to each channel individually, and the results are integrated by taking the median in order to extract the component common to the channels.
  • RPCA: Robust Principal Component Analysis
  • In the technique of Non-Patent Document 3, speech enhancement is performed using the low-rankness of noise and the sparsity of speech, without prior information about the noise.
  • Patent Document 1: Japanese Patent No. 5752324
  • In the technique described in Patent Document 1, impulsive (sudden) noise is removed from a single-channel acoustic signal.
  • This technique removes impulsive noise well, but its performance deteriorates for sustained non-stationary noise (low-rank noise) that it does not anticipate.
  • In the technique of Patent Document 2 (JP 2009-116275 A), noise is suppressed from a single-channel acoustic signal by the minimum mean-square error (MMSE) method; since MMSE assumes stationary noise, performance degrades for non-stationary noise.
  • Patent Document 3: JP 2014-503849 A
  • In Patent Document 3, a technique is disclosed in which a handset microphone is placed near a noise source and speech enhancement is performed by actively using that microphone's signal.
  • This technique requires that the position of the noise source be identified and that the handset microphone be arranged near the noise source.
  • Patent Document 4 (JP 2015-095897 A) discloses a technique for separating a background image from a moving-object image by extracting a low-rank component and a sparse component from a video signal. Although this method can be applied to speech enhancement, performance degrades greatly when some microphones cannot record the sound adequately because of obstacles.
  • In the technique of Patent Document 5 (JP 2014-058399 A), each sound source signal is separated and extracted from multi-channel acoustic signals obtained by observing a mixed sound of an arbitrary number of sound source signals; since fixed microphone and sound source positions are assumed, performance deteriorates when they move.
  • In the technique of Non-Patent Document 1, recording the timbre information of the noise in advance is indispensable for suppressing noise and enhancing speech, so the technique was difficult to use when the noise changes with the operating environment of the system.
  • In the technique of Non-Patent Document 2, the target speech is enhanced from the multi-channel acoustic signals by using the low-rankness of noise and the sparsity of speech.
  • In that technique, robust principal component analysis is applied to each channel's amplitude spectrogram individually to separate low-rank and sparse components, and the median across the microphones' sparse components is then selected at each time-frequency point. Because all microphones are integrated by a median, performance degrades greatly when some microphones cannot record the speech adequately because of obstacles.
  • The technique of Non-Patent Document 3 is specialized for the analysis of real-valued matrices, is unsuited to the non-negative matrices that are the amplitude spectrograms of acoustic signals, and makes it difficult to realize the multi-channel extension and reliability-estimation functions needed for acoustic signal processing.
  • An object of the present invention is to provide an audio signal restoration system and method that can restore a target acoustic signal with high accuracy from an acoustic signal including noise without using prior information.
  • The present invention is directed to a target acoustic signal restoration system that restores a target acoustic signal included in M-channel acoustic signals collected by M microphones (M is an integer of 2 or more).
  • The system includes a time-frequency analysis unit, an amplitude component extraction unit, a common sparse component estimation unit, a phase restoration unit, and a target acoustic signal conversion unit.
  • the time-frequency analysis unit obtains an M-channel complex spectrogram by performing time-frequency analysis on an M-channel acoustic signal collected by M microphones.
  • the amplitude component extraction unit extracts an M-channel amplitude spectrogram from the M-channel complex spectrogram.
  • The common sparse component estimation unit receives the M-channel amplitude spectrograms as input and estimates a common sparse component containing the sparse frequency components that are likely to be included in common in the amplitude spectrograms of the largest number of channels.
  • As a representative example of "the amplitude spectrograms of the largest number of channels": if two of the M microphones pick up almost no sound, the two-channel amplitude spectrograms obtained from those microphones are not counted, and the amplitude spectrograms of the M-2 channels obtained from the acoustic signals collected by the remaining M-2 microphones constitute "the amplitude spectrograms of the largest number of channels".
  • the phase restoration unit restores the phase of the common sparse component to obtain a target acoustic complex spectrogram.
  • the phase may be estimated from the amplitude spectrogram of the M channel, the common sparse component, etc., and the method for obtaining the phase is arbitrary.
  • the target acoustic signal conversion unit converts the target acoustic complex spectrogram into a target acoustic signal that is a time signal.
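The following is a minimal Python sketch of this pipeline; it is not taken from the patent text. It assumes SciPy's STFT/ISTFT, uses a placeholder `estimate_common_sparse` for the iterative estimator (a sketch of which appears after the Bayesian-estimator bullet further below), and reuses the observed phase of the first channel as one simple way to realize the phase restoration step.

```python
import numpy as np
from scipy.signal import stft, istft

def restore_target(signals, fs, nperseg=1024):
    """signals: (M, n_samples) array of M-channel audio.
    Returns a time-domain estimate of the target signal."""
    # Time-frequency analysis: M-channel complex spectrograms (M, F, T).
    _, _, Y = stft(signals, fs=fs, nperseg=nperseg)
    A = np.abs(Y)                      # M-channel amplitude spectrograms
    # Placeholder for the iterative common sparse component estimator.
    S = estimate_common_sparse(A)      # (F, T) common sparse component
    # Phase restoration: here, simply attach channel 0's observed phase.
    S_complex = S * np.exp(1j * np.angle(Y[0]))
    # Convert the target complex spectrogram back to a time signal.
    _, s_time = istft(S_complex, fs=fs, nperseg=nperseg)
    return s_time
```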
  • The common sparse component estimation unit estimates the common sparse component on the assumption that each channel's acoustic signal is decomposed into a low-rank component and a common sparse component shared between the channels; noise is suppressed by restoring the phase of the common sparse component and converting the restored target acoustic complex spectrogram into a target acoustic signal.
  • The channels are modeled as sharing the common sparse component and differing only in the content ratio with which each channel's sparse component is contained in it (in this specification this content ratio is sometimes called the "volume"). By estimating this content ratio (volume), robust target sound signal enhancement is realized even when some microphone cannot record the target sound signal adequately.
  • Specifically, speech enhancement is performed using the low-rankness of noise and the sparsity of the target acoustic signal, without prior information about the noise.
  • Because the common sparse component, containing the sparse frequency components likely to be included in common in the amplitude spectrograms of the largest number of channels, is estimated directly, the target acoustic signal can be restored with minimal influence from the low-rank components. As a result, the restoration accuracy can be made higher than before.
  • The present invention is specialized for the analysis of amplitude spectrograms, which are non-negative real-valued matrices, and does not depend on the usage environment of the system, such as the placement of the microphones. Target sound signal enhancement therefore operates robustly even in environments with many obstacles around the microphone array or when no information about the noise can be obtained in advance, and the target acoustic signal can be enhanced robustly even when some microphones cannot record the sound at a sufficiently high volume because of obstacles. For example, a flexible cable-like rescue robot that enters narrow gaps in rubble to search for victims has the problem that its own operating noise makes the victims' voices hard to hear; with the present invention, speech can be enhanced robustly even inside rubble.
  • The common sparse component estimation unit divides the sum of the M residuals, each obtained by removing the (i-1)-th iteration's low-rank component from the corresponding channel's amplitude spectrogram, by the sum of the content ratios (volumes) with which the M sparse components are contained in the common sparse component, and takes the result as the i-th iteration's estimate of the common sparse component, as sketched below. These content ratios converge gradually over the course of iterative estimation with a method such as the variational Bayes EM method or the sequential variational Bayes EM method.
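As a rough numpy illustration of that update (an assumption-laden sketch, not the patent's exact rule): with Y holding the M-channel amplitude spectrograms, L the low-rank estimates from the previous iteration, and g the per-channel volumes (simplified here to one scalar per channel, although the text also describes per-time ratios g_mt), the update reads:

```python
import numpy as np

def update_common_sparse(Y, L, g):
    """Y, L: (M, F, T) amplitude spectrograms and previous low-rank
    estimates; g: (M,) volumes (content ratios). Returns the new
    (F, T) common sparse component."""
    residual = np.maximum(Y - L, 0.0)       # M residuals, kept non-negative
    return residual.sum(axis=0) / g.sum()   # sum of residuals / sum of volumes
```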
  • When an iterative estimation method is used, the common sparse component estimation unit comprises: a low-rank component ratio calculation unit that calculates the ratio of the low-rank components contained in the M-channel amplitude spectrograms; a low-rank component calculation unit that calculates, based on that ratio, the M low-rank components contained in the spectrograms; a sparse component ratio calculation unit that calculates the ratio of the sparse components contained in the spectrograms; a residual component calculation unit that calculates, based on the sparse component ratio, the residual components containing the M sparse components; a volume calculation unit that calculates, from the residual components and the common sparse component, the content ratios with which the M sparse components are contained in the common sparse component, as the volumes of the M sparse components; and a common sparse component calculation unit that calculates the common sparse component by dividing the sum of the residual components by the sum of the volumes.
  • The common sparse component is estimated by performing iterative calculations across the low-rank component ratio calculation unit, the low-rank component calculation unit, the sparse component ratio calculation unit, the residual component calculation unit, the volume calculation unit, and the common sparse component calculation unit. With this configuration and an iterative estimation method, the common sparse component can be estimated with high accuracy by repeating the appropriate iterative operations a predetermined number of times.
  • The common sparse component estimation unit can be configured as, for example, a Bayesian estimator that estimates the common sparse component by the variational Bayes EM method or the sequential variational Bayes EM method; with a Bayesian estimator, the common sparse component can be estimated easily.
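The following Python skeleton shows one way such an iterative estimator could be organized. The update functions and `init_state` are hypothetical placeholders for the calculation units listed above (their names are illustrative, not from the patent), and the termination rule mirrors the 200-iteration / convergence criterion described later in this document.

```python
import numpy as np

def estimate_common_sparse(A, max_iter=200, tol=1e-6):
    """A: (M, F, T) amplitude spectrograms. Sketch of the iterative
    (variational-Bayes-style) estimation of the common sparse component."""
    M, _, _ = A.shape
    S = A.mean(axis=0)        # crude initialization of the common sparse part
    g = np.ones(M)            # per-channel volumes (content ratios)
    state = init_state(A)     # hypothetical container for W, H, ratios, etc.
    for _ in range(max_iter):
        state = update_low_rank_ratio(state, A)   # low-rank component ratio
        L = update_low_rank(state, A)             # M low-rank components
        state = update_sparse_ratio(state, A)     # sparse component ratio
        R = update_residual(state, A, L)          # residuals with M sparse parts
        g = update_volume(R, S)                   # volumes of the M sparse parts
        S_new = R.sum(axis=0) / g.sum()           # common sparse component
        if np.linalg.norm(S_new - S) < tol * np.linalg.norm(S):
            break                                 # estimates approximately unchanged
        S = S_new
    return S
```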
  • The present invention can also be specified as a target acoustic signal restoration method that uses a computer to restore a target acoustic signal included in M-channel acoustic signals collected by M microphones (M is an integer of 2 or more).
  • In this method, the computer executes a time-frequency analysis step that obtains M-channel complex spectrograms by time-frequency analysis of the M-channel acoustic signals collected by the M microphones; an amplitude component extraction step that extracts M-channel amplitude spectrograms from the complex spectrograms; and a common sparse component estimation step that takes the M-channel amplitude spectrograms as input and estimates a common sparse component containing the sparse time-frequency components likely to be included in common in the amplitude spectrograms of the largest number of channels.
  • In the common sparse component estimation step, the sum of the M residuals, each obtained by removing the (i-1)-th iteration's low-rank component from the corresponding channel's amplitude spectrogram, is divided by the sum of the content ratios with which the M sparse components are contained in the common sparse component, and the result is taken as the i-th iteration's estimate of the common sparse component.
  • The common sparse component estimation step comprises: a low-rank component ratio calculation step that calculates the ratio of the low-rank components contained in the M-channel amplitude spectrograms; a low-rank component calculation step that calculates, based on that ratio, the M low-rank components contained in the spectrograms; a sparse component ratio calculation step that calculates the ratio of the sparse components; a residual component calculation step that calculates, based on the sparse component ratio, the residual components containing the M sparse components; a volume calculation step that calculates, from the residual components and the common sparse component, the content ratios with which the M sparse components are contained in the common sparse component, as the volumes of the M sparse components; and a common sparse component calculation step that calculates the common sparse component by dividing the sum of the residual components by the sum of the volumes.
  • the common sparse component is estimated by performing iterative calculations in the low rank component ratio calculation step, the low rank component calculation step, the sparse component ratio calculation step, the residual component calculation step, the volume calculation step, and the common sparse component calculation step.
  • The present invention can also be specified as a computer program, stored in computer-readable storage means, for realizing with a computer the target acoustic signal restoration method that restores a target acoustic signal included in M-channel acoustic signals collected by M microphones (M is an integer of 2 or more). This computer program causes the computer to execute: a time-frequency analysis step that obtains M-channel complex spectrograms by time-frequency analysis of the M-channel acoustic signals; an amplitude component extraction step that extracts M-channel amplitude spectrograms from the complex spectrograms; a common sparse component estimation step that takes the M-channel amplitude spectrograms as input and estimates a common sparse component containing the sparse time-frequency components likely to be included in common in the amplitude spectrograms of the largest number of channels; a phase restoration step that restores the phase of the common sparse component to obtain a target acoustic complex spectrogram; and a target acoustic signal conversion step that converts the target acoustic complex spectrogram into a target acoustic signal, which is a time signal.
  • FIG. 1 is a block diagram showing the configuration of an example embodiment of the target acoustic signal restoration system of the present invention. FIG. 2 is a flowchart showing the algorithm of the computer program used when the common sparse component estimation unit of FIG. 1 is realized by an iterative estimation method on a computer.
  • FIG. 1 is a block diagram showing a configuration of an example of an embodiment of a target acoustic signal restoration system of the present invention realized by using a computer or a plurality of processors and a plurality of memories.
  • a speech signal is extracted as a target acoustic signal from a multi-channel acoustic signal obtained by observing speech disturbed by low rank noise.
  • Even when the microphones move and some of them cannot observe the speech at a sufficiently large volume because of obstacles, the speech signal can be extracted robustly.
  • the target acoustic signal restoration system 1 restores a target acoustic signal included in an M channel acoustic signal collected by M (M is an integer of 2 or more) microphones.
  • The target acoustic signal restoration system 1 comprises a time-frequency analysis unit 3, an amplitude component extraction unit 5, a common sparse component estimation unit 7, a phase restoration unit 9, a phase component extraction unit 11, and a target acoustic signal conversion unit 13, each realized by a computer or by one or more processors and one or more memories.
  • The time-frequency analysis unit 3 performs time-frequency analysis of the M-channel acoustic signals collected by the M microphones (M is an integer of 2 or more) provided, for example, on the flexible cable-like robot of Non-Patent Document 2 ("Speech enhancement for a flexible cable-like robot based on motion-noise suppression using robust principal component analysis"), and obtains M-channel complex spectrograms.
  • the amplitude component extraction unit 5 extracts an M-channel amplitude spectrogram from the M-channel complex spectrogram.
  • The common sparse component estimation unit 7 receives the M-channel amplitude spectrograms as input and estimates a common sparse component containing the sparse frequency components that are likely to be included in common in the amplitude spectrograms of the largest number of channels.
  • the phase restoration unit 9 restores the phase of the common sparse component to obtain a target acoustic complex spectrogram.
  • the phase information is extracted from the time frequency analysis unit 3 by the phase component extraction unit 11.
  • the phase may be estimated from the amplitude spectrogram of the M channel, the common sparse component, etc., and the method for obtaining the phase is arbitrary. Therefore, the present invention is not limited to the provision of the phase component extraction unit 11.
  • the target acoustic signal conversion unit 13 converts the target acoustic complex spectrogram into a target acoustic signal that is a time signal.
  • In this embodiment, the common sparse component estimation unit 7 uses an iterative estimation method. It therefore includes a low-rank component ratio calculation unit 71, a low-rank component calculation unit 72, a sparse component ratio calculation unit 73, a residual component calculation unit 74, a volume calculation unit 75, and a common sparse component calculation unit 76.
  • the low rank component ratio calculation unit 71 calculates the ratio of the low rank components included in the amplitude spectrogram of the M channel.
  • the low rank component calculation unit 72 calculates M low rank components included in the amplitude spectrogram of the M channel based on the ratio of the low rank components.
  • the sparse component ratio calculation unit 73 calculates the ratio of sparse components included in the amplitude spectrogram of the M channel. Then, the residual component calculation unit 74 calculates a residual component including M sparse components included in the amplitude spectrogram of the M channel based on the ratio of the sparse components.
  • The volume calculation unit 75 calculates, based on the residual components containing the M sparse components and on the common sparse component, the content ratios with which the M sparse components are contained in the common sparse component, as the volumes of the M sparse components. These content ratios are obtained in the course of iterative estimation when an iterative estimation method such as the variational Bayes EM method or the sequential variational Bayes EM method is used.
  • The common sparse component calculation unit 76 divides the sum of the M residuals, each obtained by removing the (i-1)-th iteration's low-rank component from the corresponding channel's amplitude spectrogram, by the sum of the content ratios with which the M sparse components are contained in the common sparse component, and takes the result as the i-th iteration's estimate of the common sparse component.
  • The common sparse component is estimated by iterative calculation across the low-rank component ratio calculation unit 71, the low-rank component calculation unit 72, the sparse component ratio calculation unit 73, the residual component calculation unit 74, the volume calculation unit 75, and the common sparse component calculation unit 76.
  • the common sparse component estimation unit 7 estimates the common sparse component on the assumption that “the acoustic signal of each channel is decomposed into a low rank component and a common sparse component common between channels”.
  • the phase of the sparse component is restored by the phase restoring unit 9, and the restored target acoustic complex spectrogram is converted by the target acoustic signal converting unit 13 to perform noise suppression.
  • The channels are modeled as sharing the common sparse component and differing only in the content ratio (volume) with which each channel's sparse component is contained in it; as a result, robust speech enhancement is realized even if some microphone cannot record the acoustic signal (speech or the like) adequately.
  • In this way, the target sound signal (speech or the like) is enhanced using the low-rankness of noise and the sparsity of the target sound signal, without prior information about the noise.
  • Because the common sparse component, containing the sparse frequency components likely to be included in common in the amplitude spectrograms of the largest number of channels, is estimated directly, the target acoustic signal can be restored with minimal influence from the low-rank components. As a result, the restoration accuracy can be made higher than before.
  • This embodiment is specialized for the analysis of non-negative real-valued matrices and realizes robust target sound signal enhancement from the low-rankness of noise and the sparsity of the target acoustic signal (such as speech), without having to record the timbre information of the noise in advance. With this feature, enhancement of the target acoustic signal (speech or the like) operates robustly even in environments with many obstacles around the microphone array, or when information about the noise cannot be obtained in advance, or when some microphones cannot record the sound at a sufficiently high volume because of obstacles.
  • a generation model employed in the present embodiment will be described.
  • the target acoustic signal may be referred to as an audio signal for convenience.
  • the speech enhancement problem handled by this generation model is defined below.
  • F and T represent the number of frequency bins and the number of time frames, respectively.
  • The amplitude spectrogram of each channel is treated as the acoustic signal, and f and t are the frequency-bin and time-frame indexes.
  • The approximation error of the input amplitude spectrograms is evaluated using the Kullback-Leibler (KL) pseudo-distance (generalized KL divergence).
  • Minimizing the KL pseudo-distance corresponds to maximizing a Poisson likelihood.
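To make this correspondence concrete (a standard identity, not quoted from the patent): the generalized KL divergence between a non-negative observation y and a model x differs from the Poisson negative log-likelihood only by terms constant in x, so the two share the same minimizer.

```python
import numpy as np

def gen_kl(y, x, eps=1e-12):
    """Generalized KL divergence D(y || x) for non-negative arrays."""
    return np.sum(y * np.log((y + eps) / (x + eps)) - y + x)

def poisson_nll(y, x, eps=1e-12):
    """Poisson negative log-likelihood of y with mean x, dropping the
    log(y!) term, which does not depend on the model x."""
    return np.sum(x - y * np.log(x + eps))
```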
  • the likelihood model is defined as follows.
  • In Equation (4), Ym is the m-th of the M-channel amplitude spectrograms, Hm holds the volume of each basis at each time, Wm holds the K bases, S is the common sparse component, and Gm is the content ratio of the sparse component in each microphone. Here s_ft, g_mt, w_mfk, and h_mkt denote elements of S, Gm, Wm, and Hm, respectively, and y_mft denotes an element of the observed complex spectrogram.
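From the variable roles above, a plausible reading of the likelihood model of Equation (4) (inferred from the surrounding text, since the equation itself is not reproduced here) is that each observed time-frequency magnitude is Poisson-distributed around a low-rank part plus a volume-scaled common sparse part:

$$y_{mft} \sim \mathrm{Poisson}\Big(\sum_{k=1}^{K} w_{mfk}\, h_{mkt} + g_{mt}\, s_{ft}\Big)$$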
  • the prior distribution of the low-rank component basis matrix and activation matrix is formulated using the Gamma distribution, which is the conjugate prior distribution of the Poisson distribution.
  • Here G(·; α, β) represents a gamma distribution with shape parameter α and rate parameter β.
  • With these priors, the bases and the activation matrix can be constrained to be sparse, whereby the low-rank component L is constrained to be a low-rank matrix.
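In line with this, the conjugate gamma priors on the basis and activation elements would take a form such as (hyperparameter symbols assumed for illustration):

$$w_{mfk} \sim \mathcal{G}(\alpha^{w}, \beta^{w}), \qquad h_{mkt} \sim \mathcal{G}(\alpha^{h}, \beta^{h}),$$

where small shape parameters encourage sparse bases and activations and thereby keep $L_m = W_m H_m$ low-rank.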
  • A Gaussian prior can represent the sparse component by placing a Jeffreys hyperprior on its precision parameter; here, instead, a gamma prior is placed on the sparse component, with a Jeffreys hyperprior on the rate parameter of the gamma distribution, which corresponds to the precision parameter of the Gaussian.
  • A gamma distribution, the conjugate prior of the Poisson distribution, is placed on the volume (content ratio) g_mt of the sparse component at each microphone.
  • A hyperparameter adjusts the variation of the sparse component's volume across the microphones.
  • q (•) is a variational posterior distribution.
  • The posterior distribution is decomposed and approximated as follows, and inference is performed by minimizing the KL pseudo-distance to the true posterior distribution.
  • each update rule can be easily derived by using Jensen's inequality and Lagrange's undetermined multiplier method.
  • ⟨·⟩ denotes the mean of a random variable under its variational posterior distribution.
  • each posterior distribution is obtained by iteratively updating the following update rule with other parameters fixed.
  • FIG. 2 is a flowchart showing an algorithm of a computer program used when the common sparse component estimation unit 7 of FIG. 1 is realized by an iterative estimation method using a computer.
  • each expression is attached to the steps in which the above expressions (11) to (18) are used.
  • The iterative estimation terminates either after 200 iterations or when each estimated quantity (Ym, Hm, Wm, S, the hyperparameter of S, and g_m) is approximately unchanged from the previous iteration.
  • Here Ym is the m-th of the M-channel amplitude spectrograms, Hm holds the volume of each basis at each time, Wm holds the K bases, S is the common sparse component, the hyperparameter of S is obtained by Bayesian estimation, and g_m is the content ratio of the sparse component.
  • The calculation performed by the common sparse component calculation unit 76, namely the sum of the M residuals obtained by removing the (i-1)-th iteration's low-rank components from the M-channel amplitude spectrograms, is given by the first part of Equation (13). Dividing this sum of residuals by the sum of the content ratios (volumes) with which the M sparse components are contained in the common sparse component is carried out by taking the mean of the variational posterior distribution obtained by Equation (13); the result is the i-th iteration's estimate of the common sparse component.
  • The phase restoration unit 9 calculates the target acoustic spectrogram s'_ft (the output) by Equation (19), where s_ft represents an element of S and y'_mft represents an element of the observed complex spectrogram.
  • The phase restoration unit 9 may instead restore the target acoustic spectrogram s'_ft using the volume Hm of each basis at each time, the K bases Wm, and the content ratios g_m of the sparse components; in that case it calculates s'_ft by Equation (20), where s_ft, g_mt, w_mfk, and h_mkt denote elements of S, Gm, Wm, and Hm, respectively, and y'_mft denotes an element of the observed complex spectrogram.
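Since Equations (19) and (20) themselves are not reproduced here, the following is only a hedged sketch of one common way to realize such phase restoration: build a Wiener-like time-frequency mask from the estimated components and apply it to the observed complex spectrograms.

```python
import numpy as np

def restore_phase(S, g, W, H, Y, eps=1e-12):
    """S: (F, T) common sparse magnitude; g: (M,) volumes;
    W: (M, F, K) bases; H: (M, K, T) activations;
    Y: (M, F, T) observed complex spectrograms.
    Returns an (F, T) target complex spectrogram (sketch only)."""
    L = np.einsum('mfk,mkt->mft', W, H)    # low-rank magnitude estimates
    target = g[:, None, None] * S[None]    # per-channel target magnitudes
    mask = target / (L + target + eps)     # Wiener-like soft mask
    return (mask * Y).mean(axis=0)         # average the masked channels
```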
  • FIG. 3 shows a photograph of a flexible cable-like robot equipped with a microphone array.
  • the main body consists of a corrugated tube with a diameter of 38 mm and has a total length of 3 m.
  • This robot, similar to Namari et al.'s tube-type Active Scope Camera [J. Fukuda, et al. Remote vertical exploration by active scope camera into collapsed buildings. In IEEE/RSJ IROS, pp. 1882-1888, 2014.], advances by a drive mechanism using cilia and vibration motors. Seven vibration motors are mounted in series inside the robot at 40 cm intervals.
  • Condition 1: the robot is placed in free space and the sound source is placed in front of the robot. The room reverberation time (RT60) was 750 ms.
  • Condition 2: the robot is placed in the door gap and the sound source is placed in front of the robot; four microphones are blocked from the sound source by the door. The reverberation time (RT60) was 990 ms.
  • For the noise, the robot was driven and 60 seconds of operating noise was recorded while the robot was shaken left and right by hand and driven by the vibration motors.
  • The target sound was created by convolving 60-second recordings with the impulse response measured while the robot was stationary. A total of four utterances (two male and two female voices, 240 msec) were used for the recordings, which were made with 8-channel synchronization, 24-bit quantization, and 16 kHz sampling.
  • FIG. 5 shows the speech enhancement performance under arrangement conditions 1 and 2 and the SNR conditions, as signal-to-distortion ratio (SDR), computed as sketched below.
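For reference, the signal-to-distortion ratio in its simplest form is shown here (the evaluation in the patent may instead use the BSS Eval decomposition; this is a simplified stand-in):

```python
import numpy as np

def sdr(reference, estimate, eps=1e-12):
    """Signal-to-distortion ratio in dB between a time-aligned
    reference target signal and its estimate."""
    noise = reference - estimate
    return 10.0 * np.log10((np.sum(reference ** 2) + eps)
                           / (np.sum(noise ** 2) + eps))
```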
  • FIG. 7 shows an extract of the low-rank components Lm for 4 of the 8 channels.
  • FIG. 8 shows the volumes gm of the common sparse component for 4 of the 8 channels.
  • FIG. 9 shows the common sparse component.
  • FIG. 10 shows the enhancement result with MNMF.
  • FIG. 11 shows the enhancement result with Med-RPCA.
  • FIG. 12 shows the enhancement result with RPCA.
  • FIG. 13 shows a result of restoring the phase of the common sparse component to obtain a target acoustic complex spectrogram, and converting the target acoustic complex spectrogram into a target acoustic signal that is a time signal.
  • As described above, in this embodiment speech enhancement is performed without prior noise information, using the low-rankness of noise and the sparsity of the target acoustic signal. To this end, the common sparse component, containing the sparse frequency components likely to be included in common in the amplitude spectrograms of the largest number of channels, is estimated; noise is then suppressed by restoring the phase of the common sparse component and converting the restored target acoustic complex spectrogram into the target acoustic signal. The target acoustic signal is thus restored with minimal influence from the low-rank components, and the restoration accuracy can be made higher than before.

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Quality & Reliability (AREA)
  • Signal Processing (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Circuit For Audible Band Transducer (AREA)

Abstract

Provided are a system and a method for voice signal restoration that are able to restore, with a high degree of accuracy, a target sound signal from a noise-containing sound signal without using prior information. A common sparse component estimation unit 7 that receives, as input, amplitude spectrograms of M channels is provided for estimating a common sparse component containing a sparse time-frequency component which is highly likely to be commonly included in the amplitude spectrograms of the largest number of channels among the M channels. A phase restoration unit 9 restores a phase of the common sparse component so as to obtain a target sound complex spectrogram. A target sound signal conversion unit 13 converts the target sound complex spectrogram into a target sound signal, which is a temporal signal.

Description

Target sound signal restoration system and method
 The present invention relates to a target sound signal restoration system and method for restoring a target sound signal disturbed by low-rank noise from observed multi-channel sound signals.
 Non-Patent Document 1 discloses a technique capable of separating only the target sound signal from multi-channel sound signals obtained by observing a mixed sound of noise and the target sound signal.
 Non-Patent Document 2 discloses a technique for restoring speech signals included in multi-channel acoustic signals collected by a plurality of microphones. Specifically, a noise suppression method is disclosed in which RPCA (Robust Principal Component Analysis) is applied to a microphone array for speech enhancement of acoustic signals recorded by a flexible cable-like robot equipped with the array. In this technique, RPCA is first applied to each channel individually, and the results are integrated by taking the median in order to extract the component common to the channels.
 In the technique described in Non-Patent Document 3, speech enhancement is performed using the low-rankness of noise and the sparsity of speech, without prior noise information.
 Furthermore, in the technique described in Patent Document 1 (Japanese Patent No. 5752324), impulsive (sudden) noise is removed from a single-channel acoustic signal. This technique removes impulsive noise well, but its performance deteriorates for sustained non-stationary noise (low-rank noise) that it does not anticipate.
 In the technique described in Patent Document 2 (JP 2009-116275 A), noise is suppressed from a single-channel acoustic signal based on the minimum mean-square error (MMSE) method. Since MMSE assumes that the noise is stationary, performance degrades when suppressing non-stationary noise.
 Patent Document 3 (JP 2014-503849 A) discloses a technique in which a handset microphone is placed near a noise source and speech enhancement is performed by actively using that microphone's signal. This technique requires that the position of the noise source be identified and that the handset microphone be arranged near the noise source.
 Patent Document 4 (JP 2015-095897 A) discloses a technique for separating a background image from a moving-object image by extracting a low-rank component and a sparse component from a video signal. Although this method can be applied to speech enhancement, performance degrades greatly when some microphones cannot record the sound adequately because of obstacles.
 In the technique described in Patent Document 5 (JP 2014-058399 A), each sound source signal is separated and extracted from multi-channel acoustic signals obtained by observing a mixed sound of an arbitrary number of sound sources. This technique assumes that the positions of the microphones and the sound sources are fixed, and performance deteriorates when they move.
Patent Document 1: Japanese Patent No. 5752324
Patent Document 2: JP 2009-116275 A
Patent Document 3: JP 2014-503849 A
Patent Document 4: JP 2015-095897 A
Patent Document 5: JP 2014-058399 A
 However, in the technique shown in Non-Patent Document 1, recording the timbre information of the noise in advance is indispensable for suppressing noise and enhancing speech, so the technique was difficult to use when the noise changes with the operating environment of the system.
 The technique shown in Non-Patent Document 2 enhances the target speech from multi-channel acoustic signals using the low-rankness of noise and the sparsity of speech. Robust principal component analysis is applied to each channel's amplitude spectrogram individually to separate low-rank and sparse components, and the median across the microphones' sparse components is then selected at each time-frequency point to enhance the speech. Because the signals of all microphones are integrated by a median, performance degrades greatly when some microphones cannot record the speech adequately because of obstacles.
 The conventional technique shown in Non-Patent Document 3 is specialized for the analysis of real-valued matrices, is unsuited to the non-negative matrices that are the amplitude spectrograms of acoustic signals, and makes it difficult to realize the multi-channel extension and reliability-estimation functions needed for acoustic signal processing.
 Conventionally, no technique has been proposed that, when speech is collected by multi-channel microphones, can enhance it robustly even if some microphones cannot record it at a sufficiently high volume because of obstacles. For example, a flexible cable-like rescue robot that enters narrow gaps in rubble to search for victims has the problem that its own operating noise makes the victims' voices hard to hear.
 An object of the present invention is to provide a target sound signal restoration system and method that can restore a target acoustic signal with high accuracy from noise-containing acoustic signals without using prior information.
 The present invention is directed to a target acoustic signal restoration system that restores a target acoustic signal included in M-channel acoustic signals collected by M microphones (M is an integer of 2 or more). The system includes a time-frequency analysis unit, an amplitude component extraction unit, a common sparse component estimation unit, a phase restoration unit, and a target acoustic signal conversion unit. The time-frequency analysis unit obtains M-channel complex spectrograms by time-frequency analysis of the M-channel acoustic signals collected by the M microphones. The amplitude component extraction unit extracts M-channel amplitude spectrograms from the complex spectrograms. The common sparse component estimation unit receives the M-channel amplitude spectrograms as input and estimates a common sparse component containing the sparse frequency components that are likely to be included in common in the amplitude spectrograms of the largest number of channels. As a representative example: if two of the M microphones pick up almost no sound, the two-channel amplitude spectrograms obtained from those microphones are not counted among "the amplitude spectrograms of the largest number of channels"; the amplitude spectrograms of the M-2 channels obtained from the acoustic signals collected by the remaining M-2 microphones constitute them. The phase restoration unit restores the phase of the common sparse component to obtain a target acoustic complex spectrogram. The phase may be estimated from the M-channel amplitude spectrograms, the common sparse component, and so on; the method of obtaining the phase is arbitrary. The target acoustic signal conversion unit converts the target acoustic complex spectrogram into a target acoustic signal, which is a time signal.
 In the present invention, the common sparse component estimation unit estimates the common sparse component on the assumption that each channel's acoustic signal is decomposed into a low-rank component and a common sparse component shared between the channels; noise is suppressed by restoring the phase of the common sparse component and converting the restored target acoustic complex spectrogram into a target acoustic signal. The channels are modeled as sharing the common sparse component and differing only in the content ratio with which each channel's sparse component is contained in it (in this specification this content ratio is sometimes called the "volume"); by estimating this content ratio (volume), robust target sound signal enhancement is realized even when some microphone cannot record the target sound signal adequately.
 Specifically, in the present invention, speech enhancement is performed using the low-rankness of noise and the sparsity of the target acoustic signal, without prior information about the noise. Because the common sparse component, containing the sparse frequency components likely to be included in common in the amplitude spectrograms of the largest number of channels, is estimated, the target acoustic signal can be restored with minimal influence from the low-rank components. As a result, the restoration accuracy can be made higher than before.
 The present invention is specialized for the analysis of amplitude spectrograms, which are non-negative real-valued matrices, and does not depend on the usage environment of the system, such as the placement of the microphones. Target sound signal enhancement therefore operates robustly even in environments with many obstacles around the microphone array or when no information about the noise can be obtained in advance, and the target acoustic signal can be enhanced robustly even if some microphones cannot record the sound at a sufficiently high volume because of obstacles. For example, a flexible cable-like rescue robot that enters narrow gaps in rubble to search for victims has the problem that its own operating noise makes the victims' voices hard to hear; with the present invention, speech can be enhanced robustly even inside rubble.
 The common sparse component estimation unit divides the sum of the M residuals, each obtained by removing the (i-1)-th iteration's low-rank component from the corresponding channel's amplitude spectrogram, by the sum of the content ratios (volumes) with which the M sparse components are contained in the common sparse component, and takes the result as the i-th iteration's estimate of the common sparse component. These content ratios converge gradually over the course of iterative estimation with a method such as the variational Bayes EM method or the sequential variational Bayes EM method.
 When an iterative estimation method is used, the common sparse component estimation unit comprises: a low-rank component ratio calculation unit that calculates the ratio of the low-rank components contained in the M-channel amplitude spectrograms; a low-rank component calculation unit that calculates, based on that ratio, the M low-rank components contained in the spectrograms; a sparse component ratio calculation unit that calculates the ratio of the sparse components; a residual component calculation unit that calculates, based on the sparse component ratio, the residual components containing the M sparse components; a volume calculation unit that calculates, from the residual components and the common sparse component, the content ratios with which the M sparse components are contained in the common sparse component, as the volumes of the M sparse components; and a common sparse component calculation unit that calculates the common sparse component by dividing the sum of the residual components by the sum of the volumes. The common sparse component is estimated by iterative calculation across these units. With this configuration and an iterative estimation method, the common sparse component can be estimated with high accuracy by repeating the appropriate iterative operations a predetermined number of times.
 The common sparse component estimation unit can be configured as, for example, a Bayesian estimator that estimates the common sparse component by the variational Bayes EM method or the sequential variational Bayes EM method. With a Bayesian estimator, the common sparse component can be estimated easily.
 The present invention can also be specified as a target acoustic signal restoration method that uses a computer to restore a target acoustic signal included in M-channel acoustic signals collected by M microphones (M is an integer of 2 or more). In this method, the computer executes: a time-frequency analysis step that obtains M-channel complex spectrograms by time-frequency analysis of the M-channel acoustic signals; an amplitude component extraction step that extracts M-channel amplitude spectrograms from the complex spectrograms; a common sparse component estimation step that takes the M-channel amplitude spectrograms as input and estimates a common sparse component containing the sparse time-frequency components likely to be included in common in the amplitude spectrograms of the largest number of channels; a phase restoration step that restores the phase of the common sparse component to obtain a target acoustic complex spectrogram; and a target acoustic signal conversion step that converts the target acoustic complex spectrogram into a target acoustic signal, which is a time signal.
 In the common sparse component estimation step, the sum of the M residuals, each obtained by removing the (i-1)-th iteration's low-rank component from the corresponding channel's amplitude spectrogram, is divided by the sum of the content ratios with which the M sparse components are contained in the common sparse component, and the result is taken as the i-th iteration's estimate of the common sparse component.
 The common sparse component estimation step comprises: a low-rank component ratio calculation step that calculates the ratio of the low-rank components contained in the M-channel amplitude spectrograms; a low-rank component calculation step that calculates, based on that ratio, the M low-rank components; a sparse component ratio calculation step that calculates the ratio of the sparse components; a residual component calculation step that calculates, based on the sparse component ratio, the residual components containing the M sparse components; a volume calculation step that calculates, from the residual components and the common sparse component, the content ratios with which the M sparse components are contained in the common sparse component, as the volumes of the M sparse components; and a common sparse component calculation step that calculates the common sparse component by dividing the sum of the residual components by the sum of the volumes. The common sparse component is estimated by performing iterative calculations across these steps.
 The present invention can also be specified as a computer program, stored in computer-readable storage means, for realizing this target acoustic signal restoration method with a computer. The computer program causes the computer to execute the time-frequency analysis step, the amplitude component extraction step, the common sparse component estimation step, the phase restoration step, and the target acoustic signal conversion step described above.
(Brief description of the drawings)
FIG. 1 is a block diagram showing the configuration of an example embodiment of the target acoustic signal restoration system of the present invention.
FIG. 2 is a flowchart showing the algorithm of the computer program used when the common sparse component estimation unit of FIG. 1 is implemented on a computer by the iterative estimation method.
FIG. 3 is a photograph of a flexible cable-like robot equipped with a microphone array.
FIG. 4 is a diagram used to explain placement conditions 1 and 2 of the robot and the loudspeaker (sound source) that reproduces the speech.
FIGS. 5(A) and 5(B) are diagrams showing the speech enhancement performance under placement conditions 1 and 2 and the SNR conditions, in terms of the signal-to-distortion ratio (SDR).
FIG. 6 is an excerpt of 4 of the 8 channels of the multi-channel amplitude spectrograms.
FIG. 7 is an excerpt of the low-rank components L_m of 4 of the 8 channels.
FIG. 8 shows the volumes g_m of the common sparse component for 4 of the 8 channels.
FIG. 9 shows the common sparse component.
FIG. 10 shows the enhancement result with MNMF.
FIG. 11 shows the enhancement result with Med-RPCA.
FIG. 12 shows the enhancement result with RPCA.
FIG. 13 shows the result of restoring the phase of the common sparse component to obtain the target acoustic complex spectrogram and converting the target acoustic complex spectrogram into the target acoustic signal, which is a time-domain signal.
 Embodiments of the present invention will now be described in detail with reference to the drawings.
 (Configuration of the embodiment)
 FIG. 1 is a block diagram showing the configuration of an example embodiment of the target acoustic signal restoration system of the present invention, realized using a computer or a plurality of processors and a plurality of memories. In the present embodiment, specifically, a speech signal is extracted as the target acoustic signal from multi-channel acoustic signals that observe speech disturbed by low-rank noise. Even when each microphone moves and some microphones cannot observe the speech at a sufficiently large volume because of obstacles, the speech signal can still be extracted robustly. The target acoustic signal restoration system 1 of the present embodiment restores the target acoustic signal contained in M-channel acoustic signals collected by M microphones (M being an integer of 2 or more).
 The target acoustic signal restoration system 1 of the present embodiment comprises a time-frequency analysis unit 3, an amplitude component extraction unit 5, a common sparse component estimation unit 7, a phase restoration unit 9, a phase component extraction unit 11, and a target acoustic signal conversion unit 13, each realized by a computer or by one or more processors and one or more memories. The time-frequency analysis unit 3 time-frequency analyzes the M-channel acoustic signals collected by M microphones (M being an integer of 2 or more), mounted, for example, on the flexible cable-like robot described in Non-Patent Document 2 ("Speech enhancement for a flexible cable-like robot based on motion-noise suppression using robust principal component analysis"), to obtain M-channel complex spectrograms. The amplitude component extraction unit 5 extracts M-channel amplitude spectrograms from the M-channel complex spectrograms. The common sparse component estimation unit 7 receives the M-channel amplitude spectrograms and estimates a common sparse component containing sparse time-frequency components that are highly likely to be contained in common in the amplitude spectrograms of the largest number of channels among the M-channel amplitude spectrograms. The phase restoration unit 9 restores the phase of the common sparse component to obtain the target acoustic complex spectrogram. In the present embodiment, the phase information is extracted from the output of the time-frequency analysis unit 3 by the phase component extraction unit 11. The phase may instead be estimated from the M-channel amplitude spectrograms, the common sparse component, or the like; the method of obtaining the phase is arbitrary, and the present invention is therefore not limited to providing the phase component extraction unit 11. The target acoustic signal conversion unit 13 converts the target acoustic complex spectrogram into the target acoustic signal, which is a time-domain signal.
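 As an illustration of this processing chain, the following is a minimal sketch in Python with NumPy/SciPy, not the patented implementation: the STFT settings, the use of the first channel's phase for resynthesis, and the median stand-in for the estimator are all assumptions of this sketch (a fuller estimator sketch appears later, in the description of FIG. 2).

```python
# Minimal pipeline sketch: STFT -> amplitude -> common sparse component
# -> phase restoration -> inverse STFT. All settings are illustrative.
import numpy as np
from scipy.signal import stft, istft

def estimate_common_sparse(Y):
    # Stand-in for the estimator of the common sparse component described
    # in this document; the channel-wise median is used here only so that
    # the sketch runs end to end.
    return np.median(Y, axis=0)

def restore_target_signal(x, fs=16000, nperseg=512):
    # x: (M, n_samples) M-channel acoustic signals
    _, _, Ycplx = stft(x, fs=fs, nperseg=nperseg)   # (M, F, T) complex spectrograms
    Y = np.abs(Ycplx)                               # M-channel amplitude spectrograms
    S = estimate_common_sparse(Y)                   # (F, T) common sparse component
    phase = np.angle(Ycplx[0])                      # phase of a reference channel (assumption)
    _, s = istft(S * np.exp(1j * phase), fs=fs, nperseg=nperseg)
    return s                                        # target acoustic (time) signal
```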
 In the present embodiment, the common sparse component estimation unit 7 uses an iterative estimation method. The common sparse component estimation unit 7 therefore comprises a low-rank component ratio calculation unit 71, a low-rank component calculation unit 72, a sparse component ratio calculation unit 73, a residual component calculation unit 74, a volume calculation unit 75, and a common sparse component calculation unit 76. The low-rank component ratio calculation unit 71 calculates the ratio of low-rank components contained in the M-channel amplitude spectrograms. The low-rank component calculation unit 72 calculates, on the basis of the low-rank component ratio, the M low-rank components contained in the M-channel amplitude spectrograms. The sparse component ratio calculation unit 73 calculates the ratio of sparse components contained in the M-channel amplitude spectrograms. The residual component calculation unit 74 calculates, on the basis of the sparse component ratio, residual components containing the M sparse components of the M-channel amplitude spectrograms. The volume calculation unit 75 calculates, on the basis of the residual components containing the M sparse components and the common sparse component, the content ratios at which the M sparse components are contained in the common sparse component, as the volumes of the M sparse components. These content ratios are obtained in the course of iterative estimation when an iterative estimation method such as the variational Bayes EM method or the sequential variational Bayes EM method is used. The common sparse component calculation unit 76 divides the sum of the M sparse-component residuals, each obtained by removing the low-rank component of iteration i-1 from the corresponding one of the M-channel amplitude spectrograms, by the sum of the content ratios at which the M sparse components are contained in the common sparse component, and takes the result as the common sparse component at iteration i.
 The common sparse component is then estimated by performing iterative computation in the low-rank component ratio calculation unit 71, the low-rank component calculation unit 72, the sparse component ratio calculation unit 73, the residual component calculation unit 74, the volume calculation unit 75, and the common sparse component calculation unit 76.
 In the present embodiment, the common sparse component estimation unit 7 estimates the common sparse component under the assumption that the acoustic signal of each channel decomposes into a low-rank component and a common sparse component shared across channels; the phase restoration unit 9 restores the phase of this common sparse component, and the target acoustic signal conversion unit 13 converts the restored target acoustic complex spectrogram, thereby achieving noise suppression. In the present embodiment, the common sparse component is modeled so that only the content ratio (volume) of each channel's sparse component differs across channels; by estimating this content ratio (volume), robust speech enhancement is achieved even when some microphones cannot record the target acoustic signal (e.g., speech) at a sufficient volume.
 Specifically, the present invention performs target acoustic signal (e.g., speech) enhancement from the low-rankness of the noise and the sparseness of the target acoustic signal, without prior information about the noise. Because the present invention estimates a common sparse component containing sparse frequency components that are highly likely to be contained in common in the amplitude spectrograms of the largest number of channels among the M-channel amplitude spectrograms, the target acoustic signal can be restored with minimal influence from the low-rank components. As a result, the restoration accuracy can be made higher than before.
 The present embodiment is specialized for the analysis of non-negative real-valued matrices, and achieves robust target acoustic signal enhancement from the low-rankness of the noise and the sparseness of the target acoustic signal, such as speech, without recording the timbre information of the noise signal in advance. This feature enables target acoustic signal (e.g., speech) enhancement that operates robustly in environments with many obstacles around the microphone array, or when no information about the noise can be obtained beforehand. Furthermore, the target acoustic signal (e.g., speech) can be enhanced robustly even when some microphones cannot record the sound at a sufficiently large volume because of obstacles or the like.
 (Description of the generative model)
 The generative model employed in the present embodiment is described below. In the following, the target acoustic signal is sometimes referred to as the speech signal for convenience. The speech enhancement problem handled by this generative model is defined as follows.
  Input: M-channel amplitude spectrograms
$$\mathbf{Y}_m \in \mathbb{R}_+^{F \times T} \quad (m = 1, \dots, M)$$
  Output: amplitude spectrogram of the target speech
$$\mathbf{S} \in \mathbb{R}_+^{F \times T}$$
 Here, $\mathbb{R}_+$ denotes the non-negative real numbers, and F and T denote the number of frequency bins and the number of time frames, respectively. In the following, the amplitude spectrogram of a channel is called an acoustic signal, and f and t denote the frequency-bin and time-frame indices.
 In the present embodiment, the relation between the speech signal S = [s_1, ..., s_T] and its observation Y'_m = [y'_{m1}, ..., y'_{mT}] at microphone m is assumed to be a frequency-independent, time-varying linear transformation:
$$\mathbf{y}'_{mt} = g_{mt}\,\mathbf{s}_t \tag{1}$$
 Here, $g_{mt} \in \mathbb{R}_+$ denotes the coefficient of this linear transformation (the volume of the speech at each microphone, i.e., the content ratio of the sparse component at each microphone). Using this speech transfer model, the observation y_mt at each microphone is assumed to decompose as follows:
$$\mathbf{y}_{mt} = \mathbf{l}_{mt} + g_{mt}\,\mathbf{s}_t \tag{2}$$
 Here, L_m = [l_{m1}, ..., l_{mT}] and S = [s_1, ..., s_T] denote, respectively, the low-rank noise mixed into each microphone and the sparse component representing the speech. The low-rank component is further expressed as the product of K bases W_m = [w_{m1}, ..., w_{mK}] (the basis matrix) and the volumes of the bases at each time H_m = [h_{m1}, ..., h_{mT}] (the activation matrix):
$$\mathbf{L}_m = \mathbf{W}_m \mathbf{H}_m \tag{3}$$
 In the following, a Bayesian generative model for modeling the low-rankness of each channel's low-rank component and the sparseness of the common sparse component is described.
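 To make the decomposition of equations (1)-(3) concrete, the following sketch generates synthetic amplitude spectrograms that follow the model y_mft = Σ_k w_mfk h_mkt + g_mt s_ft; the dimensions and random draws are illustrative assumptions, not values from this document.

```python
# Synthetic data following eqs. (2)-(3): Y_m = W_m H_m + volume-scaled S.
import numpy as np

rng = np.random.default_rng(0)
M, F, T, K = 8, 257, 100, 10                  # illustrative sizes

W = rng.gamma(0.5, 1.0, size=(M, F, K))       # per-channel basis matrices W_m
H = rng.gamma(0.5, 1.0, size=(M, K, T))       # per-channel activation matrices H_m
S = rng.gamma(0.3, 1.0, size=(F, T))          # common sparse component (shared)
G = rng.gamma(2.0, 0.5, size=(M, T))          # per-channel, per-frame volumes g_mt

L = np.einsum('mfk,mkt->mft', W, H)           # low-rank components L_m = W_m H_m
Y = L + G[:, None, :] * S[None, :, :]         # observed amplitude spectrograms Y_m
print(Y.shape)                                # (8, 257, 100)
```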
 (Likelihood model)
 In this model, the approximation error of the input amplitude spectrograms is evaluated with the Kullback-Leibler (KL) pseudo-distance. In a Bayesian generative model, minimizing the KL pseudo-distance corresponds to maximizing a Poisson likelihood, so the likelihood model is defined as follows:
$$y_{mft} \sim \mathcal{P}\!\left(y_{mft}\,\middle|\,\sum_{k=1}^{K} w_{mfk}\,h_{mkt} + g_{mt}\,s_{ft}\right) \tag{4}$$
 Here, $\mathcal{P}(x \mid k)$ denotes the Poisson distribution with parameter $k \in \mathbb{R}_+$. In equation (4), Y_m is the m-th of the M-channel amplitude spectrograms, H_m is the volume of each basis at each time, W_m is the K bases, S is the common sparse component, and G_m is the content ratio of the sparse component at each microphone. s_ft, g_mt, w_mfk, and h_mkt denote the elements of S, G_m, W_m, and H_m, respectively, and y_mft denotes an element of the observed amplitude spectrogram.
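 The stated correspondence between the KL pseudo-distance and the Poisson likelihood can be checked numerically; the snippet below is only a sanity check (an integer observation is assumed, since the Poisson pmf is defined on counts).

```python
# D_KL(x | lam) = x*log(x/lam) - x + lam differs from -log P(x | lam)
# only by a term constant in lam, so minimizing the generalized KL
# divergence maximizes the Poisson likelihood.
import numpy as np
from scipy.stats import poisson

x = 3                                       # an observed integer magnitude
for lam in [0.5, 1.0, 3.0, 10.0]:
    d_kl = x * np.log(x / lam) - x + lam
    nll = -poisson.logpmf(x, lam)
    print(lam, round(nll - d_kl, 10))       # the gap is the same for every lam
```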
 (Prior distributions of the low-rank components)
 The prior distributions of the basis matrices and activation matrices of the low-rank components are formulated with the Gamma distribution, the conjugate prior of the Poisson distribution:
$$w_{mfk} \sim \mathcal{G}(w_{mfk} \mid \alpha^w, \beta^w) \tag{5}$$
$$h_{mkt} \sim \mathcal{G}(h_{mkt} \mid \alpha^h, \beta^h) \tag{6}$$
 Here, $\mathcal{G}(x \mid \alpha, \beta)$ denotes the Gamma distribution with shape parameter α and rate parameter β, and $(\alpha^w, \beta^w)$ and $(\alpha^h, \beta^h)$ denote the hyperparameters of the bases and the activations, respectively. In this model, setting the shape parameters to 1 or less constrains the bases and activation matrices to be sparse, which in turn constrains the low-rank components L_m to be low-rank matrices.
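 The sparsity-inducing effect of a shape parameter of 1 or less is easy to observe empirically; the sketch below compares the fraction of near-zero draws for two shape values (the values themselves are illustrative).

```python
# Gamma draws with shape <= 1 pile up near zero, which is what keeps the
# bases and activations, and hence L_m, effectively low rank.
import numpy as np

rng = np.random.default_rng(0)
for shape in (0.3, 2.0):
    w = rng.gamma(shape, 1.0 / shape, size=100_000)  # scale chosen so the mean is 1
    print(shape, np.mean(w < 0.05))                  # fraction of near-zero draws
```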
 (Prior distribution of the sparse component)
 In the generative model of Bayesian RPCA, one of the conventional methods, the sparse component was represented by placing a Gaussian prior on it and a Jeffreys hyperprior on its precision parameter. In the present embodiment, in order to represent the amplitude spectrogram, which is a non-negative matrix, a Gamma prior is placed on the sparse component, and a Jeffreys hyperprior is placed on the rate parameter of that Gamma distribution, which corresponds to the precision parameter of the Gaussian distribution:
$$s_{ft} \sim \mathcal{G}(s_{ft} \mid \alpha^s, \beta^s_{ft}) \tag{7}$$
$$p(\beta^s_{ft}) \propto 1/\beta^s_{ft} \tag{8}$$
 Here, $\alpha^s$ denotes the hyperparameter of the Gamma distribution. In the proposed model, the degree of sparseness of the sparse component is adjusted through the value of this hyperparameter.
 (Prior distribution of the volume variables)
 A Gamma distribution, the conjugate prior of the Poisson distribution, is placed on the volume (content ratio) g_mt of the sparse component at each microphone:
$$g_{mt} \sim \mathcal{G}(g_{mt} \mid \alpha^g, \alpha^g) \tag{9}$$
 Here, $\alpha^g$ is a hyperparameter that adjusts the variation of the sparse-component volumes across microphones.
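 Under this prior the volumes fluctuate around a common level, with spread controlled by α^g; the following quick check assumes the unit-mean Gamma(α^g, α^g) parameterization reconstructed in equation (9) above.

```python
# With g ~ Gamma(alpha_g, rate=alpha_g) the mean is 1 and the variance is
# 1/alpha_g, so a larger alpha_g ties the per-microphone volumes together.
import numpy as np

rng = np.random.default_rng(0)
for alpha_g in (1.0, 10.0, 100.0):
    g = rng.gamma(alpha_g, 1.0 / alpha_g, size=100_000)
    print(alpha_g, round(g.mean(), 3), round(g.var(), 4))
```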
 (Bayesian inference by the variational Bayes EM method)
 Since it is difficult to derive the posterior distribution of this model analytically when the input multi-channel amplitude spectrograms are given, approximate inference is performed with the variational Bayes EM method. In the following, let
$$\Theta = \{\mathbf{W}, \mathbf{H}, \mathbf{S}, \mathbf{G}, \boldsymbol{\beta}^s\}$$
denote the set of all parameters, and let q(·) denote a variational posterior distribution. In variational approximation, the posterior distribution is factorized and approximated as follows, and inference is performed by minimizing the KL pseudo-distance to the true posterior:
$$p(\Theta \mid \mathbf{Y}) \approx q(\Theta) = q(\mathbf{W})\,q(\mathbf{H})\,q(\mathbf{S})\,q(\mathbf{G})\,q(\boldsymbol{\beta}^s) \tag{10}$$
 Since the model of the present embodiment is built on the conjugate exponential family, each update rule can be derived easily using Jensen's inequality and the method of Lagrange multipliers. Letting ⟨·⟩ denote the expectation of a random variable, each variational posterior is obtained by iteratively applying the following update rules while keeping the other parameters fixed.
 [Equations (11)-(18), the variational update rules, are given in the original only as images and are not reproduced here.]
 Here, s'_mft denotes the residual obtained by removing the low-rank component, and the two auxiliary quantities given as images denote, respectively, the ratio at which the low-rank component is contained and the ratio at which the sparse component is contained.
 FIG. 2 is a flowchart showing the algorithm of a computer program used when the common sparse component estimation unit 7 of FIG. 1 is implemented on a computer by the iterative estimation method. In FIG. 2, the steps in which equations (11)-(18) are used are labeled with the corresponding equations. The iterative estimation terminates after 200 iterations, or when each estimated quantity Y_m, H_m, W_m, S, β^s, g_m is compared with its value at the previous iteration and the comparison shows that they are approximately equal. In FIG. 2 as well, Y_m is the m-th of the M-channel amplitude spectrograms, H_m is the volume of each basis at each time, W_m is the K bases, S is the common sparse component, β^s is a coefficient of the Bayesian estimation, and g_m is the content ratio of the sparse component.
 The computation performed by the common sparse component calculation unit 76, namely summing the M sparse-component residuals obtained by removing the low-rank components of iteration i-1 from the M-channel amplitude spectrograms, is carried out, when computing the mean of the variational posterior updated by equation (13), using the expression before the comma in equation (13). The operation of dividing the sum of the residuals by the sum of the content ratios (volumes) at which the M sparse components are contained in the common sparse component is carried out when computing the mean of the variational posterior obtained by equation (13). The result is taken as the estimate of the common sparse component at iteration i. The sum of the content ratios [Σ(g_mt) in equation (13)] gradually converges over the course of the iterations when an iterative estimation method such as the variational Bayes EM method is used. β^s_ft in equation (13) is a coefficient of the Bayesian estimation.
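 Because the exact update rules (11)-(18) survive only as images, the following is a deliberately simplified point-estimate sketch of this iteration: only the final S update (sum of the sparse-component residuals divided by the sum of the volumes) follows the description above, while the multiplicative KL-NMF-style updates for W_m, H_m, and g_mt are assumptions of the sketch. It can replace the median stand-in of the earlier pipeline sketch.

```python
import numpy as np

def estimate_common_sparse(Y, K=10, n_iter=200, eps=1e-10, seed=0):
    # Y: (M, F, T) non-negative amplitude spectrograms; returns S: (F, T).
    rng = np.random.default_rng(seed)
    M, F, T = Y.shape
    W = rng.gamma(1.0, 1.0, (M, F, K))          # bases
    H = rng.gamma(1.0, 1.0, (M, K, T))          # activations
    G = np.ones((M, T))                         # sparse-component volumes g_mt
    S = Y.mean(axis=0)                          # initial common sparse component

    def model():
        return np.einsum('mfk,mkt->mft', W, H) + G[:, None, :] * S + eps

    for _ in range(n_iter):                     # 200 iterations, as in FIG. 2
        R = Y / model()
        W *= np.einsum('mft,mkt->mfk', R, H) / (H.sum(axis=2)[:, None, :] + eps)
        R = Y / model()
        H *= np.einsum('mfk,mft->mkt', W, R) / (W.sum(axis=1)[:, :, None] + eps)
        R = Y / model()
        G *= (R * S[None]).sum(axis=1) / (S.sum(axis=0)[None, :] + eps)
        # Residuals attributed to the sparse part in each channel, then
        # S = (sum of residuals) / (sum of volumes), as described above.
        R = Y / model()
        resid = (R * (G[:, None, :] * S)).sum(axis=0)
        S = resid / (G.sum(axis=0)[None, :] + eps)
    return S
```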
 In the case of FIG. 2, when the termination condition is satisfied, the common sparse component S is supplied to the phase restoration unit 9, and the phase restoration unit 9 computes the target acoustic spectrogram s'_ft (the output) by equation (19).
 [Equation (19) is given in the original only as an image and is not reproduced here.]
 In the above equation, s_ft denotes an element of S, and y'_mft denotes an element of the observed complex spectrogram.
 In addition to the common sparse component S, the phase restoration unit 9 may also restore the target acoustic spectrogram s'_ft (the output) using the volume H_m of each basis at each time, the K bases W_m, and the sparse-component content ratio g_m. In this case, the phase restoration unit 9 computes the target acoustic spectrogram S_ft by equation (20).
 [Equation (20) is given in the original only as an image and is not reproduced here.]
 In the above equation, s_ft, g_mt, w_mfk, and h_mkt denote the elements of S, g_m, W_m, and H_m, respectively, and y'_mft denotes an element of the observed complex spectrogram.
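 Since equations (19) and (20) are available only as images, the sketch below shows two hedged stand-ins consistent with the surrounding text: plain attachment of an observed channel's phase to the estimated magnitude, and a Wiener-style mask built from W_m, H_m, g_m, and S. Neither formula is the patented equation.

```python
# Two assumed stand-ins for eqs. (19)/(20); `ref` picks a reference channel.
import numpy as np

def restore_phase_simple(S, Ycplx, ref=0):
    # eq. (19) stand-in: magnitude S with the phase of channel `ref`
    return S * np.exp(1j * np.angle(Ycplx[ref]))

def restore_phase_wiener(S, W, H, G, Ycplx, ref=0, eps=1e-10):
    # eq. (20) stand-in: mask = sparse part / (low-rank + sparse part)
    L = np.einsum('fk,kt->ft', W[ref], H[ref])
    sparse = G[ref][None, :] * S
    return (sparse / (L + sparse + eps)) * Ycplx[ref]
```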
 [Evaluation experiment]
 The speech enhancement performance of the present embodiment was evaluated using the operating noise of a flexible cable-like robot having a drive mechanism and a microphone array.
 (Flexible cable-like robot used)
 FIG. 3 shows a photograph of the flexible cable-like robot equipped with a microphone array. The body consists of a corrugated tube 38 mm in diameter and 3 m in total length. Eight microphones (M = 8) were mounted on the robot surface at 40 cm intervals, each rotated by 90 degrees relative to the previous one. The distance between the microphones at the two ends is 2.8 m. The microphones are indexed m in order from the operator's end (m = 1, ..., M). Like the Tube-type Active Scope Camera of Namari et al. [J. Fukuda, et al. Remote vertical exploration by active scope camera into collapsed buildings. In IEEE/RSJ IROS, pp. 1882-1888, 2014.], the robot advances by a drive that uses cilia and vibration motors. Seven vibration motors are mounted in series inside the robot at 40 cm intervals.
 (Experimental settings)
 (Recording conditions)
 Speech and operating noise were recorded separately using the flexible cable-like robot and mixed while varying the SNR from -20 dB to +5 dB in 5 dB steps, and the speech enhancement performance was evaluated. As shown in FIG. 4, two placement conditions of the robot and the loudspeaker (sound source) reproducing the speech were defined as condition 1 and condition 2.
 Condition 1: the robot is placed in free space, and the sound source is placed in front of the robot. The reverberation time of the room (RT60) was 750 ms.
 Condition 2: the robot is placed in a door gap, and the sound source is placed in front of the robot. Four of the microphones are blocked from the sound source by the door. The reverberation time (RT60) was 990 ms.
 The robot was driven, and 60 seconds of operating noise was recorded while swinging the robot left and right by hand and with the vibration motors. To reduce noise, the target speech was created by convolving 60-second speech recordings with impulse responses measured while the robot was stationary. Four recordings (two male and two female voices, 240 seconds in total) were used. The recordings were made with 8-channel synchronization, 24-bit quantization, and 16 kHz sampling.
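 Mixing at a prescribed SNR amounts to scaling the noise so that the speech-to-noise power ratio matches the target; a small sketch of that step follows (the placeholder signals are illustrative).

```python
# Scale the noise so that 10*log10(P_speech / P_noise) equals snr_db.
import numpy as np

def mix_at_snr(speech, noise, snr_db):
    p_s = np.mean(speech ** 2)
    p_n = np.mean(noise ** 2)
    scale = np.sqrt(p_s / (p_n * 10.0 ** (snr_db / 10.0)))
    return speech + scale * noise

rng = np.random.default_rng(0)
speech = rng.standard_normal(16000)   # placeholder signals for the sketch
noise = rng.standard_normal(16000)
mixtures = {snr: mix_at_snr(speech, noise, snr) for snr in range(-20, 10, 5)}
```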
 (Compared methods)
 In the experiment, the example of the present embodiment was compared with comparative examples using Multi-channel non-negative matrix factorization (MNMF) [D. Kitamura, et al. Efficient multichannel nonnegative matrix factorization exploiting rank-1 spatial model. In IEEE ICASSP, pp. 276-280, 2015.] and robust principal component analysis (RPCA) [C. Sun, et al. Noise reduction based on robust principal component analysis. JCIS, Vol. 10, No. 10, pp. 4403-4410, 2014.]. RPCA used the result from the tip microphone. In addition, the result of integrating the RPCA outputs of all microphones by the median (Med-RPCA) was also compared. The example of the present embodiment is an offline implementation of the conventional method [Yoshiaki Bando et al. Speech enhancement for a flexible cable-like robot based on motion-noise suppression using robust principal component analysis. In RSJ2015].
 (Evaluation metric)
 The signal-to-distortion ratio (SDR) was used as the evaluation metric. The SDR expresses the overall separation accuracy.
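 For reference, a simplified energy-ratio form of the SDR is sketched below; the BSS Eval definition commonly used in the literature additionally decomposes the error into interference and artifact terms, which this sketch omits.

```python
# Simplified SDR: 10*log10(reference energy / error energy), in dB.
import numpy as np

def sdr_db(reference, estimate):
    err = reference - estimate
    return 10.0 * np.log10(np.sum(reference ** 2) / (np.sum(err ** 2) + 1e-12))
```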
 (Experimental results)
 FIG. 5 shows the speech enhancement performance under placement conditions 1 and 2 and the SNR conditions, in terms of the signal-to-distortion ratio (SDR). For each placement/SNR condition, a higher SDR indicates better speech enhancement performance, that is, that more of the speech is retained relative to the distortion. Under condition 1, the example of the present embodiment (the proposed method) performs best when the SNR is 0 dB or less; under condition 2, it performs best when the SNR is between -15 dB and 0 dB. By contrast, Med-RPCA, the comparative example with the second-highest performance, degrades considerably under condition 2, where some of the microphones are hidden. RPCA, the third best under condition 2, performs worse there than both Med-RPCA and the proposed method. Compared with these, the example of the present embodiment (the proposed method) shows high performance under both conditions, indicating low dependence on the environment.
 FIGS. 6 to 13 show, as waveforms, the speech enhancement results of the example of the present embodiment (the proposed method) and the enhancement results of the conventional methods. These figures show that the example of the present embodiment suppresses the noise most and best enhances the speech. FIG. 6 is an excerpt of 4 of the 8 channels of the multi-channel amplitude spectrograms Y_m (m = 1, ..., 8). FIG. 7 is an excerpt of the low-rank components L_m of 4 of the 8 channels. FIG. 8 shows the volumes g_m of the common sparse component for 4 of the 8 channels. FIG. 9 shows the common sparse component. FIG. 10 shows the enhancement result with MNMF, FIG. 11 the enhancement result with Med-RPCA, and FIG. 12 the enhancement result with RPCA. FIG. 13 shows the result of restoring the phase of the common sparse component to obtain the target acoustic complex spectrogram and converting it into the target acoustic signal, a time-domain signal.
 In the present invention, speech enhancement is performed from the low-rankness of the noise and the sparseness of the target acoustic signal, without prior information about the noise; to this end, a common sparse component containing sparse frequency components that are highly likely to be contained in common in the amplitude spectrograms of the largest number of channels among the M-channel amplitude spectrograms is estimated. Noise suppression is then performed by restoring the phase of the common sparse component and converting the restored target acoustic complex spectrogram into the target acoustic signal. The target acoustic signal can therefore be restored with minimal influence from the low-rank components, and the restoration accuracy can be made higher than before.
 DESCRIPTION OF SYMBOLS
 1 target acoustic signal restoration system
 3 time-frequency analysis unit
 5 amplitude component extraction unit
 7 common sparse component estimation unit
 9 phase restoration unit
 11 phase component extraction unit
 13 target acoustic signal conversion unit
 71 low-rank component ratio calculation unit
 72 low-rank component calculation unit
 73 sparse component ratio calculation unit
 74 residual component calculation unit
 75 volume calculation unit
 76 common sparse component calculation unit

Claims (8)

  1.  A target acoustic signal restoration system for restoring a target acoustic signal contained in M-channel acoustic signals collected by M microphones (M being an integer of 2 or more), comprising:
     a time-frequency analysis unit that time-frequency analyzes the M-channel acoustic signals collected by the M microphones (M being an integer of 2 or more) to obtain M-channel complex spectrograms;
     an amplitude component extraction unit that extracts M-channel amplitude spectrograms from the M-channel complex spectrograms;
     a common sparse component estimation unit that receives the M-channel amplitude spectrograms and estimates a common sparse component containing sparse time-frequency components that are highly likely to be contained in common in the amplitude spectrograms of the largest number of channels among the M-channel amplitude spectrograms;
     a phase restoration unit that restores the phase of the common sparse component to obtain a target acoustic complex spectrogram; and
     a target acoustic signal conversion unit that converts the target acoustic complex spectrogram into the target acoustic signal, which is a time signal.
  2.  The target acoustic signal restoration system according to claim 1, wherein the common sparse component estimation unit estimates, as the common sparse component at iteration i (i being an integer of 2 or more), the result obtained by dividing the sum of the M sparse-component residuals, each obtained by removing the low-rank component of iteration i-1 from the corresponding one of the M-channel amplitude spectrograms, by the sum of the content ratios at which the M sparse components are contained in the common sparse component.
  3.  The target acoustic signal restoration system according to claim 2, wherein the common sparse component estimation unit comprises:
     a low-rank component ratio calculation unit that calculates the ratio of low-rank components contained in the M-channel amplitude spectrograms;
     a low-rank component calculation unit that calculates, on the basis of the ratio of the low-rank components, M low-rank components contained in the M-channel amplitude spectrograms;
     a sparse component ratio calculation unit that calculates the ratio of sparse components contained in the M-channel amplitude spectrograms;
     a residual component calculation unit that calculates, on the basis of the ratio of the sparse components, residual components containing the M sparse components of the M-channel amplitude spectrograms;
     a volume calculation unit that calculates, on the basis of the residual components containing the M sparse components and the common sparse component, the content ratios at which the M sparse components are contained in the common sparse component, as the volumes of the M sparse components; and
     a common sparse component calculation unit that calculates the common sparse component by dividing the sum of the residual components containing the M sparse components by the sum of the volumes of the M sparse components,
     and wherein the common sparse component is estimated by performing iterative computation in the low-rank component ratio calculation unit, the low-rank component calculation unit, the sparse component ratio calculation unit, the residual component calculation unit, the volume calculation unit, and the common sparse component calculation unit.
  4.  The target acoustic signal restoration system according to claim 1, wherein the common sparse component estimation unit is constituted by a Bayes estimator that performs Bayesian estimation of the common sparse component by the variational Bayes EM method or the sequential variational Bayes EM method.
  5.  A target acoustic signal restoration method for restoring, using a computer, a target acoustic signal contained in M-channel acoustic signals collected by M microphones (M being an integer of 2 or more), comprising:
     a time-frequency analysis step of time-frequency analyzing the M-channel acoustic signals collected by the M microphones (M being an integer of 2 or more) to obtain M-channel complex spectrograms;
     an amplitude component extraction step of extracting M-channel amplitude spectrograms from the M-channel complex spectrograms;
     a common sparse component estimation step of receiving the M-channel amplitude spectrograms and estimating a common sparse component containing sparse time-frequency components that are highly likely to be contained in common in the amplitude spectrograms of the largest number of channels among the M-channel amplitude spectrograms;
     a phase restoration step of restoring the phase of the common sparse component to obtain a target acoustic complex spectrogram; and
     a target acoustic signal conversion step of converting the target acoustic complex spectrogram into the target acoustic signal, which is a time signal.
  6.  The target acoustic signal restoration method according to claim 5, wherein, in the common sparse component estimation step, the result obtained by dividing the sum of the M sparse-component residuals, each obtained by removing the low-rank component of iteration i-1 from the corresponding one of the M-channel amplitude spectrograms, by the sum of the content ratios at which the M sparse components are contained in the common sparse component is estimated as the common sparse component at iteration i.
  7.  The target acoustic signal restoration method according to claim 6, wherein the common sparse component estimation step comprises:
     a low-rank component ratio calculation step of calculating the ratio of low-rank components contained in the M-channel amplitude spectrograms;
     a low-rank component calculation step of calculating, on the basis of the ratio of the low-rank components, M low-rank components contained in the M-channel amplitude spectrograms;
     a sparse component ratio calculation step of calculating the ratio of sparse components contained in the M-channel amplitude spectrograms;
     a residual component calculation step of calculating, on the basis of the ratio of the sparse components, residual components containing the M sparse components of the M-channel amplitude spectrograms;
     a volume calculation step of calculating, on the basis of the residual components containing the M sparse components and the common sparse component, the content ratios at which the M sparse components are contained in the common sparse component, as the volumes of the M sparse components; and
     a common sparse component calculation step of calculating the common sparse component by dividing the sum of the residual components containing the M sparse components by the sum of the volumes of the M sparse components,
     and wherein the common sparse component is estimated by performing iterative computation in the low-rank component ratio calculation step, the low-rank component calculation step, the sparse component ratio calculation step, the residual component calculation step, the volume calculation step, and the common sparse component calculation step.
  8.  A computer program, stored in computer-readable storage means, for realizing on a computer a target acoustic signal restoration method that restores a target acoustic signal contained in M-channel acoustic signals collected by M microphones (M being an integer of 2 or more), the computer program causing the computer to execute:
     a time-frequency analysis step of time-frequency analyzing the M-channel acoustic signals collected by the M microphones (M being an integer of 2 or more) to obtain M-channel complex spectrograms;
     an amplitude component extraction step of extracting M-channel amplitude spectrograms from the M-channel complex spectrograms;
     a common sparse component estimation step of receiving the M-channel amplitude spectrograms and estimating a common sparse component containing sparse time-frequency components that are highly likely to be contained in common in the amplitude spectrograms of the largest number of channels among the M-channel amplitude spectrograms;
     a phase restoration step of restoring the phase of the common sparse component to obtain a target acoustic complex spectrogram; and
     a target acoustic signal conversion step of converting the target acoustic complex spectrogram into the target acoustic signal, which is a time signal.
PCT/JP2017/019259 2016-05-23 2017-05-23 System and method for target sound signal restoration WO2017204226A1 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
JP2018519566A JP6886720B2 (en) 2016-05-23 2017-05-23 Objective Acoustic signal restoration system and method

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
JP2016102063 2016-05-23
JP2016-102063 2016-05-23

Publications (1)

Publication Number Publication Date
WO2017204226A1 true WO2017204226A1 (en) 2017-11-30

Family

ID=60412291

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/JP2017/019259 WO2017204226A1 (en) 2016-05-23 2017-05-23 System and method for target sound signal restoration

Country Status (2)

Country Link
JP (1) JP6886720B2 (en)
WO (1) WO2017204226A1 (en)


Patent Citations (1)

Publication number Priority date Publication date Assignee Title
WO2015060375A1 (en) * 2013-10-23 2015-04-30 国立大学法人 長崎大学 Biological sound signal processing device, biological sound signal processing method, and biological sound signal processing program

Non-Patent Citations (2)

Title
YOSHIAKI BANDO ET AL.: "Henbun Bayes Ta-Channel RNMF ni Motosuku Junan Sakujo Rescue Robot no Tameno Onsei Kyocho", THE 34TH ANNUAL CONFERENCE OF THE ROBOTICS SOCIETY OF JAPAN YOKOSHU, 7 September 2016 (2016-09-07) *
YOSHIAKI BANDO ET AL.: "Junan Sakujo Rescue Robot no Tameno Robust Shuseibun Bunseki o Mochiita Soko Zatsuon Yokuatsu", PROCEEDINGS OF THE 77TH NATIONAL CONVENTION OF INFORMATION PROCESSING, 17 March 2015 (2015-03-17), pages 2-505 - 2-506 *

Cited By (1)

Publication number Priority date Publication date Assignee Title
CN114363532A (en) * 2021-12-02 2022-04-15 浙江大华技术股份有限公司 Focusing method and related device

Also Published As

Publication number Publication date
JP6886720B2 (en) 2021-06-16
JPWO2017204226A1 (en) 2019-03-22

Legal Events

Code Title Description
WWE WIPO information: entry into national phase (ref document number: 2018519566; country of ref document: JP)
NENP Non-entry into the national phase (ref country code: DE)
121 EP: the EPO has been informed by WIPO that EP was designated in this application (ref document number: 17802812; country of ref document: EP; kind code of ref document: A1)
122 EP: PCT application non-entry in European phase (ref document number: 17802812; country of ref document: EP; kind code of ref document: A1)