WO2014079484A1 - Method for determining a dictionary of base components from an audio signal - Google Patents

Method for determining a dictionary of base components from an audio signal

Info

Publication number
WO2014079484A1
WO2014079484A1 PCT/EP2012/073149 EP2012073149W WO2014079484A1 WO 2014079484 A1 WO2014079484 A1 WO 2014079484A1 EP 2012073149 W EP2012073149 W EP 2012073149W WO 2014079484 A1 WO2014079484 A1 WO 2014079484A1
Authority
WO
WIPO (PCT)
Prior art keywords
matrix
denotes
negative
base
dictionary
Prior art date
Application number
PCT/EP2012/073149
Other languages
French (fr)
Inventor
Cyril JODER
Felix WENNINGER
Bjoern Schuller
David Virette
Original Assignee
Huawei Technologies Co., Ltd.
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Huawei Technologies Co., Ltd.
Priority to EP12794680.4A (patent EP2912660B1)
Priority to PCT/EP2012/073149 (publication WO2014079484A1)
Publication of WO2014079484A1

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00 Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/02 Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L21/0208 Noise filtering
    • G10L21/0216 Noise filtering characterised by the method used for estimating noise
    • G10L21/0232 Processing in the frequency domain

Definitions

  • the present invention relates to a method and a device for determining a dictionary of base components from an input signal.
  • the present invention relates to the processing of an acoustic signal input for the estimation of a feature vector dictionary for describing acoustic sources.
  • Audio signals are composed of a plurality of individual sound sources.
  • Music recordings, for example, usually comprise several instruments.
  • the signal often comprises, in addition to the speech itself, other interfering sounds which are recorded by the same microphone.
  • interfering sounds can be for example ambient noise or other people talking in the same room.
  • Non-negative Matrix Factorisation was first proposed by Paatero: "Least Squares Formulation of Robust Non-Negative Factor Analysis", Chemometrics and Intelligent Laboratory Systems 37, pp. 23-35, 1997, and has since been applied successfully to a wide variety of applications.
  • this technique has become a standard method for audio source separation, where an input audio signal is to be separated into several signals corresponding to the different acoustic sources. It is based on a decomposition of the power spectrogram of the mixture into a non-negative combination of several spectral bases, each associated with one of the sources present.
  • Non-negative Matrix Factorization (NMF) methods have been used in that context with relatively good results.
  • the non-negative constraint which is inherent to this technique complies with the structure of the audio spectrograms, and can allow for the decomposition of a sound into some meaningful components. These components form a dictionary of spectral bases which describe the signal.
  • the decomposition typically aims to estimate spectral bases corresponding to different "parts" of the spectrogram, e.g.
  • the basic principle of NMF-based audio processing 100 as schematically illustrated in Fig. 1 is to find a locally optimal factorization of a short-time magnitude spectrogram V 103 of an audio signal 101 into two factors W and H, of which the first one W represents the spectra of the events occurring in the signal 101 and the second one H their activation over time.
  • the first factor W describes the component spectra of the source model 109.
  • the second factor H describes the activations 107 of the signal spectrogram 103 of the audio signal 101.
  • the first factor W and the second factor H are matched with the short-time magnitude spectrogram V 103 of the audio signal 101 by an optimization procedure.
  • the source model 109 is pre-defined when applying supervised NMF and a joint estimation is applied for the source model 109 when using unsupervised NMF.
  • the source signal or signals 113 can be derived from the source spectrogram 111.
  • the conventional formulation of NMF is defined as follows.
  • the matrix $V$ is an $m \times n$ matrix of non-negative real values.
  • the goal is to approximate this matrix by the product of two other non-negative matrices $W \in \mathbb{R}_+^{m \times r}$ and $H \in \mathbb{R}_+^{r \times n}$, where $r \ll m, n$ holds.
  • a cost function is minimized, measuring the so-called "reconstruction error".
  • the input matrix V is given by the succession of short-time magnitude (or power) spectra of the input signal, each column of the matrix containing the values of the spectrum computed at a specific instance in time.
  • these features are given by a short-time Fourier transform of the input signal, after some window function is applied to it.
  • This matrix contains only non-negative values, because of the kind of features used.
  • the values of the matrices W and H which are estimated by the NMF are initialized by a random number generator and then updated by an iterative process.
  • the initial values can also be set according to some prior knowledge of the signal.
  • several decompositions are performed on successive mid-term windows of the signal as shown by C. Joder, F. Weninger, F. Eyben, D. Virette, B. Schuller: "Real-time Speech Separation by Semi-Supervised Nonnegative Matrix Factorization", Proc. of LVA/ICA 2012, Springer, pp. 322-329. Then, a faster convergence can be obtained by initializing the matrices according to the output of the previous decomposition.
  • some of the spectral bases can be set to a constant value, fixed by prior learning. This can be beneficial if one of the sources is known and sufficient data is available to estimate the characteristic spectra of this source. In this case, the
  • the NMF decomposition is illustrated in Fig. 2 by a simple example.
  • the figure represents a spectrogram 201 represented by the matrix V, a matrix of two spectral bases 202 represented by the matrix W and the corresponding temporal weights 203 represented by the matrix H.
  • the greyscale of the spectrogram 201 represents the amplitude of the
  • the spectrogram defines an acoustic scene which can be described as the superposition of two so called "atomic sounds".
  • NMF a two-component NMF
  • the matrices W and H as defined in Fig. 2 can be obtained.
  • Each column of W can be interpreted as a basis function for the spectra contained in V, when weighted with the corresponding values of H.
  • since the spectral bases are non-negative, they correspond to proper magnitude spectra, which can then be used to reconstruct each of the so-called "atomic sounds".
  • the example of Fig. 2 is simplistic; however, the NMF method can provide satisfactory results in separating different sound sources from realistic recordings. In these cases, a larger value of the order of decomposition r is used. Then, each "component", i.e. the product of one spectral basis with the corresponding temporal weights, is assigned to a specific source. The estimated spectrogram of each source is finally obtained by the sum of all the components attributed to the source.
  • the estimation of the dictionary of spectral bases often suffers from some inaccuracies and results in components representing several sources at the same time. Indeed, this method minimizes a reconstruction error between the original input and the decomposition, without taking into account the structure of the individual signals. As a result, the estimated bases can capture some unstructured so called "building blocks" which can be used to reconstruct several sources, whereas the goal is to match each basis to a specific source.
  • several modifications of the standard NMF method have been proposed, which impose a structure by favoring some properties of the decomposition, such as temporal continuity or component sparsity.
  • the sparsity property relates to the fact that the proportion of elements with non-zero value or, more generally, of non-negligible value is very small. In particular, the sparsity of the component activations is often enforced. This property relates to the fact that few components are active at the same time.
  • the spectrogram 300 corresponds to the succession of two musical notes, the second one having a pitch one octave higher than the first one.
  • the plots 301 and 302 are the respective spectrograms of these two notes.
  • the invention is based on the finding that sound source estimation is improved when a Wiener entropy-constrained Non-negative Matrix Factorization (WNMF) is used for the factorization which identifies different components of an input signal.
  • WNMF Wiener entropy-constrained Non-negative Matrix Factorization
  • the features are decomposed into a sparse combination of non-negative feature bases.
  • the decomposition can be used to separate the input signal into several output signals corresponding to different components.
  • the obtained dictionary of feature bases can also be used to separate the corresponding components from another signal, by decomposing this other signal according to the elements of the dictionary.
  • aspects of the invention provide a novel method for enforcing a sparse decomposition, resulting in a dictionary of spectral bases which is more characteristic of the different parts of the signal, as will be presented in the following.
  • abbreviations and notations will be used: audio
  • NMF Non-negative matrix factorization
  • the vector 1-norm is the matrix norm of an m times n matrix A defined as the sum of the absolute values of its elements
  • the invention relates to a method for determining a dictionary of base components from an audio signal, the audio signal being represented by an input matrix which columns comprise features of the audio signal at different instances in time, the method comprising: decomposing the input matrix into a product of a non-negative base matrix and a non-negative weight matrix, the decomposing being constrained by a Wiener entropy measure with respect to elements of the non-negative weight matrix, wherein components of the non-negative base matrix represent the dictionary of base components of the audio signal.
  • The decomposing being constrained by a Wiener entropy measure, also denoted as the Wiener entropy constraint, provides a novel method for enforcing sparsity.
  • The Wiener entropy, or spectral flatness, measures how flat a vector is. It is used as a sparsity penalty for NMF. By using that measure, meaningful spectral patterns are estimated and speech separation quality, measured by both signal-based and perceptual criteria, is improved. Compared to standard NMF, the complexity increase is limited.
  • the Wiener entropy constrained NMF (WNMF) can be integrated into any system using NMF.
  • the dictionary of base components represents a specific audio source of a plurality of audio sources of the audio signal.
  • a specific audio source of a multi-source audio signal can be extracted from the noisy multi-source audio signal.
  • the decomposing is performed by using a Non-negative Matrix Factorization.
  • the Wiener entropy measure can thus be adapted to a standard NMF factorization with only little overhead thereby saving computational complexity.
  • the decomposing constrained by the Wiener entropy measure is configured to enforce a sparse decomposition of the non-negative base matrix.
  • the decomposing constrained by the Wiener entropy measure comprises: forming the non-negative weight matrix such that a Wiener entropy of each column of the non-negative weight matrix is close to zero.
  • the decomposing constrained by the Wiener entropy measure comprises: minimizing a weighted sum of Wiener entropy values of columns of the non-negative weight matrix by using a cost function.
  • the cost function is according to
  • V denotes the input matrix
  • W denotes the non-negative base matrix
  • H denotes the non-negative weight matrix with elements $H_{ij}$
  • $\|\cdot\|_1$ denotes the vector 1-norm
  • the symbol $\otimes$ denotes the Hadamard product, i.e. element-wise multiplication, and the symbol $./$ denotes the element-wise division
  • $\lambda$ is a real non-negative parameter
  • the symbol $\varepsilon$ denotes a (small) positive real number
  • the operator $\lfloor \cdot \rfloor_\varepsilon$ is defined by
  • Such a cost function provides an efficient reconstruction of the original signal.
  • the method comprises: updating the cost function by one of a multiplicative update rule and a gradient descent algorithm.
  • Multiplicative update rules are easy to implement and gradient descent algorithms converge to the locally optimum solution.
  • the multiplicative update rule is according to:
  • V denotes the input matrix
  • W denotes the non-negative base matrix
  • H denotes the non-negative weight matrix with elements $H_{ij}$
  • $\lambda$ is a real non-negative parameter
  • the symbol $\otimes$ denotes the Hadamard product, i.e. element-wise multiplication
  • $1_{m,n}$ denotes a matrix of dimension m x n whose elements are all equal to one
  • A denotes a matrix of dimension r x n, defined by:
  • G denotes a matrix of dimension r x n, defined by:
  • the cost function is according to
  • V denotes the input matrix
  • W denotes the non-negative base matrix
  • H denotes the non-negative weight matrix with elements $H_{ij}$
  • $\|\cdot\|_1$ denotes the vector 1-norm
  • the symbol $\otimes$ denotes the Hadamard product, i.e. element-wise multiplication, and the symbol $./$ denotes the element-wise division
  • $\lambda$ is a real non-negative parameter
  • the symbol $\varepsilon$ denotes a (small) positive real number
  • the operator $\lfloor \cdot \rfloor_\varepsilon$ is defined by
  • Such a cost function provides an efficient reconstruction of the original signal and a homogeneous estimation of the components, regardless of the amplitude of the original signal
  • the method comprises: updating the cost function by one of a multiplicative update rule and a gradient descent algorithm.
  • Multiplicative update rules are easy to implement and gradient descent algorithms converge to the locally optimum solution.
  • the multiplicative update rule is according to:
  • V denotes the input matrix
  • W denotes the non-negative base matrix
  • H denotes the non-negative weight matrix with elements $H_{ij}$
  • $\lambda$ is a real non-negative parameter
  • the symbol $\otimes$ denotes the Hadamard product, i.e. element-wise multiplication
  • $1_{m,n}$ denotes a matrix of dimension m x n whose elements are all equal to one
  • A denotes a matrix of dimension r x n, defined by:
  • G denotes a matrix of dimension r x n, defined by:
  • the method comprises: reconstructing a plurality of output signals from the audio signal, the reconstruction being based on the input matrix, the non-negative base matrix and the non-negative weight matrix.
  • magnitude spectrograms $S_k$ of the plurality of output signals are determined according to:
  • V denotes the input matrix
  • W denotes the non-negative base matrix
  • H denotes the non-negative weight matrix
  • $W_{\cdot k}$ denotes the column-vector constituted by the k-th column of W
  • $H_{k \cdot}$ denotes the row-vector constituted by the k-th row of H
  • the symbol $\otimes$ denotes the Hadamard product, i.e. element-wise multiplication.
  • the output signals can be superposed in order to obtain signals corresponding to the combination of several components, for source separation applications.
  • magnitude spectrograms $S_k$ of the plurality of output signals are determined by a product of a column-vector $W_{\cdot k}$ constituted by the k-th column of the non-negative base matrix W and a row-vector $H_{k \cdot}$ constituted by the k-th row of the non-negative weight matrix H.
  • the method comprises: constructing output spectrograms by summing several of the magnitude spectrograms $S_k$ of the plurality of output signals.
  • the method comprises: determining a dictionary of base components from a training speech signal according to the first aspect as such or according to any of the preceding implementation forms of the first aspect; and forming a non-negative base matrix of a noisy speech signal by extending the non-negative base matrix of the training speech signal; and updating the non-negative base matrix of the noisy speech signal by using a semi-supervised Non-negative Matrix Factorization.
  • source separation is improved as a speech signal which is not corrupted by noise is used for determining the dictionary of base components.
  • the method comprises: reconstructing the speech signal based on the updated non-negative base matrix of the noisy speech signal.
  • the invention relates to a device for determining a dictionary of base components from an input signal represented by an input matrix, the device comprising: a buffer for storing the input matrix; and means for decomposing the input matrix into a product of a non-negative base matrix and a non-negative weight matrix, wherein the decomposing is constrained by a Wiener entropy measure and wherein components of the non-negative base matrix represent the dictionary of base components of the input signal.
  • By using the Wiener entropy measure, meaningful spectral patterns are estimated and thus speech separation quality, measured by both signal-based and perceptual criteria, is improved.
  • the complexity increase is not significant when compared to standard NMF implementations.
  • the Wiener entropy constrained NMF can be integrated into any device using NMF.
  • the methods and systems described herein may be implemented as software in a Digital Signal Processor (DSP), in a micro-controller or in any other side-processor or as hardware circuit within an application specific integrated circuit (ASIC).
  • DSP Digital Signal Processor
  • ASIC application specific integrated circuit
  • the means for decomposing the input matrix may be implemented as software in a Digital Signal
  • the invention can be implemented in digital electronic circuitry, or in computer hardware, firmware, software, or in combinations thereof, e.g. in available hardware or software of conventional mobile devices or hands-free communication systems or in new hardware or software dedicated for processing the methods described herein after.
  • aspects of the invention provide a method for decomposing a signal according to a Wiener entropy-constrained Non-negative Matrix Factorization (WNMF).
  • the Wiener entropy, also called "spectral flatness", of a set of non-negative values is the ratio between the geometric mean and the arithmetic mean of these values.
  • the Wiener entropy is always between zero and one, and it is equal to one if and only if all the values in the set are equal. Intuitively, a large value of the Wiener entropy corresponds to a "flat" plot and a small value corresponds to a "peaky" plot.
  • the penalty term used to measure the sparsity of the decomposition is given by a weighted sum of the Wiener entropy values of the columns of the matrix H.
  • the cost function to be minimized is then
  • $H_{ij}$ is the value in the i-th row and j-th column of the matrix H.
  • $\lambda$ is a real non-negative parameter, and the weighting parameters, one per column of the matrix H, are non-negative and can depend on the matrix H.
  • $\varepsilon$ denotes a (small) positive real number and the operator
  • $\lfloor \cdot \rfloor_\varepsilon$ is defined by
  • the reconstruction error is defined as
  • $\otimes$ is the Hadamard product, i.e. element-wise multiplication, and $./$ is the element-wise division.
  • the weighting parameters used in the above-defined cost function are all set to one.
  • the optimization of this cost function is performed by multiplicative update rules, which enforce non-negativity without needing explicit constraints.
  • A and G are two matrices of dimensions r x n, defined by:
  • $1_{m,n}$ is a matrix of dimensions m x n whose elements are all equal to one.
  • the optimization process stops when convergence is observed or when a sufficient number of iterations has been performed.
  • gradient-descent algorithms are applied instead of these multiplicative updates.
  • the weighting parameters used in the above-defined cost function are all set to the mean value of the corresponding columns of the matrix H.
  • the sparsity penalty applied to each instance in time is approximately proportional to the amplitude of the input signal at the corresponding instance in time.
  • the optimization of this cost function is performed by multiplicative update rules.
  • the updates of the decomposition are performed according to:
  • Another advantage of this setting is that the complexity of the parameter updates is reduced compared to the previous implementation.
  • Fig. 1 shows a schematic diagram 100 of a conventional Non-negative Matrix Factorization (NMF)
  • Fig. 2 shows three schematic diagrams 201, 202, 203 representing V, W and H matrices of a conventional Non-negative Matrix Factorization decomposition
  • Fig. 3 shows exemplary spectrograms of two musical notes 301, 302, a succession of the two musical notes 300 and reconstructions 303, 304 of the two musical notes
  • Fig. 4 shows a schematic diagram of a method 440 for determining a dictionary of base components from an audio signal by performing a WNMF decomposition according to an implementation form
  • Fig. 5 shows a schematic diagram of a method 500 for decomposing an audio signal into a dictionary of base components and reconstructing a set of audio signals according to an implementation form
  • Fig. 6 shows a schematic diagram of a method 600 for decomposing an audio signal into a dictionary of base components applied to a noisy speech signal according to an implementation form
  • Fig. 7 shows a schematic diagram of a de-noising system 700 according to an
  • Fig. 8 shows a schematic diagram of a device 800 for determining a dictionary of base components 804 from an audio signal 802 according to an implementation form.
  • Fig. 4 shows a schematic diagram of a method 440 for determining a dictionary of base components from an audio signal by performing a WNMF decomposition according to an implementation form.
  • the method 440 performs a WNMF decomposition 400 from a digital single-channel acoustic signal 401.
  • the digital input signal 401 is input to a short-time transform module 410, which performs a windowing into short-time frames and a transform, so as to produce non-negative feature vectors 411, e.g. magnitude spectra.
  • a buffer 420 stores these features in order to produce the matrix V 421.
  • the WNMF module 430 then performs a decomposition of the matrix V 421, representing the magnitude spectra of the input signal.
  • the outputs of this module are the matrices W 431 and H 432 which represent respectively the dictionary of feature bases and the temporal weights of these bases.
  • Fig. 5 shows a schematic diagram of a system 500 for decomposing an audio signal into a dictionary of base components and reconstructing a set of audio signals according to an implementation form.
  • the system 500 comprises a factorization element 400 performing the WNMF decomposition 400 as described above with respect to Fig. 4 and a reconstruction element 510.
  • the factorization element 400 takes as input an acoustic signal 401 and estimates a dictionary of feature bases 431 and the corresponding temporal weights 432 describing the signal. The result of the
  • the reconstruction module 510 which produces several output signals 511, 512 and 513.
  • the reconstruction module 510 exploits a so-called "soft mask” approach as described in the following.
  • a magnitude spectrogram $S_k$ is calculated as:
  • the three obtained matrices constitute the magnitude spectrograms of the three output signals.
  • the time-domain signals are then obtained by a standard approach, involving an inverse Fourier transform exploiting the phase of the original complex spectrogram, followed by an overlap-add procedure.
  • the output signals are then superposed in order to obtain signals corresponding to the combination of several components, for a source separation application.
  • the components of the system 500 described above may also be implemented as steps of a method.
  • Fig. 6 shows a schematic diagram of a system 600 for decomposing an audio signal into a dictionary of base components applied to a noisy speech signal according to an implementation form.
  • the decomposition is applied to the reduction of noise in a noisy speech signal.
  • This system 600 involves a prior training phase 610 which comprises a factorization element 400 performing the WNMF decomposition 400 as described above with respect to Fig. 4.
  • a training speech signal 601 is input to the factorization element 400, which computes a dictionary of feature bases 611 and a matrix of temporal weights 612 corresponding to the WNMF decomposition of the training signal.
  • the system 600 further comprises a short-time transform 630, a buffer 640, a semi-supervised NMF module 650 and a reconstruction module 660.
  • a single-channel noisy speech signal 621 undergoes a short-time transform 630 which calculates non-negative features 631, similarly to the element 410 described above with respect to Fig. 4.
  • the buffer 640 stores these features to produce a matrix V 641.
  • This matrix undergoes a decomposition using semi-supervised NMF 650, where the feature bases corresponding to speech are set to the values of the dictionary 611 given by the training phase.
  • the other bases are updated by the semi-supervised NMF.
  • the outputs of this decomposition 650 are the dictionary W 651 and the corresponding weights H 652. These matrices are used by a reconstruction element 660, which produces the de-noised speech signal.
  • $H_s$ is the matrix extracted from the matrix H 652 comprising the weights corresponding to the speech bases $W_s$.
  • the magnitude spectrogram S of the de-noised output signal is calculated as:
  • the time-domain signal is then obtained by the same approach as described above with respect to Fig. 5 for the reconstruction element 510.
  • the semi-supervised NMF 650 is replaced by a WNMF decomposition 400 as described above with respect to Fig. 4.
  • a noise training phase similar to 610 is performed to estimate a noise feature dictionary from a training recording of noise.
  • the dictionary W 651 is defined as the concatenation of the speech dictionary 611 and the noise dictionary, and the semi-supervised NMF 650 is replaced by a supervised NMF.
  • Fig. 7 shows a schematic diagram of a de-noising system 700 according to an
  • spectral components $W_{\text{speaker}}$ 713 and $W_{\text{noise}}$ 715 are estimated from clean speech $V_{\text{speaker}}$ 701 and noise $V_{\text{noise}}$ 703 separately using WNMF 707, 709. These spectral components $W_{\text{speaker}}$ 713 and $W_{\text{noise}}$ 715 are fed to a de-noising system 711, which exploits them to separate speech from noise.
  • the noise components are estimated on the noisy speech $V_{\text{mix}}$ 705 without noise training by the de-noising system 711 which provides the de-noised speech 717.
  • the de-noising system 711 is a supervised system. In an implementation form the de-noising system 711 is a semi-supervised system. In an implementation form the de-noising system 711 is an unsupervised NMF de-noising system where no a priori knowledge of the speech and noise models is available.
  • Fig. 8 shows a schematic diagram of a device 800 for determining a dictionary of base components 804 from an input signal 802 according to an implementation form.
  • the input signal 802 is represented by an input matrix V.
  • the device 800 comprises a buffer 803 for storing the input matrix V.
  • the device 800 further comprises decomposing means 801 for decomposing the input matrix V into a product of a non-negative base matrix W and a non-negative weight matrix H, wherein the decomposing is constrained by a Wiener entropy measure and wherein components of the non-negative base matrix W represent the dictionary of base components 804 of the input signal 802.
  • the dictionary of base components represents a specific audio source of a plurality of audio sources of the audio signal.
  • the decomposing means is configured for decomposing the input matrix V by using a Non-negative Matrix Factorization.
  • the decomposing means is configured to enforce a sparse decomposition of the non-negative base matrix W.
  • the decomposing means comprises means for forming the non-negative weight matrix H such that a Wiener entropy of each column of the non-negative weight matrix H is close to zero.
  • the decomposing means comprises means for minimizing a sum of Wiener entropy values of columns of the non-negative weight matrix H by using a cost function.
  • the cost function is according to where V denotes the input matrix, W denotes the non-negative base matrix, H denotes the non-negative weight matrix with elements $H_{ij}$, the operation
  • the device
  • the multiplicative update rule is according to:
  • V denotes the input matrix
  • W denotes the non-negative base matrix
  • H denotes the non-negative weight matrix with elements $H_{ij}$
  • the symbol $\otimes$ denotes the Hadamard product, i.e. element-wise multiplication
  • $1_{m,n}$ denotes a matrix of dimension m x n whose elements are all equal to one
  • A denotes a matrix of dimension r x n, defined by:
  • G denotes a matrix of dimension r x n, defined by:
  • the device 800 comprises means for reconstructing a plurality of output signals from the audio signal, the reconstruction being based on the input matrix V, the non-negative base matrix W and the non-negative weight matrix H.
  • magnitude spectrograms $S_k$ of the plurality of output signals are determined according to:
  • V denotes the input matrix
  • W denotes the non-negative base matrix
  • H denotes the non-negative weight matrix
  • $W_{\cdot k}$ denotes the column-vector constituted by the k-th column of W, $H_{k \cdot}$ denotes the row-vector constituted by the k-th row of H, and the symbol $\otimes$ denotes the Hadamard product, i.e. element-wise multiplication.
  • magnitude spectrograms $S_k$ of the plurality of output signals are determined by a product of a column-vector $W_{\cdot k}$ constituted by the k-th column of the non-negative base matrix W and a row-vector $H_{k \cdot}$ constituted by the k-th row of the non-negative weight matrix H.
  • the device 800 comprises means for determining a dictionary of base components from a training speech signal according to the method 400 as described above with respect to Fig. 4; and means for forming a non-negative base matrix W of a noisy speech signal by using a semi-supervised Non-negative Matrix Factorization; and means for updating the non-negative base matrix W of the noisy speech signal with the non-negative base matrix $W_s$ of the training speech signal.
  • the device 800 comprises means for reconstructing the speech signal based on the updated non-negative base matrix W of the noisy speech signal.
  • the decomposing means comprises means for minimizing a weighted sum of Wiener entropy values of columns of the non-negative weight matrix H by using a cost function, the weighting parameters of the sum being the mean values of the columns of the m n is according to
  • the device 800 comprises means for updating the cost function by multiplicative update rule according to:
  • the present disclosure also supports a computer program product including computer executable code or computer executable instructions that, when executed, causes at least one computer to execute the performing and computing steps described herein.

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Quality & Reliability (AREA)
  • Signal Processing (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Circuit For Audible Band Transducer (AREA)

Abstract

The invention relates to a method (440) for determining a dictionary of base components (431) from an audio signal (401), the audio signal (401) being represented by an input matrix (V) which columns comprise features of the audio signal (401) at different instances in time, the method (440) comprising: decomposing (430) the input matrix (V) into a product of a non-negative base matrix (W) and a non-negative weight matrix (H), the decomposing (430) being constrained by a Wiener entropy measure with respect to elements of the non-negative weight matrix (H), wherein components of the non-negative base matrix (W) represent the dictionary of base components (431) of the audio signal (401).

Description

DESCRIPTION
Method for determining a dictionary of base components from an audio signal
BACKGROUND OF THE INVENTION
The present invention relates to a method and a device for determining a dictionary of base components from an input signal. In particular, the present invention relates to the processing of an acoustic signal input for the estimation of a feature vector dictionary for describing acoustic sources.
Most audio signals are composed of a plurality of individual sound sources. Musical recordings, for example, comprise most of the time several instruments. In the case of speech communication, the signal often comprises, in addition to the speech itself, other interfering sounds which are recorded by the same microphone. Such interfering sounds can be for example ambient noise or other people talking in the same room.
Several applications would benefit from the separation of an audio signal into several parts. One of them is the reduction of acoustic noise in telephone communication, especially in hands-free systems, where the noise level is often high because of the distance between the microphone and the speaker. Another use of source separation is the extraction of a target instrument from musical signals, for karaoke or remixing applications.
Non-negative Matrix Factorization (NMF) was first proposed by Paatero: "Least Squares Formulation of Robust Non-Negative Factor Analysis", Chemometrics and Intelligent Laboratory Systems 37, pp. 23-35, 1997, and has been successfully applied to a wide variety of applications since then. In particular, this technique has become a standard method for audio source separation, where an input audio signal is to be separated into several signals corresponding to the different acoustic sources. It is based on a decomposition of the power spectrogram of the mixture into a non-negative combination of several spectral bases, each associated with one of the sources present. NMF methods have been used in that context with relatively good results. Indeed, the non-negativity constraint which is inherent to this technique complies with the structure of audio spectrograms, and can allow for the decomposition of a sound into some meaningful components. These components form a dictionary of spectral bases which describe the signal. The decomposition typically aims to estimate spectral bases corresponding to different "parts" of the spectrogram, e.g. different sounds or speakers. A separation of these parts can then be performed by a partial reconstruction of the signal, considering only the wanted components. This technique has been applied by C. Joder, F. Weninger, F. Eyben, D. Virette, B. Schuller, "Real-time Speech Separation by Semi-Supervised Nonnegative Matrix Factorization", Proc. International Conference on Latent Variable Analysis and Signal Separation, March 2012, in particular, to the separation of a target speaker from noisy recordings.
The basic principle of NMF-based audio processing 100, as schematically illustrated in Fig. 1, is to find a locally optimal factorization of a short-time magnitude spectrogram V 103 of an audio signal 101 into two factors W and H, of which the first one, W, represents the spectra of the events occurring in the signal 101 and the second one, H, their activation over time. The first factor W describes the component spectra of the source model 109. The second factor H describes the activations 107 of the signal spectrogram 103 of the audio signal 101. The first factor W and the second factor H are matched with the short-time magnitude spectrogram V 103 of the audio signal 101 by an optimization procedure. The source model 109 is pre-defined when applying supervised NMF, and a joint estimation is applied for the source model 109 when using unsupervised NMF. The source signal or signals 113 can be derived from the source spectrogram 111.
The conventional formulation of NMF is defined as follows. The matrix V is an m × n matrix of non-negative real values. The goal is to approximate this matrix by the product of two other non-negative matrices W ∈ R_+^{m×r} and H ∈ R_+^{r×n}, where r ≪ m, n holds. In mathematical terms, a cost function is minimized, measuring the so-called "reconstruction error"
D(V, W·H),
where the term D describes some distance or divergence function. When processing sounds, the input matrix V is given by the succession of short-time magnitude (or power) spectra of the input signal, each column of the matrix containing the values of the spectrum computed at a specific instance in time. In general, these features are given by a short-time Fourier transform of the input signal, after some window function is applied to it. This matrix contains only non-negative values, because of the kind of features used. Typically, the values of the matrices W and H which are estimated by the NMF are initialized by a random number generator and then updated by an iterative process.
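As a minimal sketch of how such an input matrix could be assembled, the NumPy snippet below frames a signal, applies a Hann window and keeps the magnitude of the FFT of each frame as one column of V, then draws random non-negative initial values for W and H. The frame length, hop size, test tone and order r = 2 are illustrative assumptions, not values taken from this description.

```python
import numpy as np

def magnitude_spectrogram(signal, frame_len=256, hop=128):
    """Return V (m x n): one short-time magnitude spectrum per column."""
    window = np.hanning(frame_len)
    n_frames = 1 + (len(signal) - frame_len) // hop
    # Each column is one windowed frame of the signal.
    frames = np.stack(
        [signal[t * hop : t * hop + frame_len] * window for t in range(n_frames)],
        axis=1,
    )
    # The magnitude of the one-sided FFT is non-negative by construction.
    return np.abs(np.fft.rfft(frames, axis=0))

rng = np.random.default_rng(0)
sample_rate = 8000                         # assumed sampling rate
time = np.arange(sample_rate) / sample_rate
signal = np.sin(2 * np.pi * 440.0 * time)  # assumed test tone

V = magnitude_spectrogram(signal)
m, n = V.shape
r = 2                                      # assumed order of decomposition
W = rng.random((m, r)) + 1e-3              # random non-negative initial bases
H = rng.random((r, n)) + 1e-3              # random non-negative initial weights
```

The small offset added to the random values only keeps the initial entries strictly positive; it is a convenience of this sketch.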
However, the initial values can also be set according to some prior knowledge of the signal. In particular, for an implementation in an on-line system, several decompositions are performed on successive mid-term windows of the signal, as shown by C. Joder, F. Weninger, F. Eyben, D. Virette, B. Schuller: "Real-time Speech Separation by Semi-Supervised Nonnegative Matrix Factorization", Proc. of LVA/ICA 2012, Springer, pp. 322-329. Then, a faster convergence can be obtained by initializing the matrices according to the output of the previous decomposition.
Similarly, some of the spectral bases can be set to a constant value, fixed by prior learning. This can be beneficial if one of the sources is known and sufficient data is available to estimate the characteristic spectra of this source. In this case, the corresponding columns of W are not updated. The method in which the matrix W is entirely constant during the decomposition and the method in which the matrix W is entirely updated are called supervised NMF and unsupervised NMF, respectively. In the case where only a part of the spectral bases is updated, the method is called semi-supervised NMF.
The NMF decomposition is illustrated in Fig. 2 by a simple example. The figure represents a spectrogram 201 represented by the matrix V, a matrix of two spectral bases 202 represented by the matrix W and the corresponding temporal weights 203 represented by the matrix H. The greyscale of the spectrogram 201 represents the amplitude of the Fourier coefficients. The spectrogram defines an acoustic scene which can be described as the superposition of two so-called "atomic sounds". By applying a two-component NMF to this spectrogram, the matrices W and H as defined in Fig. 2 can be obtained. Each column of W can be interpreted as a basis function for the spectra contained in V, when weighted with the corresponding values of H.
Since the spectral bases are non-negative, they correspond to proper magnitude spectra, which can then be used to reconstruct each of the so called "atomic sounds". The example of Fig. 2 is simplistic; however the NMF method can provide satisfactory results in separating different sound sources from realistic recordings. In these cases, a larger value of the order of decomposition r is used. Then, each "component", i.e. the product of one spectral basis with the corresponding temporal weights, is assigned to a specific source. The estimated spectrogram of each source is finally obtained by the sum of all the components attributed to the source.
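The grouping of components into sources can be sketched as follows. Each component is the outer product of one column of W with the matching row of H, and components carrying the same label are summed. The label list `assignment` and the small matrices are hypothetical; how components are attributed to sources is not specified here.

```python
import numpy as np

def source_spectrograms(W, H, assignment):
    """Sum the components W[:, k] * H[k, :] that share a source label.

    assignment[k] is the (hypothetical) source label of component k.
    """
    spectrograms = {}
    for k, source in enumerate(assignment):
        component = np.outer(W[:, k], H[k, :])
        spectrograms[source] = spectrograms.get(source, 0) + component
    return spectrograms

# Toy decomposition of order r = 3 with two sources.
W = np.array([[1.0, 0.0, 2.0],
              [0.0, 1.0, 1.0]])
H = np.array([[1.0, 2.0],
              [3.0, 0.0],
              [0.0, 1.0]])
per_source = source_spectrograms(W, H, ["speech", "noise", "speech"])
```

Because every component is attributed to exactly one source, the per-source spectrograms add up to the full product W·H.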
However, in the conventional NMF method, the estimation of the dictionary of spectral bases often suffers from some inaccuracies and results in components representing several sources at the same time. Indeed, this method minimizes a reconstruction error between the original input and the decomposition, without taking into account the structure of the individual signals. As a result, the estimated bases can capture some unstructured so called "building blocks" which can be used to reconstruct several sources, whereas the goal is to match each basis to a specific source. In order to overcome this problem, several modifications of the standard NMF method have been proposed, which impose a structure by favoring some properties of the decomposition, such as temporal continuity or component sparsity.
The sparsity property relates to the fact that the proportion of elements with non-zero value or, more generally, of non-negligible value is very small. In particular, the sparsity of the component activations is often enforced. This property relates to the fact that few components are active at the same time.
A simple example of the usefulness of a sparsity constraint is represented in Fig. 3. The spectrogram 300 corresponds to the succession of two musical notes, the second one having a pitch one octave higher than the first one. The plots 301 and 302 are the respective spectrograms of these two notes. However, without any constraint on the structure of the decomposition, an NMF factorization with order r=2 applied to the spectrogram 300 can also result in the estimation of the spectrograms 303 and 304, since they yield the same perfect reconstruction of the original signal. Enforcing the sparsity property would favor the first decomposition.
This constraint is generally achieved by adding a penalty term in the cost function to be minimized. The cost function then becomes
D(V, W·H) + λ f(H),
where λ is a real non-negative parameter and f is a function measuring the sparsity of the matrix H. The use of the "pure sparsity measure", that is, the number of positive elements in the decomposition, as a penalty in the NMF generally leads to an intractable problem because of its lack of regularity. Thus, the common practice is to approximate this measure with the L1 norm, also called the Manhattan distance, according to A. Cichocki, R. Zdunek, S. Amari, "New Algorithms for Non-negative Matrix Factorization in Application to Blind Source Separation", Proc. of IEEE ICASSP 2006. Other variants of this criterion have also been employed, such as a normalized version of the L1 norm according to T. Virtanen, "Monaural Sound Source Separation by Nonnegative Matrix Factorization with Temporal Continuity and Sparseness Criteria", IEEE Trans. on Audio, Speech and Signal Process., vol. 15(3), pp. 1066-1074, 2007, or the ratio between the L1 and the L2 norm according to P. Hoyer, "Non-negative Matrix Factorization with Sparseness Constraints", Journal of Machine Learning Research, Vol. 5, pp. 1457-1469, 2004.
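The effect of an L1 penalty on the two-note situation of Fig. 3 can be illustrated with a toy example. The four-bin spectra, the squared Euclidean reconstruction error and the value of λ are assumptions chosen for readability. Both decompositions reconstruct V exactly, so only the penalty term separates them.

```python
import numpy as np

# Bins 0..3 stand for the harmonics f, 2f, 3f, 4f (toy values).
note_low = np.array([1.0, 1.0, 1.0, 1.0])    # all harmonics of f
note_high = np.array([0.0, 1.0, 0.0, 1.0])   # one octave up: harmonics of 2f
V = np.stack([note_low, note_high], axis=1)  # frame 0: low note, frame 1: high note

# Decomposition A: one basis per note, one active component per frame.
W_a = np.stack([note_low, note_high], axis=1)
H_a = np.eye(2)

# Decomposition B: "odd harmonics" and "even harmonics" bases.
odd_harmonics = note_low - note_high
W_b = np.stack([odd_harmonics, note_high], axis=1)
H_b = np.array([[1.0, 0.0],
                [1.0, 1.0]])

def l1_penalised_cost(V, W, H, lam):
    """Squared error (assumed choice of D) plus lam times the L1 norm of H."""
    return np.sum((V - W @ H) ** 2) + lam * np.sum(np.abs(H))

cost_a = l1_penalised_cost(V, W_a, H_a, lam=0.5)
cost_b = l1_penalised_cost(V, W_b, H_b, lam=0.5)
```

With these values decomposition A has two non-zero weights and decomposition B has three, so the penalised cost favours A.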
SUMMARY OF THE INVENTION
It is the object of the invention to provide a concept for improving sound source estimation when using Non-Negative Matrix Factorization decompositions.
This object is achieved by the features of the independent claims. Further implementation forms are apparent from the dependent claims, the description and the figures.
The invention is based on the finding that sound source estimation is improved when a Wiener entropy-constrained Non-negative Matrix Factorization (WNMF) is used for the factorization which identifies different components of an input signal. Applying this technique to non-negative features describing an input signal, such as magnitude spectra, the features are decomposed into a sparse combination of non-negative feature bases. The decomposition can be used to separate the input signal into several output signals corresponding to different components. The obtained dictionary of feature bases can also be used to separate the corresponding components from another signal, by decomposing this other signal according to the elements of the dictionary.
Aspects of the invention provide a novel method for enforcing a sparse decomposition, resulting in a dictionary of spectral bases which is more characteristic of the different parts of the signal, as will be presented in the following. In order to describe the invention in detail, the following terms, abbreviations and notations will be used: audio
a reproduction technique capable of creating spatial sound fields in an extended area by means of loudspeakers or loudspeaker arrays,
NMF: Non-negative matrix factorization,
WNMF: Wiener entropy-constrained Non-negative Matrix Factorization
Vector 1-norm: The vector 1-norm is the matrix norm of an m × n matrix A defined as the sum of the absolute values of its elements,
‖A‖_1 = Σ_{i=1}^{m} Σ_{j=1}^{n} |A_{i,j}|.
Hadamard product: The Hadamard product is a binary operation that takes two matrices of the same dimensions and produces another matrix in which each element ij is the product of the elements ij of the original two matrices.
According to a first aspect, the invention relates to a method for determining a dictionary of base components from an audio signal, the audio signal being represented by an input matrix which columns comprise features of the audio signal at different instances in time, the method comprising: decomposing the input matrix into a product of a non-negative base matrix and a non-negative weight matrix, the decomposing being constrained by a Wiener entropy measure with respect to elements of the non-negative weight matrix, wherein components of the non-negative base matrix represent the dictionary of base components of the audio signal.
The decomposing being constrained by a Wiener entropy measure, also denoted as the Wiener entropy constraint, is a new constraint providing a novel method for enforcing sparsity. The Wiener entropy, or spectral flatness, measures how flat a vector is. It is used as a sparsity penalty for NMF. By using that measure, meaningful spectral patterns are estimated and speech separation quality, measured by both signal-based and perceptual criteria, is improved. Compared to standard NMF, the complexity increase is limited. The Wiener entropy-constrained NMF (WNMF) can be integrated into any system using NMF.
In a first possible implementation form of the method according to the first aspect, the dictionary of base components represents a specific audio source of a plurality of audio sources of the audio signal.
Thus, a specific audio source of a multi-source audio signal can be extracted from the noisy multi-source audio signal.
In a second possible implementation form of the method according to the first aspect as such or according to the first implementation form of the first aspect, the decomposing is performed by using a Non-negative Matrix Factorization. The Wiener entropy measure can thus be adapted to a standard NMF factorization with only little overhead thereby saving computational complexity.
In a third possible implementation form of the method according to the first aspect as such or according to any of the preceding implementation forms of the first aspect, the decomposing constrained by the Wiener entropy measure is configured to enforce a sparse decomposition of the non-negative base matrix.
Computing with sparse matrices improves speed and reduces complexity.
In a fourth possible implementation form of the method according to the first aspect as such or according to any of the preceding implementation forms of the first aspect, the decomposing constrained by the Wiener entropy measure comprises: forming the non-negative weight matrix such that a Wiener entropy of each column of the non-negative weight matrix is close to zero.
By that specific forming of the H matrix, reconstruction of the original signal is improved.
In a fifth possible implementation form of the method according to the first aspect as such or according to any of the preceding implementation forms of the first aspect, the decomposing constrained by the Wiener entropy measure comprises: minimizing a weighted sum of Wiener entropy values of columns of the non-negative weight matrix by using a cost function.
By using a cost function, iterative or recursive adaptations can be applied which are computationally efficient. Reconstruction of the original signal is improved.
In a sixth possible implementation form of the method according to the fifth implementation form of the first aspect, the cost function is according to
Figure imgf000010_0001
where V denotes the input matrix, W denotes the non-negative base matrix, H denotes the non-negative weight matrix with elements H_{i,j}, the operation ‖·‖_1 denotes the vector 1-norm, the symbol ⊗ denotes the Hadamard product, i.e. element-wise multiplication, the division symbol denotes the element-wise division, λ is a real non-negative parameter, the symbol ε denotes a (small) positive real number and the operator ⌊·⌋_ε is defined by
⌊x⌋_ε = max(x, ε).
Such a cost function provides an efficient reconstruction of the original signal.
In a seventh possible implementation form of the method according to the fifth implementation form or according to the sixth implementation form of the first aspect, the method comprises: updating the cost function by one of a multiplicative update rule and a gradient descent algorithm.
Multiplicative update rules are easy to implement and gradient descent algorithms converge to the locally optimum solution.
In an eighth possible implementation form of the method according to the seventh implementation form of the first aspect, the multiplicative update rule is according to:
«- = η- 8 ¾ and H = H ® W'
"*'" W1 · '>».» + λ ¥Α ® Ή
where V denotes the input matrix, W denotes the non-negative base matrix, H denotes the non-negative weight matrix with elements H_{i,j}, λ is a real non-negative parameter, the symbol ⊗ denotes the Hadamard product, i.e. element-wise multiplication, 1_{m,n} denotes a matrix of dimension m × n whose elements are all equal to one, A denotes a matrix of dimension r × n, defined by:
Figure imgf000011_0001
and G denotes a matrix of dimension r × n, defined by:
Figure imgf000011_0002
where the symbol ε denotes a positive real number and the operator ⌊·⌋_ε is defined by
⌊x⌋_ε = max(x, ε).
These multiplicative update rules are easy to implement and fast converging.
In a ninth possible implementation form of the method according to the fifth implementation form of the first aspect, the cost function is according to
Figure imgf000011_0003
where V denotes the input matrix, W denotes the non-negative base matrix, H denotes the non-negative weight matrix with elements H_{i,j}, the operation ‖·‖_1 denotes the vector 1-norm, the symbol ⊗ denotes the Hadamard product, i.e. element-wise multiplication, the division symbol denotes the element-wise division, λ is a real non-negative parameter, the symbol ε denotes a (small) positive real number and the operator ⌊·⌋_ε is defined by
⌊x⌋_ε = max(x, ε).
Such a cost function provides an efficient reconstruction of the original signal and a homogeneous estimation of the components, regardless of the amplitude of the original signal.
In a tenth possible implementation form of the method according to the ninth implementation form or according to the sixth implementation form of the first aspect, the method comprises: updating the cost function by one of a multiplicative update rule and a gradient descent algorithm.
Multiplicative update rules are easy to implement and gradient descent algorithms converge to the locally optimum solution.
In an eleventh possible implementation form of the method according to the tenth implementation form of the first aspect, the multiplicative update rule is according to:
Figure imgf000012_0001
where V denotes the input matrix, W denotes the non-negative base matrix, H denotes the non-negative weight matrix with elements H_{i,j}, λ is a real non-negative parameter, the symbol ⊗ denotes the Hadamard product, i.e. element-wise multiplication, 1_{m,n} denotes a matrix of dimension m × n whose elements are all equal to one, A denotes a matrix of dimension r × n, defined by:
Figure imgf000012_0002
and G denotes a matrix of dimension r × n, defined by:
Figure imgf000012_0003
where the symbol ε denotes a positive real number and the operator ⌊·⌋_ε is defined by
⌊x⌋_ε = max(x, ε).
These multiplicative update rules are easy to implement and fast converging.
In a twelfth possible implementation form of the method according to the first aspect as such or according to any of the preceding implementation forms of the first aspect, the method comprises: reconstructing a plurality of output signals from the audio signal, the reconstruction being based on the input matrix, the non-negative base matrix and the non-negative weight matrix.
The reconstructed signals are noise-reduced and they indicate the source components of the original audio signal.
In a thirteenth possible implementation form of the method according to the twelfth implementation form of the first aspect, magnitude spectrograms S_k of the plurality of output signals are determined according to:
S_k = ((W_{·k}·H_{k·}) / (W·H)) ⊗ V,
where V denotes the input matrix, W denotes the non-negative base matrix, H denotes the non-negative weight matrix, W_{·k} denotes the column-vector constituted by the k-th column of W, H_{k·} denotes the row-vector constituted by the k-th row of H, the symbol ⊗ denotes the Hadamard product, i.e. element-wise multiplication, and the division is element-wise.
The output signals can be superposed in order to obtain signals corresponding to the combination of several components, for source separation applications.
In a fourteenth possible implementation form of the method according to the twelfth implementation form of the first aspect, magnitude spectrograms S_k of the plurality of output signals are determined by a product of a column-vector W_{·k} constituted by the k-th column of the non-negative base matrix W and a row-vector H_{k·} constituted by the k-th row of the non-negative weight matrix H.
When the output signals are directly reconstructed, computational complexity is reduced.
In a fifteenth possible implementation form of the method according to the thirteenth implementation form or according to the fourteenth implementation form of the first aspect, the method comprises: constructing output spectrograms by summing several of the magnitude spectrograms S_k of the plurality of output signals.
In a sixteenth possible implementation form of the method according to the first aspect as such or according to any of the preceding implementation forms of the first aspect, the method comprises: determining a dictionary of base components from a training speech signal according to the first aspect as such or according to any of the preceding implementation forms of the first aspect; and forming a non-negative base matrix of a noisy speech signal by extending the non-negative base matrix of the training speech signal; and updating the non-negative base matrix of the noisy speech signal by using a semi-supervised Non-negative Matrix Factorization.
When using a training speech signal, source separation is improved, as a speech signal which is not corrupted by noise is used for determining the dictionary of base components.
In a seventeenth possible implementation form of the method according to the sixteenth implementation form of the first aspect, the method comprises: reconstructing the speech signal based on the updated non-negative base matrix of the noisy speech signal.
Accuracy of the reconstruction is improved when the reconstruction is based on the updated base matrix of the noisy speech signal.
According to a second aspect, the invention relates to a device for determining a dictionary of base components from an input signal represented by an input matrix, the device comprising: a buffer for storing the input matrix; and means for decomposing the input matrix into a product of a non-negative base matrix and a non-negative weight matrix, wherein the decomposing is constrained by a Wiener entropy measure and wherein components of the non-negative base matrix represent the dictionary of base components of the input signal.
By using the Wiener entropy measure, meaningful spectral patterns are estimated and thus speech separation quality, measured by both signal-based and perceptual criteria, is improved. The complexity increase is not significant when compared to standard NMF implementations. The Wiener entropy-constrained NMF can be integrated into any device using NMF.
The methods and systems described herein may be implemented as software in a Digital Signal Processor (DSP), in a micro-controller or in any other side-processor or as a hardware circuit within an application specific integrated circuit (ASIC). The means for decomposing the input matrix may be implemented as software in a Digital Signal Processor (DSP), in a micro-controller or in any other side-processor or as a hardware unit, e.g. within an application specific integrated circuit (ASIC).
The invention can be implemented in digital electronic circuitry, or in computer hardware, firmware, software, or in combinations thereof, e.g. in available hardware or software of conventional mobile devices or hands-free communication systems or in new hardware or software dedicated for processing the methods described herein after.
Aspects of the invention provide a method for decomposing a signal according to a Wiener entropy-constrained Non-negative Matrix Factorization (WNMF). This method brings a modification to Non-negative Matrix Factorization (NMF) to enforce a sparse decomposition of a non-negative matrix.
The Wiener entropy, also called "spectral flatness", of a set of non-negative values is the ratio between the geometric mean and the arithmetic mean of these values. The Wiener entropy is always between zero and one, and it is equal to one if and only if all the values in the set are equal. Intuitively, a large value of the Wiener entropy corresponds to a "flat" plot and a small value corresponds to a "peaky" plot. Hence, to enforce the sparsity property of the NMF decomposition, it has to be ensured that the Wiener entropy of each column of H is small.
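A minimal sketch of this measure in NumPy is given below. The geometric mean is computed through logarithms, and values are floored at a small positive `eps` so that the logarithm is defined; the particular value of `eps` is an assumption.

```python
import numpy as np

def wiener_entropy(values, eps=1e-12):
    """Ratio of geometric mean to arithmetic mean of non-negative values."""
    x = np.maximum(np.asarray(values, dtype=float), eps)  # floor at eps
    geometric_mean = np.exp(np.mean(np.log(x)))
    arithmetic_mean = np.mean(x)
    return geometric_mean / arithmetic_mean

flat = wiener_entropy([1.0, 1.0, 1.0, 1.0])    # all values equal
peaky = wiener_entropy([1.0, 0.0, 0.0, 0.0])   # a single active value
```

A column of H with a single active component therefore scores close to zero, while a column in which all components are equally active scores one.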
In the WNMF method, the penalty term used to measure the sparsity of the decomposition is given by a weighted sum of the Wiener entropy values of the columns of the matrix H. The cost function to be minimized is then
Figure imgf000015_0001
where H_{i,j} is the value in the i-th row and j-th column of the matrix H, λ is a real non-negative parameter and the parameters ω_j are non-negative weighting parameters, which can depend on the matrix H. The symbol ε denotes a (small) positive real number and the operator ⌊·⌋_ε is defined by
⌊x⌋_ε = max(x, ε).
This maximum ensures that the penalty term is positive, and that sparsity is enforced even when one of the weights is equal to zero. A variety of functions can be used for measuring the reconstruction error. In an implementation form, the reconstruction error is defined as
D(V, W·H) = ‖V ⊗ ln(V / (W·H)) - V + W·H‖_1,
where the operation ‖·‖_1 denotes the vector 1-norm, the symbol ⊗ denotes the Hadamard product, i.e. element-wise multiplication, and the division is element-wise. In an implementation form, the weighting parameters ω_j used in the above-defined cost function are all set to one.
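One possible reading of this cost function is sketched below: the reconstruction error above plus λ times a weighted sum of the Wiener entropy of each column of H, computed on values floored at ε. The exact formula is an image that is not reproduced in this text, so the form of the penalty, the function names and the default values are assumptions.

```python
import numpy as np

def reconstruction_error(V, WH, tiny=1e-12):
    """1-norm of V * ln(V / WH) - V + WH, with 0 * ln(0) taken as 0."""
    ratio = np.where(V > 0, V / (WH + tiny), 1.0)
    return np.sum(np.abs(V * np.log(ratio) - V + WH))

def wnmf_cost(V, W, H, lam, omega=None, eps=1e-9):
    """Assumed WNMF cost: reconstruction error + lam * weighted Wiener entropies."""
    H_floor = np.maximum(H, eps)                      # the floor operator
    geometric = np.exp(np.mean(np.log(H_floor), axis=0))
    arithmetic = np.mean(H_floor, axis=0)
    entropy = geometric / arithmetic                  # one value per column of H
    if omega is None:
        omega = np.ones(H.shape[1])                   # weights all set to one
    return reconstruction_error(V, W @ H) + lam * np.sum(omega * entropy)

rng = np.random.default_rng(0)
W = rng.random((6, 3)) + 0.1
H_flat = np.ones((3, 5))       # every component equally active in every frame
H_sparse = np.eye(3)           # one active component per frame
cost_flat = wnmf_cost(W @ H_flat, W, H_flat, lam=2.0)
cost_sparse = wnmf_cost(W @ H_sparse, W, H_sparse, lam=2.0)
```

In both cases V equals W·H, so the reconstruction error vanishes and the cost reduces to the penalty: about 2.0 × 5 for the flat weights and nearly zero for the sparse ones.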
In an implementation form, the optimization of this cost function is performed by multiplicative update rules, which enforce non-negativity without needing explicit constraints. In an implementation form, A and G are two matrices of dimensions r x n, defined by:
and
The updates of the
Figure imgf000016_0001
where 1_{m,n} is a matrix of dimensions m × n whose elements are all equal to one. The optimization process stops when convergence is observed or when a sufficient number of iterations has been performed.
In an alternative implementation form, gradient-descent algorithms are applied instead of these multiplicative updates.
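A hedged sketch of such a gradient-descent alternative for the weight matrix H is shown below, with all weights set to one. It assumes the cost is the reconstruction error defined above plus λ times the sum of the Wiener entropies of the columns of H, that V is strictly positive and that H stays above ε. The step size and the projection onto values of at least ε are choices made for this sketch, not details given in the text.

```python
import numpy as np

def cost(V, W, H, lam, eps=1e-9):
    """Assumed cost: KL-type reconstruction error + lam * sum of column entropies."""
    H_floor = np.maximum(H, eps)
    WH = W @ H
    error = np.sum(V * np.log(V / WH) - V + WH)
    entropy = np.exp(np.mean(np.log(H_floor), axis=0)) / np.mean(H_floor, axis=0)
    return error + lam * np.sum(entropy)

def grad_H(V, W, H, lam, eps=1e-9):
    """Analytic gradient of the assumed cost with respect to H (valid for H > eps)."""
    H_floor = np.maximum(H, eps)
    r = H.shape[0]
    arithmetic = np.mean(H_floor, axis=0)
    entropy = np.exp(np.mean(np.log(H_floor), axis=0)) / arithmetic
    grad_error = W.T @ (1.0 - V / (W @ H))
    # d(entropy_j)/dH_ij = (entropy_j / r) * (1 / H_ij - 1 / mean_j)
    grad_penalty = (entropy / r) * (1.0 / H_floor - 1.0 / arithmetic)
    return grad_error + lam * grad_penalty

def projected_step(V, W, H, lam, step, eps=1e-9):
    """One gradient step on H, followed by projection onto H >= eps."""
    return np.maximum(H - step * grad_H(V, W, H, lam, eps), eps)

rng = np.random.default_rng(0)
V = rng.random((6, 5)) + 0.1
W = rng.random((6, 3)) + 0.1
H = rng.random((3, 5)) + 0.1
H_next = projected_step(V, W, H, lam=0.5, step=1e-3)
```

Unlike multiplicative updates, a plain gradient step can leave the non-negative domain, which is why the projection is added here.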
In an alternative implementation form, the weighting parameters ω_j used in the above-defined cost function are each set to the mean value of the corresponding column of the matrix H.
Figure imgf000016_0002
Hence, the sparsity penalty applied to each instance in time is approximately proportional to the amplitude of the input signal at the corresponding instance in time.
This ensures that the relative orders of magnitude of the sparsity term and the reconstruction error term are homogeneous over time. Thus, the relative importance of both constraints does not depend on the amplitude of the input signal. In this case, the cost function simplifies to:
Figure imgf000017_0001
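The simplification can be checked numerically. If the penalty for column j is ω_j times the ratio of geometric to arithmetic mean, and ω_j is the arithmetic mean of that column, the arithmetic mean cancels and only the geometric mean remains. The snippet assumes this form of the penalty and uses entries well above ε so that the floor has no effect.

```python
import numpy as np

rng = np.random.default_rng(1)
H = rng.random((4, 6)) + 0.1          # entries well above the floor
eps = 1e-9

def column_means(H, eps):
    """Geometric and arithmetic mean of each column of the floored matrix."""
    H_floor = np.maximum(H, eps)
    geometric = np.exp(np.mean(np.log(H_floor), axis=0))
    arithmetic = np.mean(H_floor, axis=0)
    return geometric, arithmetic

geometric, arithmetic = column_means(H, eps)
weighted_penalty = np.sum(arithmetic * (geometric / arithmetic))  # omega_j = column mean
simplified_penalty = np.sum(geometric)                            # arithmetic mean cancelled

# Scaling the weights by a factor scales the penalty by the same factor.
geometric_scaled, _ = column_means(10.0 * H, eps)
scaled_penalty = np.sum(geometric_scaled)
```

The last two lines illustrate the homogeneity mentioned above: the penalty grows in proportion to the amplitude of the weights.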
In an implementation form, the optimization of this cost function is performed by multiplicative update rules. The updates of the decomposition are performed according to:
W = W and H = H ® W ' H
Figure imgf000017_0002
Another advantage of this setting is that the complexity of the parameter updates is reduced compared to the previous implementation.
BRIEF DESCRIPTION OF THE DRAWINGS
Further embodiments of the invention will be described with respect to the following figures, in which:
Fig. 1 shows a schematic diagram 100 of a conventional Non-negative Matrix Factorization (NMF) technique;
Fig. 2 shows three schematic diagrams 201, 202, 203 representing V, W and H matrices of a conventional Non-negative Matrix Factorization decomposition;
Fig. 3 shows exemplary spectrograms of two musical notes 301, 302, a succession of the two musical notes 300 and reconstructions 303, 304 of the two musical notes reconstructed by using a conventional NMF factorization;
Fig. 4 shows a schematic diagram of a method 440 for determining a dictionary of base components from an audio signal by performing a WNMF decomposition according to an implementation form;
Fig. 5 shows a schematic diagram of a method 500 for decomposing an audio signal into a dictionary of base components and reconstructing a set of audio signals according to an implementation form;
Fig. 6 shows a schematic diagram of a method 600 for decomposing an audio signal into a dictionary of base components applied to a noisy speech signal according to an implementation form;
Fig. 7 shows a schematic diagram of a de-noising system 700 according to an implementation form; and
Fig. 8 shows a schematic diagram of a device 800 for determining a dictionary of base components 804 from an audio signal 802 according to an implementation form.
DETAILED DESCRIPTION OF EMBODIMENTS OF THE INVENTION
Fig. 4 shows a schematic diagram of a method 440 for determining a dictionary of base components from an audio signal by performing a WNMF decomposition according to an implementation form.
The method 440 performs a WNMF decomposition 400 from a digital single-channel acoustic signal 401. The digital input signal 401 is input to a short-time transform module 410, which performs a windowing into short-time frames and a transform, so as to produce non-negative feature vectors 411, e.g. magnitude spectra. A buffer 420 stores these features in order to produce the matrix V 421. The WNMF module 430 then performs a decomposition of the matrix V 421, representing the magnitude spectra of the input signal. The outputs of this module are the matrices W 431 and H 432, which represent respectively the dictionary of feature bases and the temporal weights of these bases.
Fig. 5 shows a schematic diagram of a system 500 for decomposing an audio signal into a dictionary of base components and reconstructing a set of audio signals according to an implementation form. The system 500 is adapted for separating a single-channel acoustic signal into several components (here r = 3). The system 500 comprises a factorization element 400 performing the WNMF decomposition 400 as described above with respect to Fig. 4 and a reconstruction element 510. The factorization element 400 takes as input an acoustic signal 401 and estimates a dictionary of feature bases 431 and the corresponding temporal weights 432 describing the signal. The result of the decomposition is input to the reconstruction module 510, which produces several output signals 511, 512 and 513. In an implementation form, the reconstruction module 510 exploits a so-called "soft mask" approach as described in the following. For k = 1 ... 3, $W_{\cdot k}$ is the column-vector constituted by the k-th column of W and $H_{k \cdot}$ is the row-vector constituted by the k-th row of H. A magnitude spectrogram $S_k$ is calculated as:
$$S_k = \frac{W_{\cdot k} H_{k \cdot}}{W H} \otimes V,$$
where the fraction denotes the element-wise division and the symbol $\otimes$ denotes the Hadamard product, i.e. element-wise multiplication.
The three obtained matrices constitute the magnitude spectrograms of the three output signals. The time-domain signals are then obtained by a standard approach, involving an inverse Fourier transform exploiting the phase of the original complex spectrogram, followed by an overlap-add procedure.
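A minimal sketch of the soft-mask step in pure Python, taking $S_k = \frac{W_{\cdot k} H_{k \cdot}}{W H} \otimes V$, i.e. the per-component analogue of the de-noising formula given for Fig. 6. The function names, the list-of-lists matrix layout and the `eps` guard against division by zero are assumptions made for the illustration.

```python
def matmul(A, B):
    """Plain matrix product of two list-of-lists matrices."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)]
            for row in A]

def soft_mask_spectrograms(V, W, H, eps=1e-12):
    """Return [S_1, ..., S_r] with S_k = (W[:, k] H[k, :]) / (W H) * V, element-wise."""
    WH = matmul(W, H)
    n_rows, n_cols = len(V), len(V[0])
    spectrograms = []
    for k in range(len(H)):
        # Share of component k in the model W H, applied as a mask to V.
        S_k = [[W[i][k] * H[k][j] / max(WH[i][j], eps) * V[i][j]
                for j in range(n_cols)]
               for i in range(n_rows)]
        spectrograms.append(S_k)
    return spectrograms

# Usage with r = 2 components on a small 2 x 3 spectrogram.
V = [[1.0, 2.0, 0.5], [0.0, 3.0, 4.0]]
W = [[1.0, 0.5], [0.2, 2.0]]
H = [[1.0, 0.3, 0.1], [0.4, 1.5, 2.0]]
S = soft_mask_spectrograms(V, W, H)
```

Each `S[k]` has the same shape as `V`, so it can be paired with the phase of the original complex spectrogram before the inverse transform.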
In an implementation form, the output signals are then superposed in order to obtain signals corresponding to the combination of several components, for a source separation application.
In another implementation form, the magnitude spectrograms of the output signals are directly reconstructed as $S_k = W_{\cdot k} H_{k \cdot}$.
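The two reconstruction variants differ in what their components add up to. A short derivation, taking the soft mask in the form $S_k = \frac{W_{\cdot k} H_{k \cdot}}{W H} \otimes V$ (by analogy with the de-noising formula of Fig. 6) and assuming that every entry of $W H$ is strictly positive so that the element-wise division is defined:

```latex
\begin{align*}
\text{direct:}\quad
  & \sum_{k=1}^{r} W_{\cdot k} H_{k \cdot} = W H \approx V, \\
\text{soft mask:}\quad
  & \sum_{k=1}^{r} \frac{W_{\cdot k} H_{k \cdot}}{W H} \otimes V
    = \frac{\sum_{k=1}^{r} W_{\cdot k} H_{k \cdot}}{W H} \otimes V
    = \frac{W H}{W H} \otimes V = V .
\end{align*}
```

With the soft mask the components sum to the input spectrogram exactly, whereas the direct products sum to the model $W H$, which only approximates V.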
The components of the system 500 described above may also be implemented as steps of a method.
Fig. 6 shows a schematic diagram of a system 600 for decomposing an audio signal into a dictionary of base components applied to a noisy speech signal according to an implementation form. The decomposition is applied to the reduction of noise in a noisy speech signal. This system 600 involves a prior training phase 610 which comprises a factorization element 400 performing the WNMF decomposition 400 as described above with respect to Fig. 4. In the training phase, a training speech signal 601 is input to the factorization element 400, which computes a dictionary of feature bases 611 and a matrix of temporal weights 612 corresponding to the WNMF decomposition of the training signal. The system 600 further comprises a short-time transform 630, a buffer 640, a semi-supervised NMF module 650 and a reconstruction module 660. A single-channel noisy speech signal 621 undergoes a short-time transform 630 which calculates non-negative features 631, similarly to the element 410 described above with respect to Fig. 4. The buffer 640 stores these features to produce a matrix V 641. This matrix undergoes a decomposition using semi-supervised NMF 650, where the feature bases corresponding to speech are set to the values of the dictionary 611 given by the training phase. The other bases are updated by the semi-supervised NMF. The outputs of this decomposition 650 are the dictionary W 651 and the corresponding weights H 652. These matrices are used by a reconstruction element 660, which produces the de-noised speech signal.
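The semi-supervised step can be sketched as follows. This is an illustration under stated assumptions rather than the module 650 itself: it uses the standard multiplicative updates for the generalised Kullback-Leibler divergence (the text does not say which divergence module 650 uses), keeps the speech columns of W fixed, and updates only the remaining columns. All function names, the random initialisation and the `eps` guard are hypothetical.

```python
import math
import random

def matmul(A, B):
    """Matrix product of two list-of-lists matrices."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)]
            for row in A]

def transpose(A):
    return [list(col) for col in zip(*A)]

def ratio(V, WH, eps):
    """Element-wise V / (W H), with the denominator floored at eps."""
    return [[v / max(wh, eps) for v, wh in zip(v_row, wh_row)]
            for v_row, wh_row in zip(V, WH)]

def kl_divergence(V, WH, eps=1e-12):
    """Generalised Kullback-Leibler divergence D(V || WH)."""
    total = 0.0
    for v_row, wh_row in zip(V, WH):
        for v, wh in zip(v_row, wh_row):
            wh = max(wh, eps)
            total += (v * math.log(v / wh) if v > 0 else 0.0) - v + wh
    return total

def semi_supervised_nmf(V, W_speech, n_noise, n_iter=50, eps=1e-12, seed=0):
    """Factorise V ~ [W_speech | W_noise] H, keeping the speech columns fixed."""
    rng = random.Random(seed)
    m, n = len(V), len(V[0])
    r_s = len(W_speech[0])
    r = r_s + n_noise
    W = [list(W_speech[i]) + [rng.uniform(0.1, 1.0) for _ in range(n_noise)]
         for i in range(m)]
    H = [[rng.uniform(0.1, 1.0) for _ in range(n)] for _ in range(r)]
    for _ in range(n_iter):
        # Update all weights: H <- H * (W^T (V / WH)) / (W^T 1).
        num = matmul(transpose(W), ratio(V, matmul(W, H), eps))
        col_sums = [sum(W[i][k] for i in range(m)) for k in range(r)]
        H = [[H[k][j] * num[k][j] / max(col_sums[k], eps) for j in range(n)]
             for k in range(r)]
        # Update only the non-speech bases: W <- W * ((V / WH) H^T) / (1 H^T).
        num = matmul(ratio(V, matmul(W, H), eps), transpose(H))
        row_sums = [sum(H[k]) for k in range(r)]
        for i in range(m):
            for k in range(r_s, r):
                W[i][k] *= num[i][k] / max(row_sums[k], eps)
    return W, H

# Usage: two fixed speech bases plus one free base on a 3 x 4 matrix.
W_speech = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
V = [[2.0, 0.1, 1.0, 0.6], [0.2, 1.5, 1.0, 0.6], [1.3, 1.0, 1.2, 3.0]]
W, H = semi_supervised_nmf(V, W_speech, n_noise=1)
```

Only the columns of `W` beyond the trained dictionary change during the iterations, so the speech model learned in the training phase is preserved.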
The reconstruction is performed by the "soft mask" method. In an implementation form, $H_s$ is the matrix extracted from the matrix H 652 comprising the weights corresponding to the speech bases $W_s$. The magnitude spectrogram S of the de-noised output signal is calculated as:
$$S = \frac{W_s H_s}{W H} \otimes V.$$
The time-domain signal is then obtained by the same approach as described above with respect to Fig. 5 for the reconstruction element 510.
In an implementation form, the semi-supervised NMF 650 is replaced by a WNMF decomposition 400 as described above with respect to Fig. 4.
In yet another implementation form, a noise training phase similar to 610 is performed to estimate a noise feature dictionary from a training recording of noise. In this case, the dictionary W 651 is defined as the concatenation of the speech dictionary 611 and the noise dictionary, and the semi-supervised NMF 650 is replaced by a supervised NMF.
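In the supervised variant the whole dictionary is fixed and only the weights are estimated. A compact sketch, again assuming the generalised Kullback-Leibler multiplicative update for H (an assumption, since the text does not name the divergence) and using hypothetical function names:

```python
def matmul(A, B):
    """Matrix product of two list-of-lists matrices."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)]
            for row in A]

def supervised_weights(V, W_speech, W_noise, n_iter=100, eps=1e-12):
    """Estimate H for the fixed dictionary W = [W_speech | W_noise]."""
    # Concatenate the two trained dictionaries column-wise.
    W = [s_row + n_row for s_row, n_row in zip(W_speech, W_noise)]
    m, n, r = len(V), len(V[0]), len(W[0])
    W_t = [list(col) for col in zip(*W)]
    col_sums = [max(sum(W[i][k] for i in range(m)), eps) for k in range(r)]
    H = [[1.0] * n for _ in range(r)]
    for _ in range(n_iter):
        WH = matmul(W, H)
        Q = [[V[i][j] / max(WH[i][j], eps) for j in range(n)] for i in range(m)]
        num = matmul(W_t, Q)
        # H <- H * (W^T (V / WH)) / (W^T 1); W itself is never modified.
        H = [[H[k][j] * num[k][j] / col_sums[k] for j in range(n)]
             for k in range(r)]
    return W, H

# Usage: with one speech and one noise base that do not overlap,
# the weights simply recover the rows of V.
W_speech = [[1.0], [0.0]]
W_noise = [[0.0], [1.0]]
V = [[3.0, 1.0], [2.0, 4.0]]
W, H = supervised_weights(V, W_speech, W_noise)
```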
The components of the system 600 described above may also be implemented as steps of a method.
Fig. 7 shows a schematic diagram of a de-noising system 700 according to an implementation form. In a training phase, spectral components $W_{speaker}$ 713 and $W_{noise}$ 715 are estimated from clean speech $V_{speaker}$ 701 and noise $V_{noise}$ 703 separately using WNMF 707, 709. These spectral components $W_{speaker}$ 713 and $W_{noise}$ 715 are fed to a de-noising system 711, which exploits them to separate speech from noise. The noise components are estimated on the noisy speech $V_{mix}$ 705 without noise training by the de-noising system 711 which provides the de-noised speech 717.
In an implementation form, the de-noising system 711 is a supervised system. In an implementation form the de-noising system 711 is a semi-supervised system. In an implementation form the de-noising system 711 is an unsupervised NMF de-noising system where no a priori knowledge of the speech and noise models is available.
Fig. 8 shows a schematic diagram of a device 800 for determining a dictionary of base components 804 from an input signal 802 according to an implementation form. The input signal 802 is represented by an input matrix V. The device 800 comprises a buffer 803 for storing the input matrix V. The device 800 further comprises decomposing means 801 for decomposing the input matrix V into a product of a non-negative base matrix W and a non-negative weight matrix H, wherein the decomposing is constrained by a Wiener entropy measure and wherein components of the non-negative base matrix W represent the dictionary of base components 804 of the input signal 802.
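For orientation, the Wiener entropy (also called spectral flatness) of a non-negative vector is commonly defined as the ratio of its geometric mean to its arithmetic mean. The sketch below uses that common definition; the flooring of the entries at a small positive `eps`, which keeps the logarithm finite, and the function name are assumptions of this example and may differ from the exact measure used by the decomposing means 801.

```python
import math

def wiener_entropy(x, eps=1e-12):
    """Geometric mean over arithmetic mean of a non-negative vector.

    Entries are floored at eps so that zeros do not break the logarithm.
    """
    floored = [max(v, eps) for v in x]
    log_geo_mean = sum(math.log(v) for v in floored) / len(floored)
    arith_mean = sum(floored) / len(floored)
    return math.exp(log_geo_mean) / arith_mean

# Usage: a flat vector gives a value of about 1, a single active entry gives a value near 0.
flat = wiener_entropy([2.0, 2.0, 2.0, 2.0])
peaky = wiener_entropy([1.0, 0.0, 0.0, 0.0])
mixed = wiener_entropy([1.0, 2.0, 3.0, 4.0])
```

A column of H whose Wiener entropy is close to zero therefore has only a few dominant weights, which is the sense in which the constraint favours sparse activations.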
In an implementation form, the dictionary of base components represents a specific audio source of a plurality of audio sources of the audio signal. In an implementation form, the decomposing means is configured for decomposing the input matrix V by using a Non-negative Matrix Factorization. In an implementation form, the decomposing means is configured to enforce a sparse decomposition of the non-negative base matrix W. In an implementation form, the decomposing means comprises means for forming the non-negative weight matrix H such that a Wiener entropy of each column of the non-negative weight matrix H is close to zero. In an implementation form, the decomposing means comprises means for minimizing a sum of Wiener entropy values of columns of the non-negative weight matrix H by using a cost function. In an implementation form, the cost function is according to
[Equation image imgf000021_0001: cost function]
where V denotes the input matrix, W denotes the non-negative base matrix, H denotes the non-negative weight matrix with elements $H_{ij}$, the operation $\|\cdot\|_1$ denotes the vector 1-norm, the symbol $\otimes$ denotes the Hadamard product, i.e. element-wise multiplication, and the symbol $\frac{\cdot}{\cdot}$ denotes the element-wise division. In an implementation form, the device 800 comprises means for updating the cost function by one of a multiplicative update rule and a gradient descent algorithm. In an implementation form, the multiplicative update rule is according to:
[Equation image imgf000022_0001: multiplicative update rule]
where V denotes the input matrix, W denotes the non-negative base matrix, H denotes the non-negative weight matrix with elements $H_{ij}$, the symbol $\otimes$ denotes the Hadamard product, i.e. element-wise multiplication, $\mathbf{1}_{m,n}$ denotes a matrix of dimension m x n whose elements are all equal to one, A denotes a matrix of dimension r x n, defined by:
[Equation image imgf000022_0002: definition of A]
and G denotes a matrix of dimension r x n, defined by:
[Equation image imgf000022_0003: definition of G]
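To make the role of the penalty concrete, the sketch below evaluates one plausible penalised cost: a generalised Kullback-Leibler data term plus `lam` times the sum of the Wiener entropies of the columns of H. This additive form, the choice of divergence, the `eps` floor and all function names are assumptions made for illustration; the cost function and update rules of the implementation form are those given in the equations of this description and may differ.

```python
import math

def wiener_entropy(x, eps=1e-12):
    """Geometric mean over arithmetic mean, entries floored at eps."""
    floored = [max(v, eps) for v in x]
    log_geo_mean = sum(math.log(v) for v in floored) / len(floored)
    return math.exp(log_geo_mean) / (sum(floored) / len(floored))

def kl_divergence(V, WH, eps=1e-12):
    """Generalised Kullback-Leibler divergence D(V || WH)."""
    total = 0.0
    for v_row, wh_row in zip(V, WH):
        for v, wh in zip(v_row, wh_row):
            wh = max(wh, eps)
            total += (v * math.log(v / wh) if v > 0 else 0.0) - v + wh
    return total

def penalised_cost(V, W, H, lam, eps=1e-12):
    """Data term plus lam times the summed Wiener entropy of the columns of H."""
    r, n = len(H), len(H[0])
    WH = [[sum(W[i][k] * H[k][j] for k in range(r)) for j in range(n)]
          for i in range(len(W))]
    penalty = sum(wiener_entropy(col, eps) for col in zip(*H))
    return kl_divergence(V, WH, eps) + lam * penalty

# Usage: both weight matrices reproduce V exactly, but the sparse one is cheaper.
V = [[2.0, 2.0], [2.0, 2.0]]
W = [[1.0, 1.0], [1.0, 1.0]]
H_flat = [[1.0, 1.0], [1.0, 1.0]]
H_sparse = [[2.0, 2.0], [0.0, 0.0]]
cost_flat = penalised_cost(V, W, H_flat, lam=0.5)
cost_sparse = penalised_cost(V, W, H_sparse, lam=0.5)
```

Because the two candidate factorisations fit V equally well, the penalty alone decides between them, and it prefers the one in which each column of H has a single active weight.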
In an implementation form, the device 800 comprises means for reconstructing a plurality of output signals from the audio signal, the reconstruction being based on the input matrix V, the non-negative base matrix W and the non-negative weight matrix H. In an implementation form, magnitude spectrograms $S_k$ of the plurality of output signals are determined according to:
$$S_k = \frac{W_{\cdot k} H_{k \cdot}}{W H} \otimes V,$$
where V denotes the input matrix, W denotes the non-negative base matrix, H denotes the non-negative weight matrix, $W_{\cdot k}$ denotes the column-vector constituted by the k-th column of W and $H_{k \cdot}$ denotes the row-vector constituted by the k-th row of H and the symbol $\otimes$ denotes the Hadamard product, i.e. element-wise multiplication. In an implementation form, magnitude spectrograms $S_k$ of the plurality of output signals are determined by a product of a column-vector $W_{\cdot k}$ constituted by the k-th column of the non-negative base matrix W and a row-vector $H_{k \cdot}$ constituted by the k-th row of the non-negative weight matrix H.
In an implementation form, the device 800 comprises means for determining a dictionary of base components from a training speech signal according to the method 400 as described above with respect to Fig. 4; and means for forming a non-negative base matrix W of a noisy speech signal by using a semi-supervised Non-negative Matrix Factorization; and means for updating the non-negative base matrix W of the noisy speech signal with the non-negative base matrix Ws of the training speech signal. In an implementation form, the device 800 comprises means for reconstructing the speech signal based on the updated non-negative base matrix W of the noisy speech signal.
In another implementation form, the decomposing means comprises means for minimizing a weighted sum of Wiener entropy values of columns of the non-negative weight matrix H by using a cost function, the weighting parameters of the sum being the mean values of the columns of the non-negative weight matrix H. In an implementation form, the cost function is according to
[Equation image imgf000023_0001: cost function]
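In an implementation form, the device 800 comprises means for updating the cost function by a multiplicative update rule according to:
$$W = W \otimes \frac{\frac{V}{W \cdot H} \cdot H^{T}}{\mathbf{1}_{m,n} \cdot H^{T}}$$
and
[Equation garbled in the source text: update rule for H, of the form $H = H \otimes (\ldots)$, whose legible terms are $\frac{V}{W \cdot H}$, $\mathbf{1}_{m,n}$ and a sum in the denominator]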
From the foregoing, it will be apparent to those skilled in the art that a variety of methods, systems, computer programs on recording media, and the like, are provided.
The present disclosure also supports a computer program product including computer executable code or computer executable instructions that, when executed, cause at least one computer to execute the performing and computing steps described herein.
The present disclosure also supports a system configured to execute the performing and computing steps described herein. Many alternatives, modifications, and variations will be apparent to those skilled in the art in light of the above teachings. Of course, those skilled in the art readily recognize that there are numerous applications of the invention beyond those described herein. While the present invention has been described with reference to one or more particular embodiments, those skilled in the art recognize that many changes may be made thereto without departing from the spirit and scope of the present invention. It is therefore to be understood that within the scope of the appended claims and their equivalents, the invention may be practiced otherwise than as specifically described herein.

Claims

1. A method (440) for determining a dictionary of base components (431) from an audio signal (401), the audio signal (401) being represented by an input matrix (V) whose columns comprise features of the audio signal (401) at different instances in time, the method (440) comprising: decomposing (430) the input matrix (V) into a product of a non-negative base matrix (W) and a non-negative weight matrix (H), the decomposing (430) being constrained by a Wiener entropy measure with respect to elements of the non-negative weight matrix (H), wherein components of the non-negative base matrix (W) represent the dictionary of base components (431) of the audio signal (401).
2. The method (440) of claim 1, wherein the dictionary of base components (431) represents a specific audio source of a plurality of audio sources of the audio signal (401).
3. The method (440) of claim 1 or claim 2, wherein the decomposing (430) uses a Non-negative Matrix Factorization.
4. The method (440) of one of the preceding claims, wherein the decomposing (430) constrained by the Wiener entropy measure is configured to enforce a sparse decomposition of the non-negative base matrix (W).
5. The method (440) of one of the preceding claims, wherein the decomposing (430) constrained by the Wiener entropy measure comprises:
forming the non-negative weight matrix (H) such that a Wiener entropy of each column of the non-negative weight matrix (H) is close to zero.
6. The method (440) of one of the preceding claims, wherein the decomposing (430) constrained by the Wiener entropy measure comprises:
minimizing a weighted sum of Wiener entropy values of columns of the non-negative weight matrix (H) by using a cost function.
7. The method (440) of claim 6, wherein the cost function is according to
[Equation image imgf000026_0001: cost function]
where V denotes the input matrix, W denotes the non-negative base matrix, H denotes the non-negative weight matrix with elements $H_{ij}$, the operation $\|\cdot\|_1$ denotes the vector 1-norm, the symbol $\otimes$ denotes the Hadamard product, i.e. element-wise multiplication, the symbol $\frac{\cdot}{\cdot}$ denotes the element-wise division, λ is a real non-negative parameter, $\omega_j$ denote non-negative weighting parameters which can depend on the matrix H, the symbol ε denotes a positive real number and the operator $\lfloor \cdot \rfloor_{\varepsilon}$ is defined as
$\lfloor x \rfloor_{\varepsilon} = \max(x, \varepsilon)$.
8. The method (440) of claim 6 or claim 7, comprising: updating the cost function by one of a multiplicative update rule and a gradient descent algorithm.
9. The method (440) of claim 8, wherein the multiplicative update rule is according to:
[Equation image imgf000026_0002: multiplicative update rule]
where V denotes the input matrix, W denotes the non-negative base matrix, H denotes the non-negative weight matrix with elements $H_{ij}$, λ is a real non-negative parameter, the symbol $\otimes$ denotes the Hadamard product, i.e. element-wise multiplication, $\mathbf{1}_{m,n}$ denotes a matrix of dimension m x n whose elements are all equal to one, A denotes a matrix of dimension r x n, defined by:
[Equation garbled in the source text: definition of A, containing a sum over k = 1, ..., r]
and G denotes a matrix of dimension r x n, defined by:
[Equation image imgf000026_0003: definition of G]
where the symbol ε denotes a positive real number and the operator $\lfloor \cdot \rfloor_{\varepsilon}$ is defined as
$\lfloor x \rfloor_{\varepsilon} = \max(x, \varepsilon)$.
10. The method (440) of claim 8, wherein the multiplicative update rule is according to:
[Equation image imgf000027_0001: multiplicative update rule]
where V denotes the input matrix, W denotes the non-negative base matrix, H denotes the non-negative weight matrix with elements $H_{ij}$, λ is a real non-negative parameter, the symbol $\otimes$ denotes the Hadamard product, i.e. element-wise multiplication, $\mathbf{1}_{m,n}$ denotes a matrix of dimension m x n whose elements are all equal to one, G denotes a matrix of dimension r x n, defined by:
[Equation not present in the source text: definition of G]
where the symbol ε denotes a positive real number and the operator $\lfloor \cdot \rfloor_{\varepsilon}$ is defined as
$\lfloor x \rfloor_{\varepsilon} = \max(x, \varepsilon)$.
11. The method (500) of one of the preceding claims, comprising:
reconstructing (510) a plurality of output signals (511, 512, 513) from the audio signal (401), the reconstruction (510) being based on the input matrix (V), the non-negative base matrix (W) and the non-negative weight matrix (H).
12. The method (500) of claim 11, wherein magnitude spectrograms $S_k$ of the plurality of output signals (511, 512, 513) are determined according to:
$$S_k = \frac{W_{\cdot k} H_{k \cdot}}{W H} \otimes V,$$
where V denotes the input matrix, W denotes the non-negative base matrix, H denotes the non-negative weight matrix, $W_{\cdot k}$ denotes the column-vector constituted by the k-th column of W and $H_{k \cdot}$ denotes the row-vector constituted by the k-th row of H and the symbol $\otimes$ denotes the Hadamard product, i.e. element-wise multiplication; or wherein magnitude spectrograms $S_k$ of the plurality of output signals (511, 512, 513) are determined by a product of a column-vector $W_{\cdot k}$ constituted by the k-th column of the non-negative base matrix W and a row-vector $H_{k \cdot}$ constituted by the k-th row of the non-negative weight matrix H.
13. The method (500) of claim 12, comprising:
constructing output spectrograms by summing several of the magnitude spectrograms $S_k$ of the plurality of output signals (511, 512, 513).
14. The method (600) of one of the preceding claims, comprising: determining (610) a dictionary of base components (611) from a training speech signal (601) according to the method of one of claims 1 to 13; and forming (651) a non-negative base matrix (W) of a noisy speech signal (621) by extending the non-negative base matrix ($W_s$) of the training speech signal (601); and updating the non-negative base matrix (W) of the noisy speech signal (621) by using a semi-supervised Non-negative Matrix Factorization.
15. Device (800) for determining a dictionary of base components (804) from an input signal (802) represented by an input matrix (V), the device (800) comprising:
a buffer (803) for storing the input matrix (V); and means (801) for decomposing the input matrix (V) into a product of a non-negative base matrix (W) and a non-negative weight matrix (H), wherein the decomposing is constrained by a Wiener entropy measure and wherein components of the non-negative base matrix (W) represent the dictionary of base components (804) of the input signal (802).
PCT/EP2012/073149 2012-11-21 2012-11-21 Method for determining a dictionary of base components from an audio signal WO2014079484A1 (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
EP12794680.4A EP2912660B1 (en) 2012-11-21 2012-11-21 Method for determining a dictionary of base components from an audio signal
PCT/EP2012/073149 WO2014079484A1 (en) 2012-11-21 2012-11-21 Method for determining a dictionary of base components from an audio signal

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/EP2012/073149 WO2014079484A1 (en) 2012-11-21 2012-11-21 Method for determining a dictionary of base components from an audio signal

Publications (1)

Publication Number Publication Date
WO2014079484A1 true WO2014079484A1 (en) 2014-05-30

Family

ID=47278271

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/EP2012/073149 WO2014079484A1 (en) 2012-11-21 2012-11-21 Method for determining a dictionary of base components from an audio signal

Country Status (2)

Country Link
EP (1) EP2912660B1 (en)
WO (1) WO2014079484A1 (en)

Also Published As

Publication number Publication date
EP2912660A1 (en) 2015-09-02
EP2912660B1 (en) 2017-01-11

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 12794680

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

REEP Request for entry into the european phase

Ref document number: 2012794680

Country of ref document: EP

WWE Wipo information: entry into national phase

Ref document number: 2012794680

Country of ref document: EP