EP2912660B1 - Method for determining a dictionary of base components from an audio signal - Google Patents
- Publication number
- EP2912660B1 (application EP12794680.4A)
- Authority
- EP
- European Patent Office
- Prior art keywords
- matrix
- denotes
- negative
- symbol
- base
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L21/00—Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
- G10L21/02—Speech enhancement, e.g. noise reduction or echo cancellation
- G10L21/0208—Noise filtering
- G10L21/0216—Noise filtering characterised by the method used for estimating noise
- G10L21/0232—Processing in the frequency domain
Definitions
- the present invention relates to a method and a device for determining a dictionary of base components from an input signal.
- the present invention relates to the processing of an acoustic signal input for the estimation of a feature vector dictionary for describing acoustic sources.
- Audio signals are composed of a plurality of individual sound sources.
- Music recordings for example, comprise most of the time several instruments.
- the signal often comprises, in addition to the speech itself, other interfering sounds which are recorded by the same microphone.
- interfering sounds can be for example ambient noise or other people talking in the same room.
- Non-negative Matrix Factorisation (NMF) was first proposed by Paatero: "Least Squares Formulation of Robust Non-Negative Factor Analysis", Chemometrics and Intelligent Laboratory Systems 37, pp. 23-35, 1997, and has since been applied successfully to a wide variety of applications.
- this technique has become a standard method for audio source separation, where an input audio signal is to be separated into several signals corresponding to the different acoustic sources. It is based on a decomposition of the power spectrogram of the mixture into a non-negative combination of several spectral bases, each associated to one of the present sources.
- Non-negative Matrix Factorization (NMF) methods have been used in that context with relatively good results.
- the non-negative constraint which is inherent to this technique complies with the structure of the audio spectrograms, and can allow for the decomposition of a sound into some meaningful components.
- These components form a dictionary of spectral bases which describe the signal.
- the decomposition typically aims to estimate spectral bases corresponding to different "parts" of the spectrogram, e.g. different sounds or speakers. A separation of these parts can then be performed by a partial reconstruction of the signal, considering only the wanted components.
- This technique has been applied by C. Joder, F. Weninger, F. Eyben, D. Virette, B. Schuller "Real-time Speech Separation by Semi-Supervised Nonnegative Matrix Factorization", Proc. International Conference on Latent Variable Analysis and Signal Separation, March 2012 , in particular, to the separation of a target speaker from noisy recordings.
- the basic principle of NMF-based audio processing 100 as schematically illustrated in Fig. 1 is to find a locally optimal factorization of a short-time magnitude spectrogram V 103 of an audio signal 101 into two factors W and H, of which the first one W represents the spectra of the events occurring in the signal 101 and the second one H their activation over time.
- the first factor W describes the component spectra of the source model 109.
- the second factor H describes the activations 107 of the signal spectrogram 103 of the audio signal 101.
- the first factor W and the second factor H are matched with the short-time magnitude spectrogram V 103 of the audio signal 101 by an optimization procedure.
- the source model 109 is pre-defined when applying supervised NMF and a joint estimation is applied for the source model 109 when using unsupervised NMF.
- the source signal or signals 113 can be derived from the source spectrogram 111.
- the conventional formulation of NMF is defined as follows.
- the matrix V is an $m \times n$ matrix of non-negative real values.
- the goal is to approximate this matrix by the product of two other non-negative matrices $W \in \mathbb{R}_+^{m \times r}$ and $H \in \mathbb{R}_+^{r \times n}$, where $r \ll m, n$ holds.
- a cost function is minimized, measuring the so-called "reconstruction error" $D(V, W \cdot H)$, where $D$ denotes some distance or divergence function.
- the input matrix V is given by the succession of short-time magnitude (or power) spectra of the input signal, each column of the matrix containing the values of the spectrum computed at a specific instance in time.
- these features are given by a short-time Fourier transform of the input signal, after some window function is applied to it.
- This matrix contains only non-negative values, because of the kind of features used.
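The construction of the input matrix V described above can be sketched as follows. This is a minimal illustration, not the patent's implementation: the function name `magnitude_spectrogram`, the Hann window, the frame length of 512 samples and the hop of 256 samples are all assumptions chosen for the example.

```python
import numpy as np

def magnitude_spectrogram(signal, frame_len=512, hop=256):
    """Return V with one short-time magnitude spectrum per column.

    frame_len, hop and the Hann window are illustrative assumptions.
    """
    window = np.hanning(frame_len)
    n_frames = 1 + (len(signal) - frame_len) // hop
    # Cut the signal into overlapping frames and apply the window to each.
    frames = np.stack(
        [signal[t * hop : t * hop + frame_len] * window for t in range(n_frames)],
        axis=1,
    )
    # The magnitude of the Fourier transform is non-negative by construction.
    return np.abs(np.fft.rfft(frames, axis=0))

# Usage: one second of a 440 Hz tone sampled at 8 kHz.
sr = 8000
t = np.arange(sr) / sr
x = np.sin(2 * np.pi * 440 * t)
V = magnitude_spectrogram(x)  # rows are frequency bins, columns are time frames
```

Each column of `V` corresponds to one instance in time, which matches the layout the factorization expects.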
- the values of the matrices W and H which are estimated by the NMF are initialized by a random number generator and then updated by an iterative process.
- the initial values can also be set according to some prior knowledge of the signal.
- several decompositions are performed on successive mid-term windows of the signal as shown by C. Joder, F. Weninger, F. Eyben, D. Virette, B. Schuller: "Real-time Speech Separation by Semi-Supervised Nonnegative Matrix Factorization", Proc. of LVA/ICA 2012, Springer, p. 322-329 . Then, a faster convergence can be obtained by initializing the matrices according to the output of the previous decomposition.
- some of the spectral bases can be set to a constant value, fixed by prior learning. This can be beneficial if one of the sources is known and sufficient data is available to estimate the characteristic spectra of this source. In this case, the corresponding columns of W are not updated.
- the method in which the matrix W is entirely constant during the decomposition and the method in which the matrix W is entirely updated are called supervised NMF and unsupervised NMF, respectively. In the case where only a part of the spectral bases is updated, the method is called semi-supervised NMF.
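The iterative estimation and the three supervision modes can be illustrated with a short sketch. It assumes the well-known multiplicative update rules of Lee and Seung for the Kullback-Leibler divergence, which the text above does not prescribe; the function name `nmf_kl` and its parameters are hypothetical.

```python
import numpy as np

def nmf_kl(V, r, W_fixed=None, n_iter=200, eps=1e-12, seed=0):
    """Approximate V by W @ H using KL-divergence multiplicative updates.

    W_fixed (optional) holds pre-trained bases that stay constant:
    no fixed columns = unsupervised, all columns fixed = supervised,
    some columns fixed = semi-supervised.
    """
    rng = np.random.default_rng(seed)
    m, n = V.shape
    W = rng.random((m, r)) + eps   # random initialization
    H = rng.random((r, n)) + eps
    k_fixed = 0
    if W_fixed is not None:
        k_fixed = W_fixed.shape[1]
        W[:, :k_fixed] = W_fixed
    ones = np.ones_like(V)
    for _ in range(n_iter):
        H *= (W.T @ (V / (W @ H + eps))) / (W.T @ ones + eps)
        W_new = W * ((V / (W @ H + eps)) @ H.T) / (ones @ H.T + eps)
        W[:, k_fixed:] = W_new[:, k_fixed:]   # fixed columns are not updated
    return W, H

# Usage: semi-supervised run where the first basis is known in advance.
rng = np.random.default_rng(1)
W0 = rng.random((20, 2))
H0 = rng.random((2, 30))
V = W0 @ H0
W, H = nmf_kl(V, r=2, W_fixed=W0[:, :1])
rel_err = np.linalg.norm(V - W @ H) / np.linalg.norm(V)
```

Initializing `W` and `H` from a previous decomposition instead of random values, as mentioned above for successive mid-term windows, would only change the first lines of the function.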
- the NMF decomposition is illustrated in Fig. 2 by a simple example.
- the figure represents a spectrogram 201 represented by the matrix V, a matrix of two spectral bases 202 represented by the matrix W and the corresponding temporal weights 203 represented by the matrix H.
- the greyscale of the spectrogram 201 represents the amplitude of the Fourier coefficients.
- the spectrogram defines an acoustic scene which can be described as the superposition of two so called "atomic sounds".
- the matrices W and H as defined in Fig. 2 can be obtained.
- Each column of W can be interpreted as a basis function for the spectra contained in V, when weighted with the corresponding values of H.
- Since the spectral bases are non-negative, they correspond to proper magnitude spectra, which can then be used to reconstruct each of the so-called "atomic sounds".
- the example of Fig. 2 is simplistic; however, the NMF method can provide satisfactory results in separating different sound sources from realistic recordings. In these cases, a larger value of the order of decomposition r is used. Then, each "component", i.e. the product of one spectral basis with the corresponding temporal weights, is assigned to a specific source. The estimated spectrogram of each source is finally obtained by the sum of all the components attributed to the source.
- the estimation of the dictionary of spectral bases often suffers from some inaccuracies and results in components representing several sources at the same time. Indeed, this method minimizes a reconstruction error between the original input and the decomposition, without taking into account the structure of the individual signals. As a result, the estimated bases can capture some unstructured so called "building blocks" which can be used to reconstruct several sources, whereas the goal is to match each basis to a specific source.
- several modifications of the standard NMF method have been proposed, which impose a structure by favoring some properties of the decomposition, such as temporal continuity or component sparsity.
- the sparsity property relates to the fact that the proportion of elements with non-zero value or, more generally, of non-negligible value is very small. In particular, the sparsity of the component activations is often enforced. This property relates to the fact that few components are active at the same time.
- A simple example of the usefulness of a sparsity constraint is represented in Fig. 3.
- the spectrogram 300 corresponds to the succession of two musical notes, the second one having a pitch one octave higher than the first one.
- the plots 301 and 302 are the respective spectrograms of these two notes.
- "Audio source separation informed by redundancy with greedy multiscale decompositions" (Manuel Moussallam et al., 2012-08-27, pages 2644-2648, XP032254797) describes an algorithm for audio source separation of repeated musical patterns.
- a Time-Frequency mask usually based on the power spectral density of the mixtures is constructed for the repeating musical background and the separation is performed by means of Wiener filtering relative to this mask.
- the invention is based on the finding that sound source estimation is improved when a Wiener entropy-constrained Non-negative Matrix Factorization (WNMF) is used for the factorization which identifies different components of an input signal.
- the features are decomposed into a sparse combination of non-negative feature bases.
- the decomposition can be used to separate the input signal into several output signals corresponding to different components.
- the obtained dictionary of feature bases can also be used to separate the corresponding components from another signal, by decomposing this other signal according to the elements of the dictionary.
- aspects of the invention provide a novel method for enforcing a sparse decomposition, resulting in a dictionary of spectral bases which is more characteristic of the different parts of the signal, as will be presented in the following.
- the invention relates to a method for determining a dictionary of base components from an audio signal, the audio signal being represented by an input matrix whose columns comprise features of the audio signal at different instances in time, the method comprising: decomposing the input matrix into a product of a non-negative base matrix and a non-negative weight matrix, the decomposing being constrained by a Wiener entropy measure with respect to elements of the non-negative weight matrix, wherein components of the non-negative base matrix represent the dictionary of base components of the audio signal.
- The decomposing being constrained by a Wiener entropy measure, also denoted as the Wiener entropy constraint, is a new constraint providing a novel method for enforcing sparsity.
- Wiener entropy, or spectral flatness, measures how flat a vector is. It is used as a sparsity penalty for NMF. By using that measure, meaningful spectral patterns are estimated and speech separation quality, measured by both signal-based and perceptual criteria, is improved. Compared to standard NMF, the complexity increase is limited.
- the Wiener entropy constrained NMF (WNMF) can be integrated into any system using NMF.
- the dictionary of base components represents a specific audio source of a plurality of audio sources of the audio signal.
- a specific audio source of a multi-source audio signal can be extracted from the noisy multi-source audio signal.
- the decomposing is performed by using a Non-negative Matrix Factorization.
- Wiener entropy measure can thus be adapted to a standard NMF factorization with only little overhead thereby saving computational complexity.
- the decomposing constrained by the Wiener entropy measure is configured to enforce a sparse decomposition of the non-negative base matrix.
- the decomposing constrained by the Wiener entropy measure comprises: forming the non-negative weight matrix such that a Wiener entropy of each column of the non-negative weight matrix is close to zero.
- the decomposing constrained by the Wiener entropy measure comprises: minimizing a weighted sum of Wiener entropy values of columns of the non-negative weight matrix by using a cost function.
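- the Wiener entropy of the j-th column of H is $\left(\prod_{i=1}^{r} H_{i,j}\right)^{1/r} \Big/ \left(\frac{1}{r}\sum_{i=1}^{r} H_{i,j}\right)$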
- V denotes the input matrix
- W denotes the non-negative base matrix
- H denotes the non-negative weight matrix with elements $H_{i,j}$
- the operation $\|\cdot\|_1$ denotes the vector 1-norm
- the symbol $\otimes$ denotes the Hadamard product, i.e. the element-wise product
- Such a cost function provides an efficient reconstruction of the original signal.
- the method comprises: updating the cost function by one of a multiplicative update rule and a gradient descent algorithm.
- Multiplicative update rules are easy to implement and gradient descent algorithms converge to the locally optimum solution.
- $\lambda$ is a real non-negative parameter
- Such a cost function provides an efficient reconstruction of the original signal and a homogeneous estimation of the components, regardless of the amplitude of the original signal.
- the method comprises: reconstructing a plurality of output signals from the audio signal, the reconstruction being based on the input matrix, the non-negative base matrix and the non-negative weight matrix.
- the reconstructed signals are noise-reduced and they indicate the source components of the original audio signal.
- the output signals can be superposed in order to obtain signals corresponding to the combination of several components, for source separation applications.
- magnitude spectrograms $S_k$ of the plurality of output signals are determined by a product of a column-vector $W_{:,k}$ constituted by the k-th column of the non-negative base matrix W and a row-vector $H_{k,:}$ constituted by the k-th row of the non-negative weight matrix H.
- the method comprises: constructing output spectrograms by summing several of the magnitude spectrograms $S_k$ of the plurality of output signals.
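The product of the k-th column of W with the k-th row of H is an outer product, and summing the resulting matrices gives the spectrogram of a group of components. A minimal sketch with small hand-picked matrices (the values and the grouping are assumptions for illustration only):

```python
import numpy as np

# Toy dictionary with 3 frequency bins and 2 bases, and weights for 3 time frames.
W = np.array([[1.0, 0.0],
              [0.5, 0.5],
              [0.0, 1.0]])
H = np.array([[1.0, 0.0, 2.0],
              [0.0, 3.0, 1.0]])

# One spectrogram per component: outer product of column k of W and row k of H.
S = [np.outer(W[:, k], H[k, :]) for k in range(W.shape[1])]

# An output spectrogram is the sum of the components assigned to one source.
# Here both components are (arbitrarily) assigned to the same source.
source_spectrogram = S[0] + S[1]
```

Summing all components recovers the full model `W @ H`, so assigning every component to exactly one source splits the model spectrogram without loss.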
- the method comprises: determining a dictionary of base components from a training speech signal according to the first aspect as such or according to any of the preceding implementation forms of the first aspect, and forming a non-negative base matrix of a noisy speech signal by extending the non-negative base matrix of the training speech signal; and updating the non-negative base matrix of the noisy speech signal by using a semi-supervised Non-negative Matrix Factorization.
- the method comprises: reconstructing the speech signal based on the updated non-negative base matrix of the noisy speech signal.
- the invention relates to a device for determining a dictionary of base components from an input signal represented by an input matrix, the device comprising: a buffer for storing the input matrix; and means for decomposing the input matrix into a product of a non-negative base matrix and a non-negative weight matrix, wherein the decomposing is constrained by a Wiener entropy measure and wherein components of the non-negative base matrix represent the dictionary of base components of the input signal.
- By using the Wiener entropy measure, meaningful spectral patterns are estimated and thus speech separation quality, measured by both signal-based and perceptual criteria, is improved. The complexity increase is not significant when compared to standard NMF implementations.
- the Wiener entropy constrained NMF can be integrated into any device using NMF.
- the methods and systems described herein may be implemented as software in a Digital Signal Processor (DSP), in a micro-controller or in any other side-processor or as hardware circuit within an application specific integrated circuit (ASIC).
- the means for decomposing the input matrix may be implemented as software in a Digital Signal Processor (DSP), in a micro-controller or in any other side-processor or as a hardware unit, e.g. within an application specific integrated circuit (ASIC).
- the invention can be implemented in digital electronic circuitry, or in computer hardware, firmware, software, or in combinations thereof, e.g. in available hardware or software of conventional mobile devices or hands-free communication systems or in new hardware or software dedicated for processing the methods described herein after.
- aspects of the invention provide a method for decomposing a signal according to a Wiener entropy-constrained Non-negative Matrix Factorization (WNMF).
- the Wiener entropy, also called "spectral flatness", of a set of non-negative values is the ratio between the geometric mean and the arithmetic mean of these values.
- the Wiener entropy is always between zero and one, and it is equal to one if and only if all the values in the set are equal.
- a large value of the Wiener entropy corresponds to a "flat" plot and a small value corresponds to a "peaky" plot.
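The definition above translates directly into a few lines of code. The sketch below is illustrative; the function name `wiener_entropy` and the small constant `eps`, added to avoid taking the logarithm of zero, are assumptions.

```python
import numpy as np

def wiener_entropy(x, eps=1e-12):
    """Ratio of the geometric mean to the arithmetic mean of non-negative values."""
    x = np.asarray(x, dtype=float) + eps          # eps guards against log(0)
    geometric_mean = np.exp(np.mean(np.log(x)))   # computed in the log domain
    arithmetic_mean = np.mean(x)
    return geometric_mean / arithmetic_mean

flat = wiener_entropy([1.0, 1.0, 1.0, 1.0])       # all values equal
peaky = wiener_entropy([4.0, 1e-6, 1e-6, 1e-6])   # one dominant value
```

The flat vector yields a value of one (up to rounding), while the peaky vector yields a value close to zero, which is why a small Wiener entropy of the columns of H indicates a sparse set of activations.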
- the penalty term used to measure the sparsity of the decomposition is given by a weighted sum of the Wiener entropy values of the columns of the matrix H.
- $\lambda$ is a real non-negative parameter and the parameters $\omega_j$ are non-negative weighting parameters, which can depend on the matrix H.
- the weighting parameters $\omega_j$ used in the above defined cost function are all set to one.
- the optimization process stops when convergence is observed or when a sufficient number of iterations have been performed.
- gradient-descent algorithms are applied instead of these multiplicative updates.
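A penalized cost of this kind can be evaluated as sketched below. This is an illustration under stated assumptions, not the patent's exact formula: the reconstruction error is taken to be the generalized Kullback-Leibler divergence, the sparsity weight is called `lam`, the weighting parameters are passed as `weights`, and all function names are hypothetical. Only the evaluation of the cost is shown; the update rules are not reproduced here.

```python
import numpy as np

def column_wiener_entropy(H, eps=1e-12):
    """Wiener entropy of every column of H (geometric mean / arithmetic mean)."""
    H = H + eps
    return np.exp(np.mean(np.log(H), axis=0)) / np.mean(H, axis=0)

def kl_divergence(V, V_hat, eps=1e-12):
    """Generalized KL divergence, assumed here as the reconstruction error."""
    return np.sum(V * np.log((V + eps) / (V_hat + eps)) - V + V_hat)

def wnmf_cost(V, W, H, lam=0.1, weights=None):
    """Reconstruction error plus a weighted sum of column Wiener entropies."""
    if weights is None:
        weights = np.ones(H.shape[1])   # the setting where all weights equal one
    penalty = np.sum(weights * column_wiener_entropy(H))
    return kl_divergence(V, W @ H) + lam * penalty

# Usage: two exact factorizations, one with flat and one with sparse activations.
W = np.eye(2)
H_flat = np.ones((2, 2))
H_sparse = np.eye(2)
cost_flat = wnmf_cost(W @ H_flat, W, H_flat)
cost_sparse = wnmf_cost(W @ H_sparse, W, H_sparse)
```

Both factorizations reconstruct their input exactly, so the difference in cost comes only from the penalty, which favours the sparse activations.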
- the weighting parameters $\omega_j$ used in the above defined cost function are all set to the mean value of the corresponding columns of the matrix H.
- the sparsity penalty applied to each instance in time is approximately proportional to the amplitude of the input signal at the corresponding instance in time.
- the optimization of this cost function is performed by multiplicative update rules.
- Another advantage of this setting is that the complexity of the parameter updates is reduced compared to the previous implementation.
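A short derivation suggests why this setting simplifies the computation. It assumes the penalty has the form of a weighted sum of column-wise Wiener entropies scaled by a non-negative parameter, written here as $\lambda$, with weights written as $\omega_j$; both symbols are notation assumed for this sketch. When each weight equals the arithmetic mean of its column, the denominator of the Wiener entropy cancels and only the geometric mean remains:

```latex
\begin{align*}
\lambda \sum_{j=1}^{n} \omega_j \,
  \frac{\left(\prod_{i=1}^{r} H_{i,j}\right)^{1/r}}
       {\frac{1}{r}\sum_{i=1}^{r} H_{i,j}}
\quad\text{with}\quad
\omega_j = \frac{1}{r}\sum_{i=1}^{r} H_{i,j}
\quad\Longrightarrow\quad
\lambda \sum_{j=1}^{n} \left(\prod_{i=1}^{r} H_{i,j}\right)^{1/r}
\end{align*}
```

Since the geometric mean of a column scales linearly with the amplitude of that column, the penalty at each instance in time grows with the signal level, in line with the proportionality noted above.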
- Fig. 4 shows a schematic diagram of a method 440 for determining a dictionary of base components from an audio signal by performing a WNMF decomposition according to an implementation form.
- the method 440 performs a WNMF decomposition 400 from a digital single-channel acoustic signal 401.
- the digital input signal 401 is input to a short-time transform module 410, which performs a windowing into short-time frames and a transform, so as to produce non-negative feature vectors 411, e.g. magnitude spectra.
- a buffer 420 stores these features in order to produce the matrix V 421.
- the WNMF module 430 then performs a decomposition of the matrix V 421, representing the magnitude spectra of the input signal.
- the outputs of this module are the matrices W 431 and H 432 which represent respectively the dictionary of feature bases and the temporal weights of these bases.
- Fig. 5 shows a schematic diagram of a system 500 for decomposing an audio signal into a dictionary of base components and reconstructing a set of audio signals according to an implementation form.
- the system 500 comprises a factorization element 400 performing the WNMF decomposition 400 as described above with respect to Fig. 4 and a reconstruction element 510.
- the factorization element 400 takes as input an acoustic signal 401 and estimates a dictionary of feature bases 431 and the corresponding temporal weights 432 describing the signal.
- the result of the decomposition is input to the reconstruction module 510, which produces several output signals 511, 512 and 513.
- the reconstruction module 510 exploits a so-called "soft mask" approach as described in the following.
- For $k = 1, \ldots, 3$, $W_{:,k}$ is the column-vector constituted by the k-th column of W and $H_{k,:}$ is the row-vector constituted by the k-th row of H.
- the three obtained matrices constitute the magnitude spectrograms of the three output signals.
- the time-domain signals are then obtained by a standard approach, involving an inverse Fourier transform exploiting the phase of the original complex spectrogram, followed by an overlap-add procedure.
- the output signals are then superposed in order to obtain signals corresponding to the combination of several components, for a source separation application.
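The soft-mask reconstruction can be sketched as follows. The exact formula is not reproduced in the text above, so the sketch assumes the commonly used form in which each component's share of the model `W @ H` is applied as a mask to the original complex spectrogram; the function name `soft_mask_components` is hypothetical, and the inverse transform with overlap-add is omitted.

```python
import numpy as np

def soft_mask_components(X, W, H, eps=1e-12):
    """Split the complex spectrogram X into one masked spectrogram per component.

    Assumed mask for component k: (W[:, k] H[k, :]) / (W @ H), element-wise.
    Multiplying X by a real mask scales magnitudes and keeps the original phase.
    """
    model = W @ H + eps
    outputs = []
    for k in range(W.shape[1]):
        mask = np.outer(W[:, k], H[k, :]) / model   # values between 0 and 1
        outputs.append(mask * X)
    return outputs

# Usage with random data standing in for a real spectrogram and decomposition.
rng = np.random.default_rng(0)
X = rng.standard_normal((5, 4)) + 1j * rng.standard_normal((5, 4))
W = rng.random((5, 3)) + 0.1
H = rng.random((3, 4)) + 0.1
Y = soft_mask_components(X, W, H)
```

Because the masks sum to one in every time-frequency bin, superposing all outputs recovers the original spectrogram, and superposing a subset gives the spectrogram of the corresponding combination of components.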
- the components of the system 500 described above may also be implemented as steps of a method.
- Fig. 6 shows a schematic diagram of a system 600 for decomposing an audio signal into a dictionary of base components applied to a noisy speech signal according to an implementation form.
- the decomposition is applied to the reduction of noise in a noisy speech signal.
- This system 600 involves a prior training phase 610 which comprises a factorization element 400 performing the WNMF decomposition 400 as described above with respect to Fig. 4 .
- a training speech signal 601 is input to the factorization element 400, which computes a dictionary of feature bases 611 and a matrix of temporal weights 612 corresponding to the WNMF decomposition of the training signal.
- the system 600 further comprises a short-time transform 630, a buffer 640, a semi-supervised NMF module 650 and a reconstruction module 660.
- a single-channel noisy speech signal 621 undergoes a short-time transform 630 which calculates non-negative features 631, similarly to the element 410 described above with respect to Fig. 4 .
- the buffer 640 stores these features to produce a matrix V 641.
- This matrix undergoes a decomposition using semi-supervised NMF 650, where the feature bases corresponding to speech are set to the values of the dictionary 611 given by the training phase.
- the other bases are updated by the semi-supervised NMF.
- the outputs of this decomposition 650 are the dictionary W 651 and the corresponding weights H 652. These matrices are used by a reconstruction element 660, which produces the de-noised speech signal.
- H' is the matrix extracted from the matrix H 652 comprising the weights corresponding to the speech bases $W_s$.
- the time-domain signal is then obtained by the same approach as described above with respect to Fig. 5 for the reconstruction element 510.
- the semi-supervised NMF 650 is replaced by a WNMF decomposition 400 as described above with respect to Fig. 4 .
- a noise training phase similar to 610 is performed to estimate a noise feature dictionary from a training recording of noise.
- the dictionary W 651 is defined as the concatenation of the speech dictionary 611 and the noise dictionary, and the semi-supervised NMF 650 is replaced by a supervised NMF.
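The supervised variant with concatenated dictionaries can be sketched as below. The sketch assumes Kullback-Leibler multiplicative updates for H and a soft-mask estimate of the speech spectrogram, neither of which is prescribed by the text above; the function name `supervised_denoise` and the toy dictionaries are hypothetical.

```python
import numpy as np

def supervised_denoise(V, W_speech, W_noise, n_iter=100, eps=1e-12, seed=0):
    """Estimate the speech magnitude spectrogram with both dictionaries fixed."""
    W = np.hstack([W_speech, W_noise])          # concatenated dictionary
    rng = np.random.default_rng(seed)
    H = rng.random((W.shape[1], V.shape[1])) + eps
    col_sums = W.sum(axis=0)[:, None] + eps     # equals W.T @ ones, per row of H
    for _ in range(n_iter):
        # Only the weights are updated: W stays constant (supervised NMF).
        H *= (W.T @ (V / (W @ H + eps))) / col_sums
    r_s = W_speech.shape[1]
    H_speech = H[:r_s, :]                       # weights of the speech bases
    mask = (W_speech @ H_speech) / (W @ H + eps)
    return mask * V, H

# Usage: speech occupies bins 0-1 and noise occupies bins 2-3 in this toy case.
W_speech = np.array([[1.0], [1.0], [0.0], [0.0]])
W_noise = np.array([[0.0], [0.0], [1.0], [1.0]])
H_true = np.array([[1.0, 2.0, 0.5],
                   [0.3, 0.3, 0.3]])
V = np.hstack([W_speech, W_noise]) @ H_true
speech_est, H = supervised_denoise(V, W_speech, W_noise)
```

In this toy case the two dictionaries do not overlap in frequency, so the estimate equals the speech part of the mixture; with realistic, overlapping dictionaries the mask distributes each bin between speech and noise.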
- the components of the system 600 described above may also be implemented as steps of a method.
- Fig. 7 shows a schematic diagram of a de-noising system 700 according to an implementation form.
- spectral components $W_{speaker}$ 713 and $W_{noise}$ 715 are estimated from clean speech $V_{speaker}$ 701 and noise $V_{noise}$ 703 separately using WNMF 707, 709. These spectral components $W_{speaker}$ 713 and $W_{noise}$ 715 are fed to a de-noising system 711, which exploits them to separate speech from noise.
- the noise components are estimated on the noisy speech $V_{mix}$ 705 without noise training by the de-noising system 711, which provides the de-noised speech 717.
- the de-noising system 711 is a supervised system. In an implementation form the de-noising system 711 is a semi-supervised system. In an implementation form the de-noising system 711 is an unsupervised NMF de-noising system where no a priori knowledge of the speech and noise models is available.
- Fig. 8 shows a schematic diagram of a device 800 for determining a dictionary of base components 804 from an input signal 802 according to an implementation form.
- the input signal 802 is represented by an input matrix V.
- the device 800 comprises a buffer 803 for storing the input matrix V.
- the device 800 further comprises decomposing means 801 for decomposing the input matrix V into a product of a non-negative base matrix W and a non-negative weight matrix H, wherein the decomposing is constrained by a Wiener entropy measure and wherein components of the non-negative base matrix W represent the dictionary of base components 804 of the input signal 802.
- the dictionary of base components represents a specific audio source of a plurality of audio sources of the audio signal.
- the decomposing means is configured for decomposing the input matrix V by using a Non-negative Matrix Factorization.
- the decomposing means is configured to enforce a sparse decomposition of the non-negative base matrix W.
- the decomposing means comprises means for forming the non-negative weight matrix H such that a Wiener entropy of each column of the non-negative weight matrix H is close to zero.
- the decomposing means comprises means for minimizing a sum of Wiener entropy values of columns of the non-negative weight matrix H by using a cost function.
- the device 800 comprises means for updating the cost function by one of a multiplicative update rule and a gradient descent algorithm.
- the device 800 comprises means for reconstructing a plurality of output signals from the audio signal, the reconstruction being based on the input matrix V, the non-negative base matrix W and the non-negative weight matrix H.
- magnitude spectrograms $S_k$ of the plurality of output signals are determined by a product of a column-vector $W_{:,k}$ constituted by the k-th column of the non-negative base matrix W and a row-vector $H_{k,:}$ constituted by the k-th row of the non-negative weight matrix H.
- the device 800 comprises means for determining a dictionary of base components from a training speech signal according to the method 400 as described above with respect to Fig. 4; and means for forming a non-negative base matrix W of a noisy speech signal by using a semi-supervised Non-negative Matrix Factorization; and means for updating the non-negative base matrix W of the noisy speech signal with the non-negative base matrix $W_S$ of the training speech signal.
- the device 800 comprises means for reconstructing the speech signal based on the updated non-negative base matrix W of the noisy speech signal.
- the decomposing means comprises means for minimizing a weighted sum of Wiener entropy values of columns of the non-negative weight matrix H by using a cost function, the weighting parameters of the sum being the mean values of the columns of the matrix H.
- the present disclosure also supports a computer program product including computer executable code or computer executable instructions that, when executed, causes at least one computer to execute the performing and computing steps described herein.
- the present disclosure also supports a system configured to execute the performing and computing steps described herein.
Landscapes
- Engineering & Computer Science (AREA)
- Computational Linguistics (AREA)
- Quality & Reliability (AREA)
- Signal Processing (AREA)
- Health & Medical Sciences (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Human Computer Interaction (AREA)
- Physics & Mathematics (AREA)
- Acoustics & Sound (AREA)
- Multimedia (AREA)
- Circuit For Audible Band Transducer (AREA)
Description
- The present invention relates to a method and a device for determining a dictionary of base components from an input signal. In particular, the present invention relates to the processing of an acoustic signal input for the estimation of a feature vector dictionary for describing acoustic sources.
- Most audio signals are composed of a plurality of individual sound sources. Musical recordings, for example, usually comprise several instruments. In the case of speech communication, the signal often comprises, in addition to the speech itself, other interfering sounds which are recorded by the same microphone. Such interfering sounds can be, for example, ambient noise or other people talking in the same room.
- Several applications would take advantage of the separation of an audio signal into several parts. One of them is the reduction of acoustic noise in telephonic communication, especially in the case of hands-free systems, where the noise level is often high because of the distance between the microphone and the speaker. Another usage of source separation is the extraction of a target instrument from musical signals, for karaoke or remixing applications.
- Non-negative Matrix Factorization (NMF) was first proposed by Paatero in "Least Squares Formulation of Robust Non-Negative Factor Analysis", Chemometrics and Intelligent Laboratory Systems 37, pp. 23-35, 1997, and has been successfully applied to a wide variety of applications since then. In particular, this technique has become a standard method for audio source separation, where an input audio signal is to be separated into several signals corresponding to the different acoustic sources. It is based on a decomposition of the power spectrogram of the mixture into a non-negative combination of several spectral bases, each associated with one of the sources present. NMF methods have been used in that context with relatively good results. Indeed, the non-negativity constraint which is inherent to this technique complies with the structure of audio spectrograms, and can allow for the decomposition of a sound into meaningful components. These components form a dictionary of spectral bases which describe the signal. The decomposition typically aims to estimate spectral bases corresponding to different "parts" of the spectrogram, e.g. different sounds or speakers. A separation of these parts can then be performed by a partial reconstruction of the signal, considering only the wanted components. This technique has been applied, in particular, to the separation of a target speaker from noisy recordings by C. Joder, F. Weninger, F. Eyben, D. Virette, B. Schuller, "Real-time Speech Separation by Semi-Supervised Nonnegative Matrix Factorization", Proc. International Conference on Latent Variable Analysis and Signal Separation, March 2012.
- The basic principle of NMF-based audio processing 100, as schematically illustrated in Fig. 1, is to find a locally optimal factorization of a short-time magnitude spectrogram V 103 of an audio signal 101 into two factors W and H, of which the first one W represents the spectra of the events occurring in the signal 101 and the second one H their activation over time. The first factor W describes the component spectra of the source model 109. The second factor H describes the activations 107 of the signal spectrogram 103 of the audio signal 101. The first factor W and the second factor H are matched with the short-time magnitude spectrogram V 103 of the audio signal 101 by an optimization procedure. The source model 109 is pre-defined when applying supervised NMF and a joint estimation is applied for the source model 109 when using unsupervised NMF. The source signal or signals 113 can be derived from the source spectrogram 111.
- The conventional formulation of NMF is defined as follows. The matrix V defines an m × n matrix of non-negative real values. The goal is to approximate this matrix by the product of two other non-negative matrices, W of size m × r and H of size r × n, such that V ≈ WH.
- Similarly, some of the spectral bases can be set to constant values, fixed by prior learning. This can be beneficial if one of the sources is known and sufficient data is available to estimate the characteristic spectra of this source. In this case, the corresponding columns of W are not updated. The method in which the matrix W is entirely constant during the decomposition and the method in which the matrix W is entirely updated are called supervised NMF and unsupervised NMF, respectively. In the case where only a part of the spectral bases is updated, the method is called semi-supervised NMF.
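- As a non-limiting sketch, the conventional decomposition and its semi-supervised variant can be illustrated as follows. The update rules used here are the widely known multiplicative updates for the squared Euclidean error; this choice, as well as the function names `nmf` and `frob_error` and the parameters `fixed_w`, `iters`, `eps` and `seed`, are assumptions made for illustration and are not taken from the equations of the present text.

```python
import random

def matmul(a, b):
    """Plain-Python matrix product of two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def transpose(a):
    return [list(col) for col in zip(*a)]

def nmf(v, r, fixed_w=None, iters=200, eps=1e-9, seed=0):
    """Approximate v (m x n) by w (m x r) times h (r x n), all non-negative.

    fixed_w: optional list of pre-learned bases (each of length m). They
    occupy the first columns of w and are never updated (semi-supervised).
    """
    rng = random.Random(seed)
    m, n = len(v), len(v[0])
    n_fixed = len(fixed_w) if fixed_w else 0
    w = [[rng.random() + eps for _ in range(r)] for _ in range(m)]
    for k in range(n_fixed):
        for i in range(m):
            w[i][k] = fixed_w[k][i]
    h = [[rng.random() + eps for _ in range(n)] for _ in range(r)]
    for _ in range(iters):
        # H <- H * (W^T V) / (W^T W H), element-wise
        wt = transpose(w)
        num, den = matmul(wt, v), matmul(matmul(wt, w), h)
        h = [[h[k][j] * num[k][j] / (den[k][j] + eps) for j in range(n)]
             for k in range(r)]
        # W <- W * (V H^T) / (W H H^T), skipping the fixed columns
        ht = transpose(h)
        num, den = matmul(v, ht), matmul(w, matmul(h, ht))
        for i in range(m):
            for k in range(n_fixed, r):
                w[i][k] *= num[i][k] / (den[i][k] + eps)
    return w, h

def frob_error(v, w, h):
    """Squared Euclidean distance between v and the product w h."""
    wh = matmul(w, h)
    return sum((a - b) ** 2 for rv, rw in zip(v, wh) for a, b in zip(rv, rw))
```

- Calling `nmf(v, r)` corresponds to the unsupervised case, passing some bases in `fixed_w` to the semi-supervised case, and passing all r bases to the supervised case, in which only H is updated.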
- The NMF decomposition is illustrated in Fig. 2 by a simple example. The figure represents a spectrogram 201 represented by the matrix V, a matrix of two spectral bases 202 represented by the matrix W and the corresponding temporal weights 203 represented by the matrix H. The greyscale of the spectrogram 201 represents the amplitude of the Fourier coefficients. The spectrogram defines an acoustic scene which can be described as the superposition of two so-called "atomic sounds". By applying a two-component NMF to this spectrogram, the matrices W and H as defined in Fig. 2 can be obtained. Each column of W can be interpreted as a basis function for the spectra contained in V, when weighted with the corresponding values of H.
- Since the spectral bases are non-negative, they correspond to proper magnitude spectra, which can then be used to reconstruct each of the so-called "atomic sounds". The example of Fig. 2 is simplistic; however, the NMF method can provide satisfactory results in separating different sound sources from realistic recordings. In these cases, a larger value of the order of decomposition r is used. Then, each "component", i.e. the product of one spectral basis with the corresponding temporal weights, is assigned to a specific source. The estimated spectrogram of each source is finally obtained by the sum of all the components attributed to the source.
- However, in the conventional NMF method, the estimation of the dictionary of spectral bases often suffers from some inaccuracies and results in components representing several sources at the same time. Indeed, this method minimizes a reconstruction error between the original input and the decomposition, without taking into account the structure of the individual signals. As a result, the estimated bases can capture some unstructured so-called "building blocks" which can be used to reconstruct several sources, whereas the goal is to match each basis to a specific source. In order to overcome this problem, several modifications of the standard NMF method have been proposed, which impose a structure by favoring some properties of the decomposition, such as temporal continuity or component sparsity.
- The sparsity property relates to the fact that the proportion of elements with non-zero value or, more generally, of non-negligible value is very small. In particular, the sparsity of the component activations is often enforced. This property relates to the fact that few components are active at the same time.
- A simple example of the usefulness of a sparsity constraint is represented in Fig. 3. The spectrogram 300 corresponds to the succession of two musical notes, the second one having a pitch one octave higher than the first one. The plots spectrogram 300 can also result in the estimation of the spectrograms
- This constraint is generally achieved by adding a penalty term in the cost function to be minimized. The cost function then becomes
Document "Audio source separation informed by redundancy with greedy multiscale decompositions" (Manuel Moussallam et al., 2012-08-27, pages 2644-2648, XP032254797) describes an algorithm for audio source separation of repeated musical patterns. A time-frequency mask, usually based on the power spectral density of the mixtures, is constructed for the repeating musical background, and the separation is performed by means of Wiener filtering relative to this mask.
Document "Sparse nonnegative matrix factorization with constraints" (Robert Peharz et al., 2012-03-15, XP028356707) discloses nonnegative matrix factorization to factorize a nonnegative matrix X into a product of nonnegative matrices W and H with ℓ0-constraints.
- It is the object of the invention to provide a concept for improving sound source estimation when using Non-Negative Matrix Factorization decompositions.
- This object is achieved by the features of the independent claims. Further implementation forms are apparent from the dependent claims, the description and the figures.
- The invention is based on the finding that sound source estimation is improved when a Wiener entropy-constrained Non-negative Matrix Factorization (WNMF) is used for the factorization which identifies different components of an input signal. Applying this technique to non-negative features describing an input signal, such as magnitude spectra, the features are decomposed into a sparse combination of non-negative feature bases. The decomposition can be used to separate the input signal into several output signals corresponding to different components. The obtained dictionary of feature bases can also be used to separate the corresponding components from another signal, by decomposing this other signal according to the elements of the dictionary.
- Aspects of the invention provide a novel method for enforcing a sparse decomposition, resulting in a dictionary of spectral bases which is more characteristic of the different parts of the signal, as will be presented in the following.
- In order to describe the invention in detail, the following terms, abbreviations and notations will be used:
- audio rendering:
- a reproduction technique capable of creating spatial sound fields in an extended area by means of loudspeakers or loudspeaker arrays,
- NMF:
- Non-negative matrix factorization,
- WNMF:
- Wiener entropy-constrained Non-negative Matrix Factorization.
- Vector 1-norm:
- The vector 1-norm is the matrix norm of an m times n matrix A defined as the sum of the absolute values of its elements,
- Hadamard product:
- The Hadamard product is a binary operation that takes two matrices of the same dimensions, and produces another matrix where each element ij is the product of elements ij of the original two matrices.
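- As a small worked illustration of the two definitions above, with arbitrary example matrices A and B (the use of the symbol ⊗ for the Hadamard product is an assumption of this sketch):

```latex
\|A\|_1 = \sum_{i=1}^{m}\sum_{j=1}^{n} |a_{ij}|, \qquad
(A \otimes B)_{ij} = a_{ij}\, b_{ij}

A = \begin{bmatrix} 1 & 2 \\ 3 & 4 \end{bmatrix},\;
B = \begin{bmatrix} 0 & 1 \\ 2 & 0 \end{bmatrix}
\;\Longrightarrow\;
\|A\|_1 = 10, \qquad
A \otimes B = \begin{bmatrix} 0 & 2 \\ 6 & 0 \end{bmatrix}
```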
- According to a first aspect, the invention relates to a method for determining a dictionary of base components from an audio signal, the audio signal being represented by an input matrix which columns comprise features of the audio signal at different instances in time, the method comprising: decomposing the input matrix into a product of a non-negative base matrix and a non-negative weight matrix, the decomposing being constrained by a Wiener entropy measure with respect to elements of the non-negative weight matrix, wherein components of the non-negative base matrix represent the dictionary of base components of the audio signal.
- The decomposing being constrained by a Wiener entropy measure, also denoted as Wiener entropy constraint, is a new constraint providing a novel method for enforcing sparsity. The Wiener entropy, or spectral flatness, measures how flat a vector is. It is used as a sparsity penalty for NMF. By using that measure, meaningful spectral patterns are estimated and speech separation quality, measured by both signal-based and perceptual criteria, is improved. Compared to standard NMF, the complexity increase is limited. The Wiener entropy constrained NMF (WNMF) can be integrated into any system using NMF.
- In a first possible implementation form of the method according to the first aspect, the dictionary of base components represents a specific audio source of a plurality of audio sources of the audio signal.
- Thus, a specific audio source of a multi-source audio signal can be extracted from the noisy multi-source audio signal.
- In a second possible implementation form of the method according to the first aspect as such or according to the first implementation form of the first aspect, the decomposing is performed by using a Non-negative Matrix Factorization.
- The Wiener entropy measure can thus be adapted to a standard NMF factorization with only little overhead thereby saving computational complexity.
- In a third possible implementation form of the method according to the first aspect as such or according to any of the preceding implementation forms of the first aspect, the decomposing constrained by the Wiener entropy measure is configured to enforce a sparse decomposition of the non-negative base matrix.
- Computing with sparse matrices improves speed and reduces complexity.
- In a fourth possible implementation form of the method according to the first aspect as such or according to any of the preceding implementation forms of the first aspect, the decomposing constrained by the Wiener entropy measure comprises: forming the non-negative weight matrix such that a Wiener entropy of each column of the non-negative weight matrix is close to zero.
- By that specific forming of the H matrix, reconstruction of the original signal is improved.
- In a fifth possible implementation form of the method according to the first aspect as such or according to any of the preceding implementation forms of the first aspect, the decomposing constrained by the Wiener entropy measure comprises: minimizing a weighted sum of Wiener entropy values of columns of the non-negative weight matrix by using a cost function.
- By using a cost function, iterative or recursive adaptations can be applied which are computationally efficient. Reconstruction of the original signal is improved.
- In a sixth possible implementation form of the method according to the fifth implementation form of the first aspect, the cost function is according to
- Such a cost function provides an efficient reconstruction of the original signal.
- In a seventh possible implementation form of the method according to the fifth implementation form or according to the sixth implementation form of the first aspect, the method comprises: updating the cost function by one of a multiplicative update rule and a gradient descent algorithm.
- Multiplicative update rules are easy to implement and gradient descent algorithms converge to the locally optimum solution.
- In an eighth possible implementation form of the method according to the seventh implementation form of the first aspect, the multiplicative update rule is according to:
- These multiplicative update rules are easy to implement and fast converging.
- In a ninth possible implementation form of the method according to the fifth implementation form of the first aspect, the cost function is according to
- Such a cost function provides an efficient reconstruction of the original signal and a homogeneous estimation of the components, regardless of the amplitude of the original signal.
- In a tenth possible implementation form of the method according to the ninth implementation form or according to the sixth implementation form of the first aspect, the method comprises: updating the cost function by one of a multiplicative update rule and a gradient descent algorithm.
- Multiplicative update rules are easy to implement and gradient descent algorithms converge to the locally optimum solution.
- In an eleventh possible implementation form of the method according to the tenth implementation form of the first aspect, the multiplicative update rule is according to:
- These multiplicative update rules are easy to implement and fast converging.
- In a twelfth possible implementation form of the method according to the first aspect as such or according to any of the preceding implementation forms of the first aspect, the method comprises: reconstructing a plurality of output signals from the audio signal, the reconstruction being based on the input matrix, the non-negative base matrix and the non-negative weight matrix.
- The reconstructed signals are noise-reduced and they indicate the source components of the original audio signal.
- In a thirteenth possible implementation form of the method according to the twelfth implementation form of the first aspect, magnitude spectrograms Sk of the plurality of output signals are determined according to:
- The output signals can be superposed in order to obtain signals corresponding to the combination of several components, for source separation applications.
- In a fourteenth possible implementation form of the method according to the twelfth implementation form of the first aspect, magnitude spectrograms Sk of the plurality of output signals are determined by a product of a column-vector W :,k constituted by the k-th column of the non-negative base matrix W and a row-vector H k,: constituted by the k-th row of the non-negative weight matrix H.
- When the output signals are directly reconstructed, computational complexity is reduced.
- In a fifteenth possible implementation form of the method according to the thirteenth implementation form or according to the fourteenth implementation form of the first aspect, the method comprises: constructing output spectrograms by summing several of the magnitude spectrograms Sk of the plurality of output signals.
- In a sixteenth possible implementation form of the method according to the first aspect as such or according to any of the preceding implementation forms of the first aspect, the method comprises: determining a dictionary of base components from a training speech signal according to the first aspect as such or according to any of the preceding implementation forms of the first aspect, and forming a non-negative base matrix of a noisy speech signal by extending the non-negative base matrix of the training speech signal; and updating the non-negative base matrix of the noisy speech signal by using a semi-supervised Non-negative Matrix Factorization.
- When using a training speech signal, source separation is improved as a speech signal which is not corrupted by noise is used for determining the dictionary of base components.
- In a seventeenth possible implementation form of the method according to the sixteenth implementation form of the first aspect, the method comprises: reconstructing the speech signal based on the updated non-negative base matrix of the noisy speech signal.
- Accuracy of the reconstruction is improved when the reconstruction is based on the updated base matrix of the noisy speech signal.
- According to a second aspect, the invention relates to a device for determining a dictionary of base components from an input signal represented by an input matrix, the device comprising: a buffer for storing the input matrix; and means for decomposing the input matrix into a product of a non-negative base matrix and a non-negative weight matrix, wherein the decomposing is constrained by a Wiener entropy measure and wherein components of the non-negative base matrix represent the dictionary of base components of the input signal.
- By using the Wiener entropy measure, meaningful spectral patterns are estimated and thus, speech separation quality, measured by both signal-based and perceptual criteria is improved. The complexity increase is not significant when compared to standard NMF implementations. The Wiener entropy constrained NMF can be integrated into any device using NMF.
- The methods and systems described herein may be implemented as software in a Digital Signal Processor (DSP), in a micro-controller or in any other side-processor or as hardware circuit within an application specific integrated circuit (ASIC). The means for decomposing the input matrix may be implemented as software in a Digital Signal Processor (DSP), in a micro-controller or in any other side-processor or as a hardware unit, e.g. within an application specific integrated circuit (ASIC).
- The invention can be implemented in digital electronic circuitry, or in computer hardware, firmware, software, or in combinations thereof, e.g. in available hardware or software of conventional mobile devices or hands-free communication systems or in new hardware or software dedicated for processing the methods described hereinafter.
- Aspects of the invention provide a method for decomposing a signal according to a Wiener entropy-constrained Non-negative Matrix Factorization (WNMF). This method brings a modification to Non-negative Matrix Factorization (NMF) to enforce a sparse decomposition of a non-negative matrix.
- The Wiener entropy, also called "spectral flatness", of a set of non-negative values is the ratio between the geometric mean and the arithmetic mean of these values. The Wiener entropy is always between zero and one, and it is equal to one if and only if all the values in the set are equal. Intuitively, a large value of the Wiener entropy corresponds to a "flat" plot and a small value corresponds to a "peaky" plot. Hence, to enforce the sparsity property of the NMF decomposition, it has to be ensured that the Wiener entropy of each column of H is small.
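- As a non-limiting sketch, the Wiener entropy defined above can be computed as follows. The function name `wiener_entropy` and the small floor `eps`, which keeps the logarithm defined when a value is exactly zero, are assumptions made for illustration.

```python
import math

def wiener_entropy(values, eps=1e-12):
    """Ratio of the geometric mean to the arithmetic mean of non-negative values."""
    floored = [max(x, eps) for x in values]  # assumed floor, avoids log(0)
    log_geo = sum(math.log(x) for x in floored) / len(floored)
    arith = sum(floored) / len(floored)
    return math.exp(log_geo) / arith

flat = wiener_entropy([2.0, 2.0, 2.0, 2.0])   # all values equal: close to one
peaky = wiener_entropy([1.0, 0.0, 0.0, 0.0])  # a single active value: close to zero
```

- A column of H in which a single component is active therefore yields a value near zero, which is the behaviour the sparsity constraint favors.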
- In the WNMF method, the penalty term used to measure the sparsity of the decomposition is given by a weighted sum of the Wiener entropy values of the columns of the matrix H. The cost function to be minimized is then
- This maximum ensures that the penalty term is positive, and that sparsity is enforced even when one of the weights is equal to zero. A variety of functions can be used for measuring the reconstruction error. In an implementation form, the reconstruction error is defined as
- In an implementation form, the weighting parameters ωj used in the above defined cost function are all set to one.
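- As a non-limiting sketch of a cost of this kind, a reconstruction error can be combined with a weighted sum of the Wiener entropy values of the columns of H. The squared Euclidean reconstruction error, the trade-off parameter `lam`, the floor `eps` and the function names are assumptions made for illustration; the exact cost function of the implementation forms is not reproduced here.

```python
import math

def wiener_entropy(values, eps=1e-12):
    """Geometric mean over arithmetic mean, with an assumed floor eps."""
    floored = [max(x, eps) for x in values]
    log_geo = sum(math.log(x) for x in floored) / len(floored)
    return math.exp(log_geo) / (sum(floored) / len(floored))

def wnmf_cost(v, w, h, lam=1.0, weights=None, eps=1e-12):
    """Reconstruction error plus lam times the weighted column-entropy penalty."""
    m, n, r = len(v), len(v[0]), len(h)
    # Reconstruction error: squared Euclidean distance (an assumption).
    err = 0.0
    for i in range(m):
        for j in range(n):
            approx = sum(w[i][k] * h[k][j] for k in range(r))
            err += (v[i][j] - approx) ** 2
    # Weights omega_j default to one, as in the implementation form above.
    if weights is None:
        weights = [1.0] * n
    penalty = sum(weights[j] * wiener_entropy([h[k][j] for k in range(r)], eps)
                  for j in range(n))
    return err + lam * penalty
```

- With equal reconstruction error, a weight matrix whose columns each contain a single dominant entry yields a lower cost than one whose columns are flat.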
- In an alternative implementation form, gradient-descent algorithms are applied instead of these multiplicative updates.
- Hence, the sparsity penalty applied to each instance in time is approximately proportional to the amplitude of the input signal at the corresponding instance in time.
- Another advantage of this setting is that the complexity of the parameter updates is reduced compared to the previous implementation.
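- The reduced complexity can be made plausible by a short derivation. As a sketch, assume that the Wiener entropy of column j of H is the ratio of the geometric mean to the arithmetic mean of its r entries, that the weight ω_j is the arithmetic mean of that column, and that any lower bound applied to the entries is inactive. The symbols h_j, G and A are introduced here for illustration only:

```latex
\mathrm{WE}(h_j) = \frac{G(h_j)}{A(h_j)}, \qquad
G(h_j) = \Bigl(\prod_{k=1}^{r} H_{kj}\Bigr)^{1/r}, \qquad
A(h_j) = \frac{1}{r}\sum_{k=1}^{r} H_{kj}

\omega_j = A(h_j) \;\Longrightarrow\;
\sum_{j=1}^{n} \omega_j\, \mathrm{WE}(h_j) = \sum_{j=1}^{n} G(h_j)
```

- Under these assumptions the arithmetic mean cancels, so that the penalty reduces to the sum of the geometric means of the columns of H and scales with the level of the weights at each instance in time.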
- Further embodiments of the invention will be described with respect to the following figures, in which:
- Fig. 1 shows a schematic diagram 100 of a conventional Non-negative Matrix Factorization (NMF) technique;
- Fig. 2 shows three schematic diagrams 201, 202, 203 representing V, W and H matrices of a conventional Non-negative Matrix Factorization decomposition;
- Fig. 3 shows exemplary spectrograms of two musical notes, of the succession of these two musical notes 300, and of reconstructions;
- Fig. 4 shows a schematic diagram of a method 440 for determining a dictionary of base components from an audio signal by performing a WNMF decomposition according to an implementation form;
- Fig. 5 shows a schematic diagram of a method 500 for decomposing an audio signal into a dictionary of base components and reconstructing a set of audio signals according to an implementation form;
- Fig. 6 shows a schematic diagram of a method 600 for decomposing an audio signal into a dictionary of base components applied to a noisy speech signal according to an implementation form;
- Fig. 7 shows a schematic diagram of a de-noising system 700 according to an implementation form; and
- Fig. 8 shows a schematic diagram of a device 800 for determining a dictionary of base components 804 from an audio signal 802 according to an implementation form.
- Fig. 4 shows a schematic diagram of a method 440 for determining a dictionary of base components from an audio signal by performing a WNMF decomposition according to an implementation form.
- The method 440 performs a WNMF decomposition 400 from a digital single-channel acoustic signal 401. The digital input signal 401 is input to a short-time transform module 410, which performs a windowing into short-time frames and a transform, so as to produce non-negative feature vectors 411, e.g. magnitude spectra. A buffer 420 stores these features in order to produce the matrix V 421. The WNMF module 430 then performs a decomposition of the matrix V 421, representing the magnitude spectra of the input signal. The outputs of this module are the matrices W 431 and H 432, which represent respectively the dictionary of feature bases and the temporal weights of these bases.
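- As a non-limiting sketch of the short-time transform and buffering stages, the matrix V can be assembled from windowed frames as follows. The frame length, the hop size, the Hann window, the direct evaluation of the discrete Fourier transform and the function name `magnitude_spectrogram` are assumptions made for illustration.

```python
import cmath
import math

def magnitude_spectrogram(signal, frame_len=8, hop=4):
    """Return V as a list of rows, V[f][t] = magnitude of bin f in frame t."""
    window = [0.5 - 0.5 * math.cos(2 * math.pi * i / frame_len)
              for i in range(frame_len)]
    n_bins = frame_len // 2 + 1
    columns = []
    for start in range(0, len(signal) - frame_len + 1, hop):
        # Windowing into a short-time frame.
        frame = [signal[start + i] * window[i] for i in range(frame_len)]
        # Transform: magnitude of the DFT, a non-negative feature vector.
        col = [abs(sum(frame[i] * cmath.exp(-2j * math.pi * f * i / frame_len)
                       for i in range(frame_len)))
               for f in range(n_bins)]
        columns.append(col)
    # Buffering: the per-frame feature vectors become the columns of V.
    return [list(row) for row in zip(*columns)]
```

- Each column of the returned matrix holds the features of one instance in time, which matches the layout of the input matrix V used by the decomposition.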
- Fig. 5 shows a schematic diagram of a system 500 for decomposing an audio signal into a dictionary of base components and reconstructing a set of audio signals according to an implementation form. The system 500 is adapted for separating a single-channel acoustic signal into several components (here r = 3). The system 500 comprises a factorization element 400 performing the WNMF decomposition 400 as described above with respect to Fig. 4 and a reconstruction element 510. The factorization element 400 takes as input an acoustic signal 401 and estimates a dictionary of feature bases 431 and the corresponding temporal weights 432 describing the signal. The result of the decomposition is input to the reconstruction module 510, which produces several output signals.
- In an implementation form, the reconstruction module 510 exploits a so-called "soft mask" approach as described in the following. For k = 1 ... 3, W :,k is the column-vector constituted by the k-th column of W and H k,: is the row-vector constituted by the k-th row of H. A magnitude spectrogram Sk is calculated as:
- The three obtained matrices constitute the magnitude spectrograms of the three output signals. The time-domain signals are then obtained by a standard approach, involving an inverse Fourier transform exploiting the phase of the original complex spectrogram, followed by an overlap-add procedure.
- In an implementation form, the output signals are then superposed in order to obtain signals corresponding to the combination of several components, for a source separation application.
- In another implementation form, the magnitude spectrograms of the output signals are directly reconstructed as Sk = W :,k · H k,: .
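- As a non-limiting sketch, both reconstruction variants can be illustrated as follows. The direct variant follows the product Sk = W :,k · H k,: given above. For the soft mask, the commonly used form in which V is multiplied element-wise by the ratio of the k-th component to the full product WH is assumed; this form, the floor `eps` and the function name `component_spectrograms` are assumptions made for illustration.

```python
def component_spectrograms(v, w, h, soft_mask=True, eps=1e-12):
    """Return one magnitude spectrogram per component k of the decomposition."""
    m, n, r = len(v), len(v[0]), len(h)
    # Full model W H, used as the denominator of the assumed soft mask.
    wh = [[sum(w[i][k] * h[k][j] for k in range(r)) for j in range(n)]
          for i in range(m)]
    out = []
    for k in range(r):
        # Direct reconstruction: k-th column of W times k-th row of H.
        s_k = [[w[i][k] * h[k][j] for j in range(n)] for i in range(m)]
        if soft_mask:
            # Hadamard product of V with the element-wise ratio s_k / (W H).
            s_k = [[v[i][j] * s_k[i][j] / (wh[i][j] + eps) for j in range(n)]
                   for i in range(m)]
        out.append(s_k)
    return out
```

- With the assumed soft mask, the component spectrograms sum to V wherever the product WH is non-zero, whereas the direct variant sums to the product WH.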
- The components of the system 500 described above may also be implemented as steps of a method.
- Fig. 6 shows a schematic diagram of a system 600 for decomposing an audio signal into a dictionary of base components applied to a noisy speech signal according to an implementation form. The decomposition is applied to the reduction of noise in a noisy speech signal. This system 600 involves a prior training phase 610 which comprises a factorization element 400 performing the WNMF decomposition 400 as described above with respect to Fig. 4. In the training phase, a training speech signal 601 is input to the factorization element 400, which computes a dictionary of feature bases 611 and a matrix of temporal weights 612 corresponding to the WNMF decomposition of the training signal.
- The system 600 further comprises a short-time transform 630, a buffer 640, a semi-supervised NMF module 650 and a reconstruction module 660. A single-channel noisy speech signal 621 undergoes a short-time transform 630 which calculates non-negative features 631, similarly to the element 410 described above with respect to Fig. 4. The buffer 640 stores these features to produce a matrix V 641. This matrix undergoes a decomposition using semi-supervised NMF 650, where the feature bases corresponding to speech are set to the values of the dictionary 611 given by the training phase. The other bases are updated by the semi-supervised NMF. The outputs of this decomposition 650 are the dictionary W 651 and the corresponding weights H 652. These matrices are used by a reconstruction element 660, which produces the de-noised speech signal.
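- As a non-limiting sketch of the semi-supervised decomposition 650, the dictionary can be written as the concatenation of the fixed speech bases and the freely updated remaining bases. The symbols W_S, W_N, H_S, H_N and the estimate on the last line are notation introduced here for illustration; the masking step assumes the soft-mask form commonly used with NMF and is not taken from the equations of the present text.

```latex
W = \begin{bmatrix} W_S & W_N \end{bmatrix}, \qquad
H = \begin{bmatrix} H_S \\ H_N \end{bmatrix}, \qquad
V \approx W H = W_S H_S + W_N H_N

\hat{S}_{\mathrm{speech}} = V \otimes \frac{W_S H_S}{W H}
\quad \text{(element-wise product and division)}
```

- Here W_S is held at the values of the dictionary 611, while W_N, H_S and H_N are updated on the noisy signal.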
- The time-domain signal is then obtained by the same approach as described above with respect to Fig. 5 for the reconstruction element 510.
- In an implementation form, the semi-supervised NMF 650 is replaced by a WNMF decomposition 400 as described above with respect to Fig. 4.
- In yet another implementation form, a noise training phase similar to 610 is performed to estimate a noise feature dictionary from a training recording of noise. In this case, the dictionary W 651 is defined as the concatenation of the speech dictionary 611 and the noise dictionary, and the semi-supervised NMF 650 is replaced by a supervised NMF.
- The components of the system 600 described above may also be implemented as steps of a method.
- Fig. 7 shows a schematic diagram of a de-noising system 700 according to an implementation form.
- In a training phase, spectral components W speaker 713 and W noise 715 are estimated from clean speech V speaker 701 and noise V noise 703 separately using WNMF. The spectral components W speaker 713 and W noise 715 are fed to a de-noising system 711, which exploits them to separate speech from noise. The noise components are estimated on the noisy speech V mix 705 without noise training by the de-noising system 711, which provides the de-noised speech 717.
- In an implementation form, the de-noising system 711 is a supervised system. In an implementation form, the de-noising system 711 is a semi-supervised system. In an implementation form, the de-noising system 711 is an unsupervised NMF de-noising system where no a priori knowledge of the speech and noise models is available.
- Fig. 8 shows a schematic diagram of a device 800 for determining a dictionary of base components 804 from an input signal 802 according to an implementation form. The input signal 802 is represented by an input matrix V. The device 800 comprises a buffer 803 for storing the input matrix V. The device 800 further comprises decomposing means 801 for decomposing the input matrix V into a product of a non-negative base matrix W and a non-negative weight matrix H, wherein the decomposing is constrained by a Wiener entropy measure and wherein components of the non-negative base matrix W represent the dictionary of base components 804 of the input signal 802.
- In an implementation form, the dictionary of base components represents a specific audio source of a plurality of audio sources of the audio signal. In an implementation form, the decomposing means is configured for decomposing the input matrix V by using a Non-negative Matrix Factorization. In an implementation form, the decomposing means is configured to enforce a sparse decomposition of the non-negative base matrix W. In an implementation form, the decomposing means comprises means for forming the non-negative weight matrix H such that a Wiener entropy of each column of the non-negative weight matrix H is close to zero. In an implementation form, the decomposing means comprises means for minimizing a sum of Wiener entropy values of columns of the non-negative weight matrix H by using a cost function. In an implementation form, the cost function is according to
- In an implementation form, the device 800 comprises means for updating the cost function by one of a multiplicative update rule and a gradient descent algorithm. In an implementation form, the multiplicative update rule is according to:
- In an implementation form, the device 800 comprises means for reconstructing a plurality of output signals from the audio signal, the reconstruction being based on the input matrix V, the non-negative base matrix W and the non-negative weight matrix H. In an implementation form, magnitude spectrograms Sk of the plurality of output signals are determined according to:
- In an implementation form, magnitude spectrograms Sk of the plurality of output signals are determined by a product of a column-vector W :,k constituted by the k-th column of the non-negative base matrix W and a row-vector H k,: constituted by the k-th row of the non-negative weight matrix H.
- In an implementation form, the device 800 comprises means for determining a dictionary of base components from a training speech signal according to the method 400 as described above with respect to Fig. 4; and means for forming a non-negative base matrix W of a noisy speech signal by using a semi-supervised Non-negative Matrix Factorization; and means for updating the non-negative base matrix W of the noisy speech signal with the non-negative base matrix WS of the training speech signal. In an implementation form, the device 800 comprises means for reconstructing the speech signal based on the updated non-negative base matrix W of the noisy speech signal.
- In another implementation form, the decomposing means comprises means for minimizing a weighted sum of Wiener entropy values of columns of the non-negative weight matrix H by using a cost function, the weighting parameters of the sum being the mean values of the columns of the matrix H. In an implementation form, the cost function is according to
- From the foregoing, it will be apparent to those skilled in the art that a variety of methods, systems, computer programs on recording media, and the like, are provided.
- The present disclosure also supports a computer program product including computer executable code or computer executable instructions that, when executed, causes at least one computer to execute the performing and computing steps described herein.
- The present disclosure also supports a system configured to execute the performing and computing steps described herein.
- Many alternatives, modifications, and variations will be apparent to those skilled in the art in light of the above teachings. Of course, those skilled in the art readily recognize that there are numerous applications of the invention beyond those described herein. While the present invention has been described with reference to one or more particular embodiments, those skilled in the art recognize that many changes may be made thereto. It is therefore to be understood that within the scope of the appended claims, the invention may be practiced otherwise than as specifically described herein.
Claims (13)
- A method (440) for determining a dictionary of base components (431) from an audio signal (401), the audio signal (401) being represented by an input matrix (V) which columns comprise features of the audio signal (401) at different instances in time, the method (440) comprising: decomposing (430) the input matrix (V) into a product of a non-negative base matrix (W) and a non-negative weight matrix (H), the decomposing (430) being constrained by a Wiener entropy measure with respect to elements of the non-negative weight matrix (H), wherein components of the non-negative base matrix (W) represent the dictionary of base components (431) of the audio signal (401); wherein the decomposing (430) constrained by the Wiener entropy measure comprises: minimizing a weighted sum of Wiener entropy values of columns of the non-negative weight matrix (H) by using a cost function, and wherein the cost function is according to
- The method (440) of claim 1, wherein the dictionary of base components (431) represents a specific audio source of a plurality of audio sources of the audio signal (401).
- The method (440) of claim 1 or claim 2, wherein the decomposing (430) uses a Non-negative Matrix Factorization.
- The method (440) of one of the preceding claims, wherein the decomposing (430) constrained by the Wiener entropy measure is configured to enforce a sparse decomposition of the non-negative base matrix (W).
- The method (440) of one of the preceding claims, wherein the decomposing (430) constrained by the Wiener entropy measure comprises: forming the non-negative weight matrix (H) such that a Wiener entropy of each column of the non-negative weight matrix (H) is close to zero.
- The method (440) of claim 1, comprising: updating the cost function by one of a multiplicative update rule and a gradient descent algorithm.
- The method (440) of claim 6, wherein the multiplicative update rule is according to:
- The method (440) of claim 6, wherein the multiplicative update rule is according to:
- The method (500) of one of the preceding claims, comprising: reconstructing (510) a plurality of output signals (511, 512, 513) from the audio signal (401), the reconstruction (510) being based on the input matrix (V), the non-negative base matrix (W) and the non-negative weight matrix (H).
- The method (500) of claim 9, wherein magnitude spectrograms S_k of the plurality of output signals (511, 512, 513) are determined according to:
wherein magnitude spectrograms S_k of the plurality of output signals (511, 512, 513) are determined by a product of a column-vector W_{:,k} constituted by the k-th column of the non-negative base matrix W and a row-vector H_{k,:} constituted by the k-th row of the non-negative weight matrix H.
- The method (500) of claim 10, comprising: constructing output spectrograms by summing several of the magnitude spectrograms S_k of the plurality of output signals (511, 512, 513).
- The method (600) of one of the preceding claims, comprising: determining (610) a dictionary of base components (611) from a training speech signal (601) according to one of the methods 1 to 13; and forming (651) a non-negative base matrix (W) of a noisy speech signal (621) by extending the non-negative base matrix (WS) of the training speech signal (601); and updating the non-negative base matrix (W) of the noisy speech signal (621) by using a semi-supervised Non-negative Matrix Factorization.
- Device (800) for determining a dictionary of base components (804) from an input signal (802) represented by an input matrix (V), the device (800) comprising: a buffer (803) for storing the input matrix (V); and means (801) for decomposing the input matrix (V) into a product of a non-negative base matrix (W) and a non-negative weight matrix (H), wherein the decomposing is constrained by a Wiener entropy measure and wherein components of the non-negative base matrix (W) represent the dictionary of base components (804) of the input signal (802); wherein the decomposing constrained by the Wiener entropy measure comprises: minimizing a weighted sum of Wiener entropy values of columns of the non-negative weight matrix (H) by using a cost function, and wherein the cost function is according to
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
PCT/EP2012/073149 WO2014079484A1 (en) | 2012-11-21 | 2012-11-21 | Method for determining a dictionary of base components from an audio signal |
Publications (2)
Publication Number | Publication Date |
---|---|
EP2912660A1 (en) | 2015-09-02 |
EP2912660B1 (en) | 2017-01-11 |
Family
ID=47278271
Family Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
EP12794680.4A Active EP2912660B1 (en) | 2012-11-21 | 2012-11-21 | Method for determining a dictionary of base components from an audio signal |
Country Status (2)
Country | Link |
---|---|
EP (1) | EP2912660B1 (en) |
WO (1) | WO2014079484A1 (en) |
Families Citing this family (10)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
WO2017143095A1 (en) * | 2016-02-16 | 2017-08-24 | Red Pill VR, Inc. | Real-time adaptive audio source separation |
CN105976806B (en) * | 2016-04-26 | 2019-08-02 | 西南交通大学 | Active noise control method based on maximum entropy |
WO2017217412A1 (en) * | 2016-06-16 | 2017-12-21 | 日本電気株式会社 | Signal processing device, signal processing method, and computer-readable recording medium |
US10679646B2 (en) | 2016-06-16 | 2020-06-09 | Nec Corporation | Signal processing device, signal processing method, and computer-readable recording medium |
JP6615733B2 (en) * | 2016-11-01 | 2019-12-04 | 日本電信電話株式会社 | Signal analysis apparatus, method, and program |
CN106897685A (en) * | 2017-02-17 | 2017-06-27 | 深圳大学 | Face identification method and system that dictionary learning and sparse features based on core Non-negative Matrix Factorization are represented |
CN109829481B (en) * | 2019-01-04 | 2020-10-30 | 北京邮电大学 | Image classification method and device, electronic equipment and readable storage medium |
CN110428848B (en) * | 2019-06-20 | 2021-10-29 | 西安电子科技大学 | Speech enhancement method based on public space speech model prediction |
CN111009256B (en) * | 2019-12-17 | 2022-12-27 | 北京小米智能科技有限公司 | Audio signal processing method and device, terminal and storage medium |
CN111179960B (en) * | 2020-03-06 | 2022-10-18 | 北京小米松果电子有限公司 | Audio signal processing method and device and storage medium |
2012
- 2012-11-21 EP EP12794680.4A patent/EP2912660B1/en active Active
- 2012-11-21 WO PCT/EP2012/073149 patent/WO2014079484A1/en active Application Filing
Non-Patent Citations (1)
Title |
---|
None * |
Also Published As
Publication number | Publication date |
---|---|
WO2014079484A1 (en) | 2014-05-30 |
EP2912660A1 (en) | 2015-09-02 |
Similar Documents
Publication | Title | Publication Date |
---|---|---|
EP2912660B1 (en) | Method for determining a dictionary of base components from an audio signal | |
Kameoka et al. | A multipitch analyzer based on harmonic temporal structured clustering | |
Virtanen et al. | Compositional models for audio processing: Uncovering the structure of sound mixtures | |
EP2877993B1 (en) | Method and device for reconstructing a target signal from a noisy input signal | |
Grais et al. | Single channel speech music separation using nonnegative matrix factorization and spectral masks | |
Ozerov et al. | Multichannel nonnegative matrix factorization in convolutive mixtures for audio source separation | |
Sprechmann et al. | Real-time Online Singing Voice Separation from Monaural Recordings Using Robust Low-rank Modeling. | |
Hassan et al. | A comparative study of blind source separation for bioacoustics sounds based on FastICA, PCA and NMF | |
Nie et al. | Deep learning based speech separation via NMF-style reconstructions | |
Mohammadiha et al. | Speech dereverberation using non-negative convolutive transfer function and spectro-temporal modeling | |
Adiloğlu et al. | Variational Bayesian inference for source separation and robust feature extraction | |
Duong et al. | An interactive audio source separation framework based on non-negative matrix factorization | |
Lyubimov et al. | Non-negative matrix factorization with linear constraints for single-channel speech enhancement | |
Jao et al. | Monaural music source separation using convolutional sparse coding | |
Kantamaneni et al. | Speech enhancement with noise estimation and filtration using deep learning models | |
Şimşekli et al. | Non-negative tensor factorization models for Bayesian audio processing | |
Duong et al. | Gaussian modeling-based multichannel audio source separation exploiting generic source spectral model | |
Li et al. | Blind monaural singing voice separation using rank-1 constraint robust principal component analysis and vocal activity detection | |
Sprechmann et al. | Learnable low rank sparse models for speech denoising | |
Baby | Supervised speech dereverberation in noisy environments using exemplar-based sparse representations | |
Ben Messaoud et al. | Sparse representations for single channel speech enhancement based on voiced/unvoiced classification | |
Li et al. | FastMVAE: A fast optimization algorithm for the multichannel variational autoencoder method | |
Adiloğlu et al. | A general variational Bayesian framework for robust feature extraction in multisource recordings | |
Lee et al. | Discriminative training of complex-valued deep recurrent neural network for singing voice separation | |
Shin et al. | Auxiliary-function-based independent vector analysis using generalized inter-clique dependence source models with clique variance estimation |
Legal Events
Code | Title | Description |
---|---|---|
PUAI | Public reference made under article 153(3) epc to a published international application that has entered the european phase | Free format text: ORIGINAL CODE: 0009012 |
17P | Request for examination filed | Effective date: 20150522 |
AK | Designated contracting states | Kind code of ref document: A1 Designated state(s): AL AT BE BG CH CY CZ DE DK EE ES FI FR GB GR HR HU IE IS IT LI LT LU LV MC MK MT NL NO PL PT RO RS SE SI SK SM TR |
AX | Request for extension of the european patent | Extension state: BA ME |
DAX | Request for extension of the european patent (deleted) | |
17Q | First examination report despatched | Effective date: 20160404 |
GRAP | Despatch of communication of intention to grant a patent | Free format text: ORIGINAL CODE: EPIDOSNIGR1 |
INTG | Intention to grant announced | Effective date: 20160715 |
GRAS | Grant fee paid | Free format text: ORIGINAL CODE: EPIDOSNIGR3 |
GRAA | (expected) grant | Free format text: ORIGINAL CODE: 0009210 |
AK | Designated contracting states | Kind code of ref document: B1 Designated state(s): AL AT BE BG CH CY CZ DE DK EE ES FI FR GB GR HR HU IE IS IT LI LT LU LV MC MK MT NL NO PL PT RO RS SE SI SK SM TR |
REG | Reference to a national code | Ref country code: GB Ref legal event code: FG4D |
REG | Reference to a national code | Ref country code: CH Ref legal event code: EP |
REG | Reference to a national code | Ref country code: AT Ref legal event code: REF Ref document number: 861922 Country of ref document: AT Kind code of ref document: T Effective date: 20170115 |
REG | Reference to a national code | Ref country code: IE Ref legal event code: FG4D |
REG | Reference to a national code | Ref country code: DE Ref legal event code: R096 Ref document number: 602012027800 Country of ref document: DE |
REG | Reference to a national code | Ref country code: LT Ref legal event code: MG4D |
REG | Reference to a national code | Ref country code: NL Ref legal event code: MP Effective date: 20170111 |
REG | Reference to a national code | Ref country code: AT Ref legal event code: MK05 Ref document number: 861922 Country of ref document: AT Kind code of ref document: T Effective date: 20170111 |
PG25 | Lapsed in a contracting state [announced via postgrant information from national office to epo] | Ref country code: NL Free format text: LAPSE BECAUSE OF FAILURE TO SUBMIT A TRANSLATION OF THE DESCRIPTION OR TO PAY THE FEE WITHIN THE PRESCRIBED TIME-LIMIT Effective date: 20170111 |
PG25 | Lapsed in a contracting state [announced via postgrant information from national office to epo] | Ref country code: FI Free format text: LAPSE BECAUSE OF FAILURE TO SUBMIT A TRANSLATION OF THE DESCRIPTION OR TO PAY THE FEE WITHIN THE PRESCRIBED TIME-LIMIT Effective date: 20170111; Ref country code: HR Free format text: LAPSE BECAUSE OF FAILURE TO SUBMIT A TRANSLATION OF THE DESCRIPTION OR TO PAY THE FEE WITHIN THE PRESCRIBED TIME-LIMIT Effective date: 20170111; Ref country code: LT Free format text: LAPSE BECAUSE OF FAILURE TO SUBMIT A TRANSLATION OF THE DESCRIPTION OR TO PAY THE FEE WITHIN THE PRESCRIBED TIME-LIMIT Effective date: 20170111; Ref country code: NO Free format text: LAPSE BECAUSE OF FAILURE TO SUBMIT A TRANSLATION OF THE DESCRIPTION OR TO PAY THE FEE WITHIN THE PRESCRIBED TIME-LIMIT Effective date: 20170411; Ref country code: GR Free format text: LAPSE BECAUSE OF FAILURE TO SUBMIT A TRANSLATION OF THE DESCRIPTION OR TO PAY THE FEE WITHIN THE PRESCRIBED TIME-LIMIT Effective date: 20170412; Ref country code: IS Free format text: LAPSE BECAUSE OF FAILURE TO SUBMIT A TRANSLATION OF THE DESCRIPTION OR TO PAY THE FEE WITHIN THE PRESCRIBED TIME-LIMIT Effective date: 20170511 |
PG25 | Lapsed in a contracting state [announced via postgrant information from national office to epo] | Ref country code: PT Free format text: LAPSE BECAUSE OF FAILURE TO SUBMIT A TRANSLATION OF THE DESCRIPTION OR TO PAY THE FEE WITHIN THE PRESCRIBED TIME-LIMIT Effective date: 20170511; Ref country code: RS Free format text: LAPSE BECAUSE OF FAILURE TO SUBMIT A TRANSLATION OF THE DESCRIPTION OR TO PAY THE FEE WITHIN THE PRESCRIBED TIME-LIMIT Effective date: 20170111; Ref country code: AT Free format text: LAPSE BECAUSE OF FAILURE TO SUBMIT A TRANSLATION OF THE DESCRIPTION OR TO PAY THE FEE WITHIN THE PRESCRIBED TIME-LIMIT Effective date: 20170111; Ref country code: ES Free format text: LAPSE BECAUSE OF FAILURE TO SUBMIT A TRANSLATION OF THE DESCRIPTION OR TO PAY THE FEE WITHIN THE PRESCRIBED TIME-LIMIT Effective date: 20170111; Ref country code: LV Free format text: LAPSE BECAUSE OF FAILURE TO SUBMIT A TRANSLATION OF THE DESCRIPTION OR TO PAY THE FEE WITHIN THE PRESCRIBED TIME-LIMIT Effective date: 20170111; Ref country code: SE Free format text: LAPSE BECAUSE OF FAILURE TO SUBMIT A TRANSLATION OF THE DESCRIPTION OR TO PAY THE FEE WITHIN THE PRESCRIBED TIME-LIMIT Effective date: 20170111; Ref country code: BG Free format text: LAPSE BECAUSE OF FAILURE TO SUBMIT A TRANSLATION OF THE DESCRIPTION OR TO PAY THE FEE WITHIN THE PRESCRIBED TIME-LIMIT Effective date: 20170411; Ref country code: PL Free format text: LAPSE BECAUSE OF FAILURE TO SUBMIT A TRANSLATION OF THE DESCRIPTION OR TO PAY THE FEE WITHIN THE PRESCRIBED TIME-LIMIT Effective date: 20170111 |
REG | Reference to a national code | Ref country code: FR Ref legal event code: PLFP Year of fee payment: 6; Ref country code: DE Ref legal event code: R097 Ref document number: 602012027800 Country of ref document: DE |
PG25 | Lapsed in a contracting state [announced via postgrant information from national office to epo] | Ref country code: IT Free format text: LAPSE BECAUSE OF FAILURE TO SUBMIT A TRANSLATION OF THE DESCRIPTION OR TO PAY THE FEE WITHIN THE PRESCRIBED TIME-LIMIT Effective date: 20170111; Ref country code: CZ Free format text: LAPSE BECAUSE OF FAILURE TO SUBMIT A TRANSLATION OF THE DESCRIPTION OR TO PAY THE FEE WITHIN THE PRESCRIBED TIME-LIMIT Effective date: 20170111; Ref country code: EE Free format text: LAPSE BECAUSE OF FAILURE TO SUBMIT A TRANSLATION OF THE DESCRIPTION OR TO PAY THE FEE WITHIN THE PRESCRIBED TIME-LIMIT Effective date: 20170111; Ref country code: RO Free format text: LAPSE BECAUSE OF FAILURE TO SUBMIT A TRANSLATION OF THE DESCRIPTION OR TO PAY THE FEE WITHIN THE PRESCRIBED TIME-LIMIT Effective date: 20170111; Ref country code: SK Free format text: LAPSE BECAUSE OF FAILURE TO SUBMIT A TRANSLATION OF THE DESCRIPTION OR TO PAY THE FEE WITHIN THE PRESCRIBED TIME-LIMIT Effective date: 20170111 |
PLBE | No opposition filed within time limit | Free format text: ORIGINAL CODE: 0009261 |
STAA | Information on the status of an ep patent application or granted ep patent | Free format text: STATUS: NO OPPOSITION FILED WITHIN TIME LIMIT |
PG25 | Lapsed in a contracting state [announced via postgrant information from national office to epo] | Ref country code: DK Free format text: LAPSE BECAUSE OF FAILURE TO SUBMIT A TRANSLATION OF THE DESCRIPTION OR TO PAY THE FEE WITHIN THE PRESCRIBED TIME-LIMIT Effective date: 20170111; Ref country code: SM Free format text: LAPSE BECAUSE OF FAILURE TO SUBMIT A TRANSLATION OF THE DESCRIPTION OR TO PAY THE FEE WITHIN THE PRESCRIBED TIME-LIMIT Effective date: 20170111 |
26N | No opposition filed | Effective date: 20171012 |
PG25 | Lapsed in a contracting state [announced via postgrant information from national office to epo] | Ref country code: SI Free format text: LAPSE BECAUSE OF FAILURE TO SUBMIT A TRANSLATION OF THE DESCRIPTION OR TO PAY THE FEE WITHIN THE PRESCRIBED TIME-LIMIT Effective date: 20170111 |
PG25 | Lapsed in a contracting state [announced via postgrant information from national office to epo] | Ref country code: MC Free format text: LAPSE BECAUSE OF FAILURE TO SUBMIT A TRANSLATION OF THE DESCRIPTION OR TO PAY THE FEE WITHIN THE PRESCRIBED TIME-LIMIT Effective date: 20170111 |
PG25 | Lapsed in a contracting state [announced via postgrant information from national office to epo] | Ref country code: CH Free format text: LAPSE BECAUSE OF NON-PAYMENT OF DUE FEES Effective date: 20171130; Ref country code: LI Free format text: LAPSE BECAUSE OF NON-PAYMENT OF DUE FEES Effective date: 20171130 |
PG25 | Lapsed in a contracting state [announced via postgrant information from national office to epo] | Ref country code: LU Free format text: LAPSE BECAUSE OF NON-PAYMENT OF DUE FEES Effective date: 20171121 |
REG | Reference to a national code | Ref country code: BE Ref legal event code: MM Effective date: 20171130 |
REG | Reference to a national code | Ref country code: IE Ref legal event code: MM4A |
PG25 | Lapsed in a contracting state [announced via postgrant information from national office to epo] | Ref country code: MT Free format text: LAPSE BECAUSE OF NON-PAYMENT OF DUE FEES Effective date: 20171121 |
REG | Reference to a national code | Ref country code: FR Ref legal event code: PLFP Year of fee payment: 7 |
PG25 | Lapsed in a contracting state [announced via postgrant information from national office to epo] | Ref country code: IE Free format text: LAPSE BECAUSE OF NON-PAYMENT OF DUE FEES Effective date: 20171121 |
PG25 | Lapsed in a contracting state [announced via postgrant information from national office to epo] | Ref country code: BE Free format text: LAPSE BECAUSE OF NON-PAYMENT OF DUE FEES Effective date: 20171130 |
PG25 | Lapsed in a contracting state [announced via postgrant information from national office to epo] | Ref country code: HU Free format text: LAPSE BECAUSE OF FAILURE TO SUBMIT A TRANSLATION OF THE DESCRIPTION OR TO PAY THE FEE WITHIN THE PRESCRIBED TIME-LIMIT; INVALID AB INITIO Effective date: 20121121 |
PG25 | Lapsed in a contracting state [announced via postgrant information from national office to epo] | Ref country code: CY Free format text: LAPSE BECAUSE OF FAILURE TO SUBMIT A TRANSLATION OF THE DESCRIPTION OR TO PAY THE FEE WITHIN THE PRESCRIBED TIME-LIMIT Effective date: 20170111 |
PG25 | Lapsed in a contracting state [announced via postgrant information from national office to epo] | Ref country code: MK Free format text: LAPSE BECAUSE OF FAILURE TO SUBMIT A TRANSLATION OF THE DESCRIPTION OR TO PAY THE FEE WITHIN THE PRESCRIBED TIME-LIMIT Effective date: 20170111 |
PG25 | Lapsed in a contracting state [announced via postgrant information from national office to epo] | Ref country code: TR Free format text: LAPSE BECAUSE OF FAILURE TO SUBMIT A TRANSLATION OF THE DESCRIPTION OR TO PAY THE FEE WITHIN THE PRESCRIBED TIME-LIMIT Effective date: 20170111 |
PG25 | Lapsed in a contracting state [announced via postgrant information from national office to epo] | Ref country code: AL Free format text: LAPSE BECAUSE OF FAILURE TO SUBMIT A TRANSLATION OF THE DESCRIPTION OR TO PAY THE FEE WITHIN THE PRESCRIBED TIME-LIMIT Effective date: 20170111 |
PGFP | Annual fee paid to national office [announced via postgrant information from national office to epo] | Ref country code: FR Payment date: 20230929 Year of fee payment: 12 |
PGFP | Annual fee paid to national office [announced via postgrant information from national office to epo] | Ref country code: GB Payment date: 20231006 Year of fee payment: 12 |
PGFP | Annual fee paid to national office [announced via postgrant information from national office to epo] | Ref country code: DE Payment date: 20230929 Year of fee payment: 12 |