WO2013030134A1 - Method and apparatus for acoustic source separation - Google Patents

Method and apparatus for acoustic source separation

Info

Publication number
WO2013030134A1
Authority
WO
WIPO (PCT)
Prior art keywords
training
segment
data
audio
training data
Prior art date
Application number
PCT/EP2012/066549
Other languages
French (fr)
Inventor
Ji Ming
Ramji Srinivasan
Daniel CROOKES
Original Assignee
The Queen's University Of Belfast
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by The Queen's University Of Belfast
Publication of WO2013030134A1

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00 Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/02 Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L21/0208 Noise filtering
    • G10L2021/02087 Noise filtering the noise being separate speech, e.g. cocktail party

Definitions

  • the analyzer A13 produces, as an output, a respective set of best-matching training segments for each audio source. Each such set of training segments may be said to form a composition for that source.
  • each training segment is selected from one or more of the training audio signals 34 of the training data 32.
  • the respective set of training segments for each audio source comprises a sequence of training segments taken from at least one, and typically only one, acoustic class associated with the respective audio source.
  • for each audio source the analyzer produces a respective sequence of training data segments taken from one or more of the training audio signals of the acoustic class associated with the respective audio source, a respective training data segment being selected for each frame of the SCM by determining the best matching segment composition for that frame.
  • the training segments 62 for a given audio source are aligned temporally, conveniently by frame. As can be seen from Figure 6 each training segment 62 is aligned according to its start time, conveniently the start time (as indicated by the vertical lines 60) of the frame for which it is selected.
  • the training segments 62 may be of different lengths and may, for example, extend across more than one frame.
  • one or more of the training segments 62 may overlap with one another when aligned temporally, in this case by frame.
  • the training segments 62 may therefore be said to overlap temporally when aligned.
  • a respective portion of one or more of the training segments 62 may be aligned with each frame of the TSSF_SCM (as represented by the vertical slices between lines 60 in Figure 6).
  • the output from the analyzer A13 is fed to the reconstruction module A14 to estimate the individual audio component sources in the SCM.
  • Figure 6 shows the preferred reconstruction process schematically.
  • the best matching training segments 62 produced by the analyzer A13 are first aligned with one another, conveniently by starting frame. For each frame of the TSSF_SCM, the respective spectral feature(s) of the portion of the, or each, training segment 62 that is aligned with said frame are combined to produce corresponding composite spectral feature(s) for the corresponding frame of the separated audio signal (this is represented in Figure 6 by the merging module A50).
  • the composite spectral feature(s) are produced by averaging the, or each, training segment spectral feature(s) across each frame, although other combining functions may be used. This results in a respective set of composite spectral feature(s) for each frame of the separated signal (an illustrative code sketch of this merging and reconstruction step follows this list).
  • the spectral features typically comprise data representing the frequency component(s) of the segment portion, e.g. data identifying the frequency component(s) and the respective magnitude of those component(s).
  • an estimate of each separated source frame can be obtained by taking all the matching training segments that contain a corresponding training frame and averaging over those training frames (a way of merging the matching segments) A50; in this way each component frame is estimated by averaging.
  • DFT magnitudes of the training frames may be used as the magnitude spectra when reconstructing the time-domain component signals, in combination with the phase spectrum of the SCM stored during pre-processing.
  • the separator A10 may be implemented in hardware and/or computer program code as is convenient.
  • any one or more of the pre-processing A11, modeling A12, CLOSE A13 and reconstruction A14 may be implemented in hardware and/or computer program code as is convenient or best suits a given application.
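Purely by way of illustration, the following NumPy sketch shows how the merging step A50 and the stored SCM phase might be combined to rebuild a separated time-domain signal. It assumes that each selected training segment is available as a matrix of frame-aligned log-magnitude spectra together with the SCM frame index at which it starts; these representations and names are assumptions made for the example, not part of the disclosure.

```python
import numpy as np

def merge_aligned_segments(aligned_segments, n_frames, n_bins):
    """Merging step A50 (illustrative): for every frame of the SCM, average
    the log-magnitude spectra of all training-segment portions aligned with
    that frame. `aligned_segments` is a list of (start_frame, log_mag) pairs,
    where log_mag has shape (segment_frames, n_bins)."""
    acc = np.zeros((n_frames, n_bins))
    count = np.zeros(n_frames)
    for start, seg in aligned_segments:
        for k, frame_feat in enumerate(seg):
            idx = start + k
            if idx < n_frames:
                acc[idx] += frame_feat
                count[idx] += 1
    count[count == 0] = 1                       # frames with no aligned segment
    return acc / count[:, None]                 # composite log magnitudes

def reconstruct_source(composite_log_mag, phase_scm, frame_len, hop_len):
    """Combine the composite magnitude spectra with the phase spectra stored
    from the SCM and overlap-add the inverse DFT frames (no synthesis window,
    for simplicity) into a time-domain estimate of the separated source."""
    out = np.zeros(hop_len * (len(composite_log_mag) - 1) + frame_len)
    for i, (log_mag, phase) in enumerate(zip(composite_log_mag, phase_scm)):
        spectrum = np.exp(log_mag) * np.exp(1j * phase)
        frame = np.fft.irfft(spectrum, n=frame_len)
        out[i * hop_len: i * hop_len + frame_len] += frame
    return out
```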

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Quality & Reliability (AREA)
  • Signal Processing (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Stereophonic System (AREA)

Abstract

A method of separating a plurality of source audio signals from a mixed signal. The method comprises: selecting respective training data for each source audio signal; for respective segments of data representing the mixed signal, determining which combination of a respective segment from each of the selected training data matches the respective segment of the data representing said mixed signal; and reconstructing each source audio signal using the segments of the respective selected training data that form a respective matching combination for each segment of said data representing the mixed signal. The method facilitates separating component speech signals from a single channel mixture of multiple speech signals.

Description

Method and Apparatus for Acoustic Source Separation
Field of the Invention
The present invention relates to the separation of component acoustic sources from a single channel mixture of acoustic sources. The invention relates particularly to the separation of simultaneous multi-talker speech from a single channel recording.
Background to the Invention
In real-world scenarios, speech rarely occurs in isolation and is usually accompanied by a background of many other acoustic sources. One very common scenario is multi-talker speech in which two or more speakers speak
simultaneously. A single microphone recording of this multi-talker speech will result in a mixed speech signal, and is generally termed single-channel mixed speech. Separating the component speech signals from a single channel mixture is a challenge.
Current state-of-the-art approaches addressing the problem of single-channel speech separation can be broadly grouped into three major categories: basis-function based decomposition, computational auditory scene analysis (CASA), and model-based approaches. In basis-function based decomposition, a set of bases is used to represent the short-time spectra of each component speech.
Separation is usually performed on a frame-by-frame basis, by finding a linear combination of the component basis sets that matches a given mixed speech frame. However, a frame of speech is generally very short, ranging from 10 ms to 30 ms. The change in sound properties over frames (i.e., over time) is referred to as "temporal dynamics". Analysis over a small number of frames gives short-term temporal dynamics, while analysis over a larger number of frames gives long-term temporal dynamics. The long-term temporal dynamics of a speech utterance are among the most important characteristics that distinguish the utterance from noise, including other speakers' utterances. Separation of speech on a frame-by-frame basis without considering the temporal dynamics is not effective. CASA is another category of algorithms, which segments
psychoacoustic cues (e.g., pitch, harmonic structures, modulation correlation, etc.) of the mixed speech into different component sources, and performs separation by masking the interfering sources. This approach can only capture short-term speech temporal dynamics and hence suffers from ambiguity in correctly classifying the psychoacoustic cues.
In model-based approaches, each component speech signal is first represented by a statistical model using training samples from the component speaker. The mixed speech signal can then be represented by a model by combining the statistical models of the component speech signals. Model-based
algorithms can be divided into two groups: those assuming independence between speech frames, and those attempting to model variable levels of temporal dynamics. Techniques capable of capturing long-term temporal dynamics of speech demonstrate good performance for separation. However, in current techniques for speech separation, modeling long-term temporal dynamics (for example, on the subword, word or sentence level) requires knowledge of the task (vocabulary and grammar) and transcription of the training data. This limits both the applicability and the accuracy of current techniques for everyday speech processing.
Separating speech without this knowledge remains an open research question.
It would be desirable therefore to provide improved techniques for acoustic source separation, and in particular for the separation of speech sources.
Summary of the Invention
A first aspect of the invention provides a method of separating a plurality of source audio signals from a mixed signal comprising said source audio signals, said method comprising: selecting respective training data for each source audio signal; for respective segments of data representing said mixed signal, determining which combination of a respective segment from each of said selected training data matches said respective segment of said data representing said mixed signal; and reconstructing each source audio signal using the segments of the respective selected training data that form a respective matching combination for each segment of said data representing said mixed signal.
A second aspect of the invention provides an apparatus for separating a plurality of source audio signals from a mixed signal comprising said source audio signals, said apparatus comprising: training data selecting means configured to select respective training data for each source audio signal; a segment analyzer configured to, for respective segments of data representing said mixed signal, determine which combination of a respective segment from each of said selected training data matches said respective segment of said data representing said mixed signal; and reconstructing means configured to reconstruct each source audio signal using the segments of the respective selected training data that form a respective matching combination for each segment of said data representing said mixed signal.
A third aspect of the invention provides an audio signal separation system comprising the apparatus of the second aspect of the invention and an acousto-electric transducer for creating a said single channel mixture of audio signals from a plurality of simultaneous acoustic signals.
A fourth aspect of the invention provides a computer program product comprising computer usable code for causing a computer to perform the method of the first aspect of the invention. Preferred features are recited in the dependent claims.
Preferred embodiments of the invention provide a method for separating component acoustic sources from a single channel mixture of said component acoustic sources, with particular application to single-channel speech separation.
Preferred embodiments are referred to herein as CLOSE (Composition of LOngest SEgments). In the context of speech separation, given a single-channel mixed speech signal and training data for one or more relevant component speaker classes, the preferred CLOSE method finds the longest segment compositions between the mixed speech signal and the training data for performing separation. This maximizes the extraction of the temporal dynamics for separation, without requiring knowledge of the task vocabulary, grammar and transcribed training data. Two different possible ways to realize the CLOSE method, CLOSE-1 and CLOSE-2, are disclosed hereinafter by way of illustration. In CLOSE-1, all the acoustic sources are separated at the same time, while in CLOSE-2 one source is separated at a time. In terms of computation speed, CLOSE-2 is faster than CLOSE-1.
Brief Description of the Drawings
Embodiments of the invention are now described by way of example and with reference to the accompanying drawings in which:
Figure 1 is a block diagram illustrating a system for separating multiple acoustic sources, in particular speech;
Figure 2 is a block diagram illustrating a preferred acoustic source separation process embodying the invention;
Figure 3 is a schematic illustration of said preferred acoustic source separation process in the context of two-talker speech separation;
Figure 4 presents an algorithm suitable for implementing said preferred acoustic source separation process in the context of two-talker speech separation (CLOSE-1);
Figure 5 presents an alternative algorithm for implementing said preferred acoustic source separation process in the context of two-talker speech separation (CLOSE-2); and
Figure 6 is a schematic illustration of a preferred reconstruction process suitable for use with embodiments of the invention.
Detailed Description of the Drawings
Referring now to Figure 1 of the drawings, there is shown, generally indicated as 10, a digital signal processing system in the form of an acoustic source separation system, and in particular a speech signal separation system. The system 10 receives acoustic signals 12 from multiple (K) acoustic sources S1,...,SK, which in this example are assumed to be human speakers although they could be any source of speech signals. The speech signals 12 are received by a microphone 14 (or other acousto-electric transducer or audio signal recording device) to produce a single channel mixture (SCM) of the speech signals (or more particularly of an electronic representation of the acoustic speech signals as produced by the microphone). The single channel mixture is fed into a separator A10, which separates the component acoustic sources to produce acoustic output signals O1,...,OK. The separation is performed on the electronic representation of the mixed signals and reproduced as acoustic signals by speakers 16, or other suitable transducers. Typically the SCM comprises digital audio data organized in frames. The output signals of the separator A10 typically also comprise digital audio data organized in frames.
The separation process performed by separator A10 is a multistage process which is illustrated in Figure 2. First, the SCM is subjected to a pre-processing stage A11. In this stage, each frame of the input data undergoes extraction of its spectral features (which may be referred to as spectral or frequency transformation). This results in a time sequence of spectral features (TSSF). Spectral features typically comprise data representing the frequency component(s) of the signal, e.g. data identifying the frequency component(s) and the respective magnitude and/or phase of those component(s). For the SCM the TSSF is denoted by TSSF_SCM. By way of example, in a specific implementation of the pre-processing stage A11 for speech separation, the SCM may be first divided into short-time frames of, for example, 20 ms, with a frame period of, for example, 10 ms (or otherwise sampled). Each frame may then be represented in the form of log spectral magnitudes. The invention is not limited to any specific spectral feature set or time-domain to frequency-domain transformation. Some suitable known methods for representing the spectral features include: squared amplitude; cepstrum; Mel-frequency cepstral coefficients (MFCC); log magnitude spectrum; and linear predictive coding (LPC).
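By way of illustration only, the following Python/NumPy sketch shows one possible realization of this pre-processing stage under the example settings above (20 ms frames, 10 ms frame period, log spectral magnitudes). The function name, the Hann window and the FFT-based implementation are assumptions made for the example rather than requirements of the method; the phase spectra are also kept, since, as described next, they are stored for the reconstruction stage A14.

```python
import numpy as np

def preprocess_scm(signal, sample_rate, frame_ms=20, hop_ms=10):
    """Pre-processing stage A11 (illustrative): frame the single channel
    mixture and return its time sequence of spectral features (TSSF_SCM,
    here log spectral magnitudes) together with the per-frame phase spectra
    kept for the later reconstruction stage A14."""
    frame_len = int(sample_rate * frame_ms / 1000)
    hop_len = int(sample_rate * hop_ms / 1000)
    window = np.hanning(frame_len)          # assumed analysis window
    n_frames = 1 + max(0, (len(signal) - frame_len) // hop_len)

    log_mag, phase = [], []
    for i in range(n_frames):
        frame = signal[i * hop_len: i * hop_len + frame_len] * window
        spectrum = np.fft.rfft(frame)                       # Fourier transform
        log_mag.append(np.log(np.abs(spectrum) + 1e-10))    # log spectral magnitudes
        phase.append(np.angle(spectrum))                    # phase, stored for A14
    return np.array(log_mag), np.array(phase)

# Example: one second of (synthetic) mixed audio sampled at 16 kHz.
tssf_scm, phase_scm = preprocess_scm(np.random.randn(16000), 16000)
```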
Further, the phase information of the SCM is also extracted and stored for later use in the source reconstruction stage A14. This is conveniently achieved as part of the spectral transformation of the SCM. Typically, a Fourier transform is used to produce the respective amplitude and phase spectra. The phase spectrum obtained for the test speech (SCM) is used during the reconstruction process to produce the separated component speech signals from the mixed speech signals. The separator A10 is operable with training data, which may be received from an external source or may be stored by the system 10 in any suitable storage device 20 accessible by the separator A10. The training data comprises audio data signals, or derivatives thereof, against which the SCM can be compared.
Typically, the training data comprises, or is derived from, digital audio data signals, preferably organized in frames.
The training data is organized into a plurality of acoustic source classes T1,...,TN. Each acoustic source class comprises one or more respective training audio data signals, or data derived therefrom, from one or more acoustic sources.
Advantageously, signals from acoustic sources regarded as being sufficiently similar to one another are included in the same acoustic source class (for example, a single speaker; a collection of speakers of the same gender; a collection of speakers for a particular ethnic group; a particular musical instrument; a particular music genre, etc.). In typical embodiments, the acoustic source(s) for the training data are not the same as the acoustic sources of the signals that make up the SCM, i.e. the separator A10 does not require training data from the actual acoustic sources whose mixed signals it is separating. The training data for each acoustic source class are subjected to a modelling process A12, to yield, for each acoustic class, a respective model for the time sequences of spectral features for the respective training data: TSSF-M1,...,TSSF-MN. Typically, the training data comprises multiple audio training signals (also referred to as training utterances or training data components) for each class. In the preferred embodiment, the respective audio training signals (or more particularly a frequency representation thereof) for each class are modeled collectively (per class) to provide a single model for each class. The model for each class comprises a respective model component for each of the respective audio training signals. Typically, the modelling process A12 comprises a combination of statistical modelling and data-driven modelling. The following description of the preferred modeling process A12 is set in the context of speech separation. However, those skilled in the art will recognize that the present invention can be employed for separating any kind of simultaneous acoustic sources from a single channel mixture containing multiple simultaneous acoustic sources.
In typical embodiments, the modeler A12 applies a mixture model, preferably a Gaussian Mixture Model (GMM), to the training data of each class. Alternatively, other models, especially probabilistic models, may be used. The preferred modeling process involves subjecting the training utterances for each class to spectral (frequency) transformation (typically producing a respective spectral feature vector for each frame), and applying mixture modeling to all of the resulting spectral data to produce a respective model (in this case a GMM) for each acoustic training class. This typically involves fitting a set of multi-dimensional Gaussian, or other, functions to the spectral data for each class to produce the mixture model. This is the aforementioned statistical modeling. Then data-driven modeling is used to produce a respective model component (utterance model) for each of the training utterances of the class. Let x_λ = {x_λ,t : t = 1,2,...,T_λ} represent a training utterance 34 for speaker class λ, where T_λ is the number of frames in the utterance and x_λ,t is the spectral feature vector of the frame at time t. Denote by G_λ the Gaussian mixture model (GMM) for speaker class λ, of M_λ Gaussian components, trained using all the training utterances x_λ. This can be expressed as

$$G_\lambda(x) = \sum_{m=1}^{M_\lambda} w_\lambda(m)\, g_\lambda(x \mid m) \qquad (1)$$

where g_λ(x | m) is the m-th Gaussian component and w_λ(m) is the corresponding weight, for speaker class λ. By way of example, each speaker class GMM may contain 512 Gaussian components with diagonal covariance matrices. Next, based on G_λ, a data-driven model for each training utterance x_λ may be built by taking each frame from x_λ and finding the Gaussian component in G_λ that produces maximum likelihood for the frame, i.e. identifying the Gaussian component that is the most likely to match the spectral features of the frame. Thus, x_λ can be alternatively represented by a corresponding time sequence of Gaussian components {g_λ(x | m_λ,t) : t = 1,2,...,T_λ} (or other mixture model components), where m_λ,t is the index of the Gaussian component producing maximum likelihood for the frame x_λ,t. This Gaussian sequence representation of x_λ can thus be fully characterized by the corresponding index sequence m_λ,

$$m_\lambda = \{m_{\lambda,t} : t = 1,2,...,T_\lambda\} \qquad (2)$$

Equation (2) is called an utterance model, which can be considered as part of the TSSF-M_λ of the training data of speaker class λ. In the modeling stage A12, an utterance model m_λ is created for each training utterance x_λ of each speaker class λ. All the training utterance models m_λ for speaker class λ together form the TSSF-M_λ for the speaker class.
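As an illustration of this two-step modelling (statistical modelling followed by data-driven modelling), the sketch below fits a diagonal-covariance GMM per speaker class and then builds the index sequence of equation (2) for an utterance. scikit-learn is assumed to be available, and the helper names are illustrative rather than part of the disclosure.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def train_class_gmm(training_utterances, n_components=512):
    """Statistical modelling: fit one diagonal-covariance GMM (equation (1))
    to the pooled spectral feature vectors of all training utterances of
    one speaker class."""
    all_frames = np.vstack(training_utterances)     # each utterance: (frames, dims)
    gmm = GaussianMixture(n_components=n_components, covariance_type="diag")
    return gmm.fit(all_frames)

def utterance_model(gmm, utterance):
    """Data-driven modelling, equation (2): represent an utterance by the
    sequence of indices of the Gaussian component giving maximum likelihood
    for each frame (mixture weights are ignored, per the description above)."""
    diff = utterance[:, None, :] - gmm.means_[None, :, :]
    log_det = np.sum(np.log(gmm.covariances_), axis=1)
    n_dims = gmm.means_.shape[1]
    log_like = -0.5 * (np.sum(diff ** 2 / gmm.covariances_[None, :, :], axis=2)
                       + log_det + n_dims * np.log(2.0 * np.pi))
    return np.argmax(log_like, axis=1)              # m_lambda = {m_lambda,t}
```

A 512-component GMM, as in the example above, requires a substantial amount of training data; smaller values can be used for quick experiments.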
Given the SCM, containing the mixture of K simultaneous signals from acoustic sources S1,...,SK, the TSSF-M1,...,TSSF-MK required for the source separation are extracted from the TSSF-M1,...,TSSF-MN. Extraction can be manual or by means of algorithms. In the absence of knowledge of the acoustic source classes in the SCM, any source identification and clustering algorithms known in the art can be used to identify TSSF-M1,...,TSSF-MK. For example, known speaker clustering methods such as those disclosed in United States Patents US5983178 and US5598507 may be used.
The single channel mixture TSSF_SCM and the training data TSSF-M1,...,TSSF-MK are then analyzed in order to identify the longest segment compositions (of the training data), and the corresponding matching component segments (of the TSSF_SCM), for use in separation. In Figure 2, this analysis is performed by analyzer A13, which is labeled CLOSE (Composition of LOngest SEgments). As is described in more detail hereinafter, the identified matching component segments are used by a reconstruction process A14, along with the phase of the SCM, to reconstruct the component acoustic sources O1,...,OK.
There follows a description of a preferred embodiment of the analyzer A13, set in the context of speech separation from an SCM containing audio signals generated from simultaneous speech from two human speakers. It will be understood, however, that the invention may be employed for separating signals from any kind of simultaneous acoustic sources from a single channel mixture containing signals from the multiple simultaneous acoustic sources. This is achievable provided training data for each relevant acoustic source class is available. The preferred method for two-talker speech separation is illustrated in Figure 3 and, in the preferred embodiment, is implemented by the analyzer A13. A single-channel mixed speech signal 30 (which may also be referred to as the test utterance) is composed of two simultaneous speech utterances (audio signals) from two speakers. Training data 32 comprising a respective plurality of training utterances 34 for speaker classes λ and γ is available. It is assumed that the speaker classes λ and γ are relevant to the speakers and have been selected in the manner described above.
Figure 3 illustrates how a test segment y_t:τ of the test utterance 30 is subjected to signal separation by finding respective segments m_λ,i:j and m_γ,u:v from the training utterances 34 of speaker classes λ and γ such that the selected training segments, when combined, match (or substantially match, i.e. are deemed to constitute an acceptable match) the test segment. Preferably, matching in this context involves finding the training segments that, when combined, give the highest probability (or maximum likelihood) of representing the test segment. This is referred to as segment composition.
In preferred embodiments, each speaker is associated with a respective speaker class. Hence, N speaker classes are used for N speakers. However, the same speaker class can be used for more than one speaker and so N different speaker classes need not necessarily be used. In the general case, if there are N speakers, N respective training utterances are combined during segment composition, each training utterance belonging to the class of the respective speaker.
The preferred CLOSE method of A13 maximizes the length of the segment composition as a means of minimizing the uncertainty of the component segments in the composition and hence the error of separation. Advantageously, it is the length of the test segment y_t:τ that is maximized, and this in turn maximizes the lengths of the composite training segments m_λ,i:j and m_γ,u:v that match y_t:τ (see equation (8) hereinafter). In preferred embodiments, a probabilistic approach is used as one possible way of finding the matching score for the longest segment composition. However, the method described for finding the matching score should not be considered exclusive.
By way of example, two different embodiments for determining longest segment compositions are now described, being referred to as CLOSE-1 and CLOSE-2. These are described in the context of two-talker speech separation for illustrative purposes.
CLOSE-1 is described first, set in the context of speech separation from an SCM containing simultaneous speech from two speakers. Consider y = {y_t : t = 1,2,...,T} as a test utterance with T frames, composed of two simultaneous speech utterances spoken by two different speakers, and let λ and γ denote the corresponding speaker classes selected from the training data. Let y_t:τ = {y_ε : ε = t, t+1,...,τ} represent a test segment taken from the test utterance y and consisting of consecutive frames from time t to τ. Let m_λ,i:j = {m_λ,t : t = i, i+1,...,j} represent a training segment taken from the model m_λ and modelling consecutive frames from i to j in the training utterance x_λ of speaker class λ. Similarly, let m_γ,u:v = {m_γ,t : t = u, u+1,...,v} represent a training segment taken from the model m_γ and modelling consecutive frames from u to v in the training utterance x_γ of speaker class γ. Given y_t:τ, two matching training component segments m_λ,i:j and m_γ,u:v are identified by using the posterior probability P(m_λ,i:j, m_γ,u:v | y_t:τ). In this context the posterior probability is the (new) probability given a prior probability and some test evidence, in this case a test segment, and the probability is a measure of the likelihood that a combination of the training component segments matches the test segment.
Assume an equal prior probability P for all possible component segments (s_λ, s_γ) from the two speaker classes. This posterior probability can be expressed as

$$P(m_{\lambda,i:j}, m_{\gamma,u:v} \mid y_{t:\tau}) = \frac{p(y_{t:\tau} \mid m_{\lambda,i:j}, m_{\gamma,u:v})\,P}{\sum_{s_\lambda}\sum_{s_\gamma} p(y_{t:\tau} \mid s_\lambda, s_\gamma)\,P + p(y_{t:\tau} \mid \phi)\,P(\phi)} \qquad (3)$$

where p(y_t:τ | m_λ,i:j, m_γ,u:v) is the likelihood that the given test segment y_t:τ is matched by combining the two training component segments m_λ,i:j and m_γ,u:v. Assuming, for convenience, independence between the frames within a segment, this segmental likelihood function can be written as

$$p(y_{t:\tau} \mid m_{\lambda,i:j}, m_{\gamma,u:v}) = \prod_{\varepsilon=t}^{\tau} p(y_\varepsilon \mid m_{\lambda,\varsigma(\varepsilon)}, m_{\gamma,\eta(\varepsilon)}) \qquad (4)$$

where p(y_ε | m_λ,ς(ε), m_γ,η(ε)) is the likelihood that the given test frame y_ε is matched by combining the two training component frames m_λ,ς(ε) and m_γ,η(ε). In (4), ς(ε) and η(ε) represent time warping functions between the test segment y_t:τ and the two training segments m_λ,i:j and m_γ,u:v in forming the match. The use of time warping functions is optional and allows for variation in the rate of speech.
In equation (3), the denominator is expressed as the sum of two terms. The first term is the average likelihood that the given test segment y_t:τ is matched by combining two training speech segments; this likelihood is calculated over all possible training segment combinations between the two speaker classes. The second term, denoted by p(y_t:τ | φ), represents the average likelihood that the test segment y_t:τ is matched by two speech segments which, either or both, are not seen in the training data. This likelihood associated with unseen component segments can be expressed by using a mixture model, allowing for arbitrary, temporally independent combinations of the training frames to simulate arbitrary unseen speech segments. Combining the two speaker classes' GMMs [i.e., equation (1)], the following expression can be used:

$$p(y_{t:\tau} \mid \phi) = \prod_{\varepsilon=t}^{\tau} \left[ \sum_{m_\lambda=1}^{M_\lambda} \sum_{m_\gamma=1}^{M_\gamma} w_\lambda(m_\lambda)\, w_\gamma(m_\gamma)\, p(y_\varepsilon \mid m_\lambda, m_\gamma) \right] \qquad (5)$$

The sums inside the brackets form a mixture likelihood for the test frame y_ε, taking into consideration all possible training frame combinations between the two speaker classes; equation (5) assumes independence between consecutive frames, to simulate arbitrary component/test segments. In other words, if the segmental temporal dynamics are regarded as "text" dependence, then equation (4) gives a "text-dependent" likelihood of the test segment, dependent on the temporal dynamics of both training component segments, while (5) gives a "text-independent" likelihood of the test segment. In this context, "text-independence" may be regarded as matching single frames while "text-dependence" may be regarded as matching multi-frame segments. Test segments with mismatching training component segments will result in low "text-dependent" likelihoods [i.e., (4)] but not necessarily low "text-independent" likelihoods [i.e., (5)], and hence low posterior probabilities of match [i.e., (3)].
For matching composition of the test segment y_t:τ with the training component segments m_λ,i:j and m_γ,u:v, it can be assumed that the "text-dependent" likelihood is greater than the "text-independent" likelihood, i.e., p(y_t:τ | m_λ,i:j, m_γ,u:v) ≥ p(y_t:τ | φ). This is because

$$p(y_{t:\tau} \mid \phi) \approx \prod_{\varepsilon=t}^{\tau} \max_{m_\lambda, m_\gamma} w_\lambda(m_\lambda)\, w_\gamma(m_\gamma)\, p(y_\varepsilon \mid m_\lambda, m_\gamma) \approx \prod_{\varepsilon=t}^{\tau} w_\lambda(m_{\lambda,\varsigma(\varepsilon)})\, w_\gamma(m_{\gamma,\eta(\varepsilon)})\, p(y_\varepsilon \mid m_{\lambda,\varsigma(\varepsilon)}, m_{\gamma,\eta(\varepsilon)}) \le \prod_{\varepsilon=t}^{\tau} p(y_\varepsilon \mid m_{\lambda,\varsigma(\varepsilon)}, m_{\gamma,\eta(\varepsilon)}) \qquad (6)$$

The second approximation is based on the assumption that the matching, and hence highly likely, test-training composition dominates the mixture likelihood. Therefore, with (3) and (5), a larger posterior probability can be obtained for composition between matching test/training segments, and a smaller posterior probability for composition between mismatching test/training segments.
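The following sketch shows, in simplified form, how the posterior of equation (3) could be evaluated in the log domain using the segmental likelihood (4) (without time warping) and the unseen-segment term (5). It assumes equal priors that cancel, a hypothetical frame-level likelihood function frame_loglik(y_frame, m_x, m_y) (one concrete choice, the log-max model of equations (12) and (13), is sketched further below), and trained class GMMs as produced earlier; none of these names come from the patent itself.

```python
import numpy as np
from scipy.special import logsumexp

def segment_loglik(test_seg, seg_x, seg_y, frame_loglik):
    """Equation (4) without time warping: log likelihood of composing two
    equal-length training segments to match the test segment."""
    return sum(frame_loglik(y, mx, my)
               for y, mx, my in zip(test_seg, seg_x, seg_y))

def phi_loglik(test_seg, gmm_x, gmm_y, frame_loglik):
    """Equation (5): 'text-independent' log likelihood, mixing over all
    Gaussian-component (training frame) combinations of the two classes."""
    total = 0.0
    for y in test_seg:
        terms = [np.log(wx) + np.log(wy) + frame_loglik(y, mx, my)
                 for mx, wx in enumerate(gmm_x.weights_)
                 for my, wy in enumerate(gmm_y.weights_)]
        total += logsumexp(terms)
    return total

def composition_log_posterior(test_seg, seg_x, seg_y, candidate_pairs,
                              gmm_x, gmm_y, frame_loglik):
    """Equation (3), assuming equal priors: posterior of one candidate
    composition against all candidate compositions plus the
    unseen-segment term p(y | phi)."""
    num = segment_loglik(test_seg, seg_x, seg_y, frame_loglik)
    den = [segment_loglik(test_seg, sx, sy, frame_loglik)
           for sx, sy in candidate_pairs]
    den.append(phi_loglik(test_seg, gmm_x, gmm_y, frame_loglik))
    return num - logsumexp(den)
```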
The posterior probability formulation (3) has another advantageous characteristic: it favours the continuity of match and produces larger probabilities for composition between longer matching segments. Assume that the test segment y_t:τ and the two training component segments m_λ,i:j and m_γ,u:v are matching, in the sense that the segmental likelihoods of composition satisfy p(y_t:τ | m_λ,i:j, m_γ,u:v) ≥ p(y_t:τ | m'_λ,i':j', m'_γ,u':v') for any (m'_λ,i':j', m'_γ,u':v') ≠ (m_λ,i:j, m_γ,u:v), and p(y_t:τ | m_λ,i:j, m_γ,u:v) ≥ p(y_t:τ | φ). Then the inequality concerning the posterior probabilities for compositions between matching segments with different lengths can be expressed as:

$$P(m_{\lambda,i:\varsigma(\varepsilon)}, m_{\gamma,u:\eta(\varepsilon)} \mid y_{t:\varepsilon}) \le P(m_{\lambda,i:j}, m_{\gamma,u:v} \mid y_{t:\tau}) \qquad (7)$$

where y_t:ε, with ε ≤ τ, is a test segment starting at the same time as y_t:τ but not lasting as long, and m_λ,i:ς(ε) and m_γ,u:η(ε) are the corresponding training component subsegments matching the shorter test segment y_t:ε. The inequality indicates that larger posterior probabilities are obtained when longer matching segments are being composed. At each frame time t of the test utterance y = {y_t : t = 1,2,...,T}, a longest segment composition can be found, denoted by the test segment y_t:τ_max starting at t and the corresponding matching training component segments m^max_λ,i:j and m^max_γ,u:v, by maximizing the posterior probability. That is

$$(\tau_{\max},\, m^{\max}_{\lambda,i:j},\, m^{\max}_{\gamma,u:v}) = \arg\max_{\tau}\, \max_{m_{\lambda,i:j}}\, \max_{m_{\gamma,u:v}} P(m_{\lambda,i:j}, m_{\gamma,u:v} \mid y_{t:\tau}) \qquad (8)$$

That is, at time t, the longest segment composition may be obtained by first finding the most-likely training component segments for each fixed-length test segment y_t:τ, and then finding the maximum test-segment length (i.e., τ_max) that results in the maximum posterior probability.
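Continuing the previous sketch, a brute-force illustration of the search in equation (8) follows: for a given start time t it scans test-segment lengths and all same-length training-segment pairs, and keeps the composition with the highest posterior. A practical implementation would prune the candidate segments and reuse per-frame likelihoods; the names are again illustrative, and composition_log_posterior is the function defined above.

```python
import numpy as np

def close1_longest_composition(test_utt, t, max_len, segments_x, segments_y,
                               gmm_x, gmm_y, frame_loglik):
    """Equation (8), exhaustive version: jointly maximize over the test
    segment length and the pair of training component segments."""
    best = (None, None, None, -np.inf)     # (tau, seg_x, seg_y, log posterior)
    for tau in range(t + 1, min(t + max_len, len(test_utt)) + 1):
        test_seg = test_utt[t:tau]
        length = tau - t
        # Candidate training segments of matching length from each class.
        cands_x = [s for s in segments_x if len(s) == length]
        cands_y = [s for s in segments_y if len(s) == length]
        pairs = [(sx, sy) for sx in cands_x for sy in cands_y]
        for sx, sy in pairs:
            lp = composition_log_posterior(test_seg, sx, sy, pairs,
                                           gmm_x, gmm_y, frame_loglik)
            if lp > best[3]:
                best = (tau, sx, sy, lp)
    return best
```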
CLOSE-1 for two-talker speech separation is outlined as an algorithm in Figure 4. In CLOSE-1, both training component frames/segments for each test frame/segment are constrained temporally by their corresponding longest matching training component segments. To find the two matching training component segments for a test segment, this system needs to search N_λ × N_γ possible combinations, where N_λ and N_γ represent the number of training segments from speaker class λ and speaker class γ, respectively. The second embodiment, CLOSE-2, is now described, set in the context of speech separation from an SCM containing simultaneous speech from two speakers.
Reconsider the segmental likelihood function (4) of a test segment y_t:τ, now associated with a temporally constrained training component segment m_λ,i:j from speaker class λ, and a temporally unconstrained training component segment from speaker class γ. Denote the unconstrained component segment as m*_γ,t:τ. This likelihood function can be expressed as

$$p(y_{t:\tau} \mid m_{\lambda,i:j}, m^{*}_{\gamma,t:\tau}) = \prod_{\varepsilon=t}^{\tau} \max_{m_\gamma} p(y_\varepsilon \mid m_{\lambda,\varsigma(\varepsilon)}, m_\gamma) \qquad (9)$$

The unconstrained component segment is formed by choosing the frames freely from the training data (mapped to GMM G_γ), to maximize the likelihood with the constrained component segment. Thus, m*_γ,t:τ = (m_γ,t, m_γ,t+1,...,m_γ,τ), where each m_γ,ε = arg max_{1 ≤ m_γ ≤ M_γ} p(y_ε | m_λ,ς(ε), m_γ). Equation (9) gives a "text-dependent" likelihood of the test segment, dependent on the temporal dynamics of the training component segment m_λ,i:j. Substituting p(y_t:τ | m_λ,i:j, m*_γ,t:τ) into (3), we can obtain the posterior probability expressed as

$$P(m_{\lambda,i:j}, m^{*}_{\gamma,t:\tau} \mid y_{t:\tau}) = \frac{p(y_{t:\tau} \mid m_{\lambda,i:j}, m^{*}_{\gamma,t:\tau})\,P}{\sum_{s_\lambda} p(y_{t:\tau} \mid s_\lambda, s^{*}_{\gamma,t:\tau})\,P + p(y_{t:\tau} \mid \phi)\,P(\phi)} \qquad (10)$$

Equation (10) is only a function of the temporally constrained training segment m_λ,i:j. Similar to (8), the longest segment compositions can be located between the test segments and the temporally constrained training segments, by maximizing the posterior probabilities. At each frame time t, the longest matching test segment y_t:τ_max and the corresponding temporally constrained training segment m^max_λ,i:j are obtained by first finding the most-likely training segment for each fixed-length test segment y_t:τ, and then finding the maximum test-segment length (i.e., τ_max) that results in the maximum posterior probability:

$$(\tau_{\max},\, m^{\max}_{\lambda,i:j}) = \arg\max_{\tau}\, \max_{m_{\lambda,i:j}} P(m_{\lambda,i:j}, m^{*}_{\gamma,t:\tau} \mid y_{t:\tau}) \qquad (11)$$
Equation (11) shows the estimation of the temporally constrained component segments for speaker class λ. By switching the temporal constraint from speaker class λ to speaker class γ, the same system can be used to identify the temporally constrained component segments for speaker class γ. Figure 5 outlines CLOSE-2 for two-talker speech separation. As described above, each time, only one of the matching training segments forming the segment composition is subjected to temporal constraints. CLOSE-2 has a search complexity of about N_λ + N_γ possible combinations, which can be significantly less than the N_λ × N_γ required for CLOSE-1, for large numbers of training segments N_λ and N_γ.
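Continuing the illustrative sketch above (same assumptions, same hypothetical frame_ll), the CLOSE-2 variant can be coded so that only the class-λ segment is searched as a whole, while the class-γ contribution is re-chosen frame by frame from a pool of unconstrained candidate frames (for example, the mean vectors of the class-γ GMM components). Running the same routine a second time with the roles of the two classes swapped yields the temporally constrained segments for class γ.

```python
import numpy as np

def close2_constrained_match(test_frames, segs_lam, gam_frames, frame_ll,
                             max_len=8):
    """Sketch of a CLOSE-2 style search with the temporal constraint on
    speaker class lambda; gam_frames is an (M_gamma, F) pool of unconstrained
    candidate frames for class gamma."""
    T = len(test_frames)
    results = []
    for t in range(T):
        best = (1, 0, -np.inf)
        for tau in range(1, min(max_len, T - t) + 1):
            scores = np.empty(len(segs_lam))
            for i, seg_a in enumerate(segs_lam):
                # equation (9) style: the gamma frame is chosen freely at
                # every frame time to maximise the frame likelihood
                scores[i] = sum(max(frame_ll(test_frames[t + e], seg_a[e], g)
                                    for g in gam_frames)
                                for e in range(tau))
            log_post = scores.max() - np.logaddexp.reduce(scores)
            if log_post >= best[2]:          # prefer the longest composition
                best = (tau, int(scores.argmax()), log_post)
        results.append(best)
    return results
```

Only one list of whole training segments is searched per pass, which is the source of the roughly N_λ + N_γ cost quoted above when both passes are counted.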
In the above approach, (4), (5), (9), the likelihood of a test frame associated with two training component frames, p(y_t | m_λ, m_γ), is calculated (i.e. the likelihood that a combination of the training frames matches the test frame), where m_λ and m_γ each correspond to a Gaussian component in the appropriate speaker class's GMM, i.e., g_λ(x | m_λ) and g_γ(x | m_γ), which model the probability distributions of the training component frames. Given the probability distributions of the training component frames, and given the assumption that the test frame y_t is an additive mixture of the training component frames, several methods (for example, log-max, Algonquin, lifted max, or parallel model combination) can be used to derive the likelihood for the test frame, i.e. the likelihood that a combination of the training frames matches the test frame.
Conveniently, a simple method, the log-max model, may be used to obtain this likelihood. However, any algorithm known in the art can be used to calculate the likelihood p(y_t | m_λ, m_γ). For each frame, its log power spectrum is calculated as the spectral feature. Assume that the log power spectrum of y_t can be expressed in F distinct frequency channels, i.e., y_t = {y_{t,f} : f = 1, 2, ..., F}, where y_{t,f} is the log power of the f-th channel. Then p(y_t | m_λ, m_γ) can be expressed as

p(y_t | m_λ, m_γ) = ∏_{f=1}^{F} p(y_{t,f} | m_λ, m_γ) (12)

where p(y_{t,f} | m_λ, m_γ) is the likelihood of the log power of the f-th channel. For simplicity, in (12) independence between the frequency channels is assumed. Let x_{λ,f} and x_{γ,f} represent the log powers of the same channel of the two component frames, subject to probability distributions g_λ(x | m_λ) and g_γ(x | m_γ). Assume y_{t,f} = max(x_{λ,f}, x_{γ,f}). Thus, p(y_{t,f} | m_λ, m_γ) can be written as

p(y_{t,f} | m_λ, m_γ) = g_λ(y_{t,f} | m_λ) P_γ(y_{t,f} | m_γ) + g_γ(y_{t,f} | m_γ) P_λ(y_{t,f} | m_λ) (13)

where P_γ(y_{t,f} | m_γ) = ∫_{-∞}^{y_{t,f}} g_γ(x_f | m_γ) dx_f, and likewise for P_λ(y_{t,f} | m_λ).
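As a concrete illustration of (12) and (13), the sketch below evaluates the log-max frame likelihood for two diagonal-covariance Gaussian components described by per-channel means and variances. The function name and the use of SciPy are illustrative choices, not part of the embodiment; in practice the product would usually be accumulated in the log domain to avoid underflow.

```python
import numpy as np
from scipy.stats import norm

def logmax_frame_likelihood(y, mu_lam, var_lam, mu_gam, var_gam):
    """Log-max model: p(y_t | m_lambda, m_gamma) per equations (12)-(13).

    y           : (F,) log power spectrum of the test frame
    mu_*, var_* : (F,) per-channel means/variances of the two diagonal
                  Gaussian components (one per speaker class)
    """
    sd_lam, sd_gam = np.sqrt(var_lam), np.sqrt(var_gam)
    g_lam = norm.pdf(y, mu_lam, sd_lam)      # channel densities g(y_f | m)
    g_gam = norm.pdf(y, mu_gam, sd_gam)
    P_lam = norm.cdf(y, mu_lam, sd_lam)      # cumulative terms P(y_f | m)
    P_gam = norm.cdf(y, mu_gam, sd_gam)
    per_channel = g_lam * P_gam + g_gam * P_lam   # equation (13)
    return float(np.prod(per_channel))            # equation (12)
```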
Once the respective matching training segments are identified, they may be regarded as comprising, or being representative of, the separated speech signals. In this example, two training segments are identified (one for each speaker class), and these segments may be considered as constituting the separated speech signals (or more particularly as constituting data from which the respective audio source signal can be reconstructed).
For the purposes of separation, component speaker classes/frames are modelled with gains different from the training data. Rewrite the component-frame Gaussians as g_λ(x | m_λ, a_λ) and g_γ(x | m_γ, a_γ), where a_λ and a_γ are the gain updates (in dB) for speaker class λ and speaker class γ, respectively, and

g_λ(x | m_λ, a_λ) = N(x; μ_{m_λ} + a_λ, Σ_{m_λ})

and likewise for g_γ(x | m_γ, a_γ), where μ_m and Σ_m are the training-data based mean vector and covariance matrix of the appropriate Gaussian. For any given test utterance, the gain updates a_λ and a_γ are calculated at the frame level on a frame-by-frame basis, by maximizing the test frame likelihood p(y_t | m_λ, m_γ) against a set of predefined gain update values for each component frame. The gain-optimized test frame likelihood can be expressed as

p(y_t | m_λ, m_γ) = max_{a_λ ∈ ĝ_λ, a_γ ∈ ĝ_γ} ∏_{f=1}^{F} p(y_{t,f} | m_λ, m_γ, a_λ, a_γ) (14)

where ĝ_λ and ĝ_γ are the predefined gain-update value sets for speaker classes λ and γ, and p(y_{t,f} | m_λ, m_γ, a_λ, a_γ) is the local channel likelihood (13) with each component Gaussian including a corresponding gain update. It is to be noted
again that the above-described probabilistic approach (including gain
identification) is only one of the approaches that can be used for realising the
CLOSE method.
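To illustrate the gain search of (14), the sketch below maximises a caller-supplied frame likelihood (for example the logmax_frame_likelihood sketch above) over a small predefined grid of gain updates applied to the Gaussian means. The grid values, and the assumption that a gain update can be applied as a simple additive offset to the means of the log-power-domain Gaussians, are illustrative and not taken from the embodiment.

```python
import itertools
import numpy as np

def gain_optimised_likelihood(y, mu_lam, var_lam, mu_gam, var_gam,
                              frame_likelihood,
                              gains_lam=(-6.0, -3.0, 0.0, 3.0, 6.0),
                              gains_gam=(-6.0, -3.0, 0.0, 3.0, 6.0)):
    """Equation (14) style search: maximise the frame likelihood over
    predefined gain-update values applied to the component Gaussian means."""
    best_like, best_gains = -np.inf, (0.0, 0.0)
    for a_lam, a_gam in itertools.product(gains_lam, gains_gam):
        like = frame_likelihood(y, mu_lam + a_lam, var_lam,
                                mu_gam + a_gam, var_gam)
        if like > best_like:
            best_like, best_gains = like, (a_lam, a_gam)
    return best_like, best_gains
```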
When separation is complete, the analyzer A13 produces, as an output, a
respective plurality, or set, of selected training segments for each audio source contributing to the SCM. Each set of training segments may be said to form a sequence in that it comprises a respective training segment for successive segments, and in particular frames, of the SCM. This is illustrated schematically in Figure 6, where successive frames of the SCM (and in particular the TSSFSCM) are represented by vertical lines 60. For each frame, there is a training segment 62 selected by the analyzer A13 as the best matching segment starting from that frame.
As described above, each training segment is selected from one or more of the training audio signals 34 of the training data 32. In preferred embodiments, the respective set of training segments for each audio source comprises a sequence of training segments taken from at least one, and typically only one, acoustic class associated with the respective audio source. A respective training segment is
typically selected (in the "best matching" manner described above) for each
frame, or other segment, of the SCM (test utterance). Hence, in the preferred
embodiment, for each audio source the analyzer produces a respective sequence of training data segments taken from one or more of the training audio signals of the acoustic class associated with the respective audio source, a respective training data segment being selected for each frame of the SCM by determining the best matching segment composition for that frame. In Figure 6, the training segments 62 for a given audio source are aligned temporally, conveniently by frame. As can be seen from Figure 6, each training segment 62 is aligned according to its start time, conveniently the start time (as indicated by the vertical lines 60) of the frame for which it is selected. The training segments 62 may be of different lengths and may, for example, extend across more than one frame. As a result, one or more of the training segments 62 may overlap with one another when aligned temporally, in this case by frame. The training segments 62 may therefore be said to overlap temporally when aligned. Accordingly, a respective portion of one or more of the training segments 62 may be aligned with each frame of the TSSFSCM (as represented by the vertical slices between lines 60 in Figure 6).
The output from the analyzer A13 is fed to the reconstruction module A14 to estimate the individual audio component sources in the SCM. Figure 6 shows the preferred reconstruction process schematically. The best matching training segments 62 produced by the analyzer A13 are first aligned with one another, conveniently by starting frame. For each frame of the TSSFSCM, the respective spectral feature(s) of the portion of the, or each, training segment 62 that is aligned with said frame are combined to produce corresponding composite spectral feature(s) for the corresponding frame of the separated audio signal (this is represented in Figure 6 by the merging module A50). Conveniently, the composite spectral feature(s) are produced by averaging the, or each, training segment spectral feature(s) across each frame, although other combining functions may be used. This results in a respective set of composite spectral feature(s) for each frame of the separated signal. The spectral features typically comprise data representing the frequency component(s) of the segment portion, e.g. data identifying the frequency component(s) and the respective magnitude of those component(s).
In order to reconstruct the separated audio signal in the time domain, frequency to time domain conversion is required (post-processing module A51). Conversion from the composite spectral features to a corresponding time domain audio signal conveniently uses the previously obtained phase information in the inverse spectral transformation corresponding to the spectral transformation used in the pre-processing module A11.
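A minimal sketch of this post-processing step is shown below. It assumes the pre-processing module computed a short-time Fourier transform with a Hann analysis window of win_len samples and a hop of hop samples, and that the composite magnitude spectra are aligned with the mixture's STFT frames; the window, hop, and function name are illustrative assumptions, not values taken from the embodiment.

```python
import numpy as np

def reconstruct_time_domain(est_magnitude, mixture_stft, hop=160, win_len=320):
    """Rebuild one separated waveform by combining estimated magnitude
    spectra with the phase of the single-channel mixture and overlap-adding.

    est_magnitude : (T, F) estimated magnitude spectrum of one source
    mixture_stft  : (T, F) complex STFT of the mixture (same window/hop)
    """
    phase = np.angle(mixture_stft)
    frames = np.fft.irfft(est_magnitude * np.exp(1j * phase), n=win_len, axis=1)
    window = np.hanning(win_len)
    out = np.zeros(hop * (len(frames) - 1) + win_len)
    norm = np.zeros_like(out)
    for t, frame in enumerate(frames):
        out[t * hop:t * hop + win_len] += frame * window     # synthesis window
        norm[t * hop:t * hop + win_len] += window ** 2
    return out / np.maximum(norm, 1e-8)                      # WOLA normalisation
```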
The following description provides the details of a specific implementation of the reconstruction process, set in the context of speech separation for rebuilding component utterances (from the audio sources contributing to the SCM) based on the longest matching segments found at all the frame times corresponding to the source. Given test utterance y = {y_t : t = 1, 2, ..., T}, after finding the matching segments m'_{λ,t:τ_max} and m'_{γ,t:τ_max} at all t [i.e. (8) or (11)], m'_{λ,t:τ_max} and m'_{γ,t:τ_max} are used respectively to form estimates of the two component speech utterances forming the test utterance. In the following, the algorithm that uses m'_{λ,t:τ_max} to estimate the component utterance from speaker class λ is described. The same algorithm can be used to estimate the component utterance from the other speaker class, by replacing m'_{λ,t:τ_max} with m'_{γ,t:τ_max}.
Let s_{λ,ε} represent the component frame of the component utterance of speaker class λ at time ε, ε = 1, 2, ..., T, and S_{λ,ε} be the magnitude spectrum of the frame. An estimate of S_{λ,ε} can be obtained by taking all the matching training segments that contain s_{λ,ε} and averaging over the corresponding training frames (a way of merging the matching segments) A50. The following equation can be used:

Ŝ_{λ,ε} = (1/P) Σ_{y_{t:τ_max} ∋ s_{λ,ε}} A(m'_{λ,t}(ε)) exp(a'_λ(ε)) P(m'_{λ,t:τ_max}, m*'_{γ,t:τ_max} | y_{t:τ_max}) (15)

where the sum is over all test segments y_{t:τ_max} that contain component frame s_{λ,ε}; m'_{λ,t}(ε) is the training frame and a'_λ(ε) is the associated gain [obtained using (14)] corresponding to s_{λ,ε}, taken from the longest matching training segment m'_{λ,t:τ_max}; A(m'_{λ,t}(ε)) represents a magnitude spectrum associated with training frame m'_{λ,t}(ε). As shown in (15), each component frame is estimated through identification of a longest matching component segment, and each estimate is smoothed over successive longest matching component segments. This improves both the accuracy of frame estimation and the robustness to imperfect segment matches.
Frames within the same segment share a common confidence score, which is the posterior probability of the segment. In (15), P is a normalization term. In typical embodiments, the following expression is suitable:

P = Σ_{y_{t:τ_max} ∋ s_{λ,ε}} P(m'_{λ,t:τ_max}, m*'_{γ,t:τ_max} | y_{t:τ_max}) if this sum is at least 1, and P = 1 if the sum is less than 1 (16)

The last condition prevents small posterior probabilities being scaled up to give a false emphasis. If CLOSE-1 is used, the posterior probabilities in (15) and (16) should be replaced by P(m'_{λ,t:τ_max}, m'_{γ,t:τ_max} | y_{t:τ_max}), as defined in (8). The merging process A50 for each component source results in the formation of the TSSF of the source, which is then passed to a post-processing module A51 for reconstructing the source signal, as shown in Figure 6. By way of example, the
DFT magnitudes of the training frames may be used as the magnitude spectra A(m'_{λ,t}(ε)), and the phase of the SCM may be used to form the estimate of the source signal.
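For illustration, the sketch below implements the merging step of (15) and (16) for one source, assuming that the analyzer has already produced, for every frame time, the start position and length of the longest matching training segment, the gain-scaled magnitude spectra of its frames, and its posterior probability; the data layout and names are assumptions made for this example only.

```python
import numpy as np

def merge_matching_segments(T, matches, seg_mag, seg_post):
    """Posterior-weighted merging in the spirit of equations (15)-(16).

    T        : number of frames in the test utterance
    matches  : list of (t_start, tau) -- start frame and length of the longest
               matching training segment found at each frame time
    seg_mag  : list of (tau, F) arrays -- gain-scaled magnitude spectra of the
               frames of each matching training segment
    seg_post : list of floats -- posterior probability of each segment
    Returns the (T, F) merged magnitude spectrogram of the source.
    """
    F = seg_mag[0].shape[1]
    est = np.zeros((T, F))
    weight = np.zeros(T)
    for (t0, tau), mag, post in zip(matches, seg_mag, seg_post):
        for e in range(min(tau, T - t0)):
            est[t0 + e] += post * mag[e]   # numerator of (15)
            weight[t0 + e] += post         # accumulating the sum in (16)
    # equation (16): never divide by a total posterior smaller than one,
    # so small confidences are not scaled up
    return est / np.maximum(weight, 1.0)[:, None]
```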
However, the configuration mentioned here should not be considered exclusive.
One skilled in the art would recognize other possible configurations for the post-processing stage to estimate the component source signal. The separator A10 may be implemented in hardware and/or computer program code as is convenient. In particular, any one or more of the pre-processing A11, modeling A12, CLOSE A13 and reconstruction A14 modules may be implemented in hardware and/or computer program code as is convenient or best suits a given application.
The invention is not limited to the embodiments described herein which may be modified or varied without departing from the scope of the invention.

Claims

CLAIMS:
1. A method of separating a plurality of source audio signals from a mixed signal comprising said source audio signals in an audio signal processing apparatus, said method comprising: selecting respective training data for each source audio signal; for respective segments of data representing said mixed signal, determining which combination of a respective segment from each of said selected training data matches said respective segment of said data representing said mixed signal; and reconstructing each source audio signal using the segments of the respective selected training data that form a respective matching combination for each segment of said data representing said mixed signal.
2. A method as claimed in claim 1, further including organizing said training data into a plurality of classes; and associating each source audio signal with one of said classes, wherein said selecting training data involves selecting training data from the class associated with the respective audio source signal.
3. A method as claimed in claim 2, wherein each class comprises at least one training data component, each training data component comprising, or being derived from, a respective audio training signal.
4. A method as claimed in any preceding claim, wherein said training data comprises, or is derived from, audio signals generated by one or more sources other than the sources of said audio source signals.
5. A method as claimed in any preceding claim, further including transforming said mixed signal to produce a time sequence of spectral data representing said mixed signal in the frequency domain, and wherein said segments of said mixed signal comprise segments of said transformed mixed signal.
6. A method as claimed in any preceding claim, further including modeling a plurality of audio training signals, or derivatives thereof, to produce said training data.
7. A method as claimed in claim 6, wherein said training data comprises a plurality of training data components each training data component comprising, or being derived from, a respective audio training signal, and wherein said modeling involves producing a respective model component for use as each training data component.
8. A method as claimed in claim 6 or 7, further including statistically modeling the respective audio training signals, or derivatives thereof, to produce at least one statistical model.
9. A method as claimed in claim 8 when dependent on claim 2, further including statistically modeling the respective audio training signals, or derivatives thereof, of each class to produce a respective statistical model for each class.
10. A method as claimed in claim 8 or 9, wherein said statistical model comprises a probabilistic model, preferably a mixture model, and most preferably a Gaussian mixture model.
11. A method as claimed in any one of claims 8 to 10 when dependent on claim 7, further including producing respective model components for use as said training data components by fitting said at least one statistical model to the respective audio training signals, or derivatives thereof.
12. A method as claimed in claim 11 when dependent on claim 9, further including, for each class, producing respective model components for use as the training data components of the respective class by fitting the respective statistical model for the class to the respective audio training signals, or derivatives thereof.
13. A method as claimed in claim 11 or 12, wherein said fitting involves finding the component of the respective statistical model that is statistically most likely to match the respective audio training signal, or derivative thereof.
14. A method as claimed in any one of claims 6 to 13, further including transforming said audio training signals to produce a respective time sequence of spectral data representing the respective audio training signal in the frequency domain, and modeling said spectral data to produce said training data.
15. A method as claimed in claim 3, wherein selecting respective training data for each source audio signal involves selecting at least some and preferably all of the training data components of the respective class.
16. A method as claimed in any preceding claim, wherein said training data comprises a plurality of training data components each training data component being derived from a respective audio training signal, and wherein each of said segments of training data comprises a segment of any one of said training data components.
17. A method as claimed in any preceding claim, wherein determining which combination of a respective segment from each of said selected training data matches said respective segment of said data representing said mixed signal involves for each segment of said data representing said mixed signal:
creating a plurality of composite training segments, each composite training segment comprising a different combination of training data segments, one from each of the respective selected training data; calculating, for each composite training segment, a measure of the similarity of said composite training segment to said respective mixed signal data segment;
selecting the composite training segment with the highest similarity measure as a match for said respective mixed signal data segment.
18. A method as claimed in claim 17, further including determining, for each selected composite training segment, the maximum length of said respective mixed signal data segment by increasing the length of said respective mixed signal data segment while matching its constituent training segments, up to the respective maximum length.
19. A method as claimed in claim 17 or 18, wherein calculating said measure of similarity involves calculating a probability that the respective test segment is matched by the respective composite training segment.
20. A method as claimed in any one of claims 17 to 19, wherein calculating said measure of similarity involves calculating a posterior probability of said composite training segment being a match given said respective mixed signal data segment.
21. A method as claimed in claim 20 when dependent on claim 18, wherein determining the maximum length of one or more of said constituent training data segments involves determining the maximum length of said mixed signal data segment that maximizes said posterior probability.
22. A method as claimed in any preceding claim, wherein said reconstructing involves, for each audio source signal, temporally aligning the segments of the respective selected training data that form a respective matching combination for each mixed signal data segment; combining the temporally aligned segments to produce a time sequence of data components representing said audio source signal.
23. A method as claimed in claim 22, wherein said temporally aligned segments are combined by applying an averaging or smoothing function.
24. A method as claimed in claim 22 or 23, wherein the beginning of each segment is aligned with a respective frame of said mixed signal data.
25. A method as claimed in claim 24, wherein said combining of the temporally aligned segments involves combining respective portions of one or more of said segments that are aligned with respective frames of said mixed signal data.
26. A method as claimed in any preceding claim, wherein said training data comprises spectral data.
27. A method as claimed in claim 26 when dependent on any one of claims 22 to 25, wherein said time sequence of data components comprises a time sequence of spectral features, and wherein said reconstruction of said respective audio source signal involves performing an inverse frequency transform on said time sequence of spectral features.
28. A method as claimed in any preceding claim, further including adjusting the gain of said segments of the respective selected training data that form a respective matching combination for each segment of said data representing said mixed signal based on a set of one or more gain update values.
29. A method as claimed in claim 28, wherein said gain update values are derived from said mixed signal data.
30. A method as claimed in any preceding claim, wherein said mixed signal comprises a single channel mixture of simultaneous source audio signals from respective audio sources.
31. A method as claimed in claim 30, wherein said single channel mixture is created by an acousto-electric transducer from a plurality of simultaneous acoustic signals.
32. A method as claimed in any preceding claim, performed in a digital signal processing system, especially an audio signal separation system.
33. An apparatus for separating a plurality of source audio signals from a mixed signal comprising said source audio signals, said apparatus comprising: training data selecting means configured to select respective training data for each source audio signal; a segment analyzer configured to, for respective segments of data representing said mixed signal, determine which combination of a respective segment from each of said selected training data matches said respective segment of said data representing said mixed signal; and reconstructing means configured to reconstruct each source audio signal using the segments of the respective selected training data that form a respective matching combination for each segment of said data representing said mixed signal.
34. An audio signal separation system comprising an apparatus as claimed in claim 33 and an acousto-electric transducer for creating a said single channel mixture of audio signals from a plurality of simultaneous acoustic signals.
35. An audio signal separation system as claimed in claim 34, further including a respective electro-acoustic transducer for rendering each separated audio signal.
36. A computer program product comprising computer usable code for causing a computer to perform the method of any one of claims 1 to 32.
PCT/EP2012/066549 2011-08-26 2012-08-24 Method and apparatus for acoustic source separation WO2013030134A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
GB1114737.8 2011-08-26
GBGB1114737.8A GB201114737D0 (en) 2011-08-26 2011-08-26 Method and apparatus for acoustic source separation

Publications (1)

Publication Number Publication Date
WO2013030134A1 true WO2013030134A1 (en) 2013-03-07

Family

ID=44838736

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/EP2012/066549 WO2013030134A1 (en) 2011-08-26 2012-08-24 Method and apparatus for acoustic source separation

Country Status (2)

Country Link
GB (1) GB201114737D0 (en)
WO (1) WO2013030134A1 (en)

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP2940687A1 (en) * 2014-04-30 2015-11-04 Accusonus, Inc. Methods and systems for processing and mixing signals using signal decomposition
US9584940B2 (en) 2014-03-13 2017-02-28 Accusonus, Inc. Wireless exchange of data between devices in live events
US9812150B2 (en) 2013-08-28 2017-11-07 Accusonus, Inc. Methods and systems for improved signal decomposition
CN110544482A (en) * 2019-09-09 2019-12-06 极限元(杭州)智能科技股份有限公司 single-channel voice separation system
US10667069B2 (en) 2016-08-31 2020-05-26 Dolby Laboratories Licensing Corporation Source separation for reverberant environment
CN111798866A (en) * 2020-07-13 2020-10-20 商汤集团有限公司 Method and device for training audio processing network and reconstructing stereo
US10839309B2 (en) * 2015-06-04 2020-11-17 Accusonus, Inc. Data training in multi-sensor setups

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111370019B (en) * 2020-03-02 2023-08-29 字节跳动有限公司 Sound source separation method and device, and neural network model training method and device

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5598507A (en) 1994-04-12 1997-01-28 Xerox Corporation Method of speaker clustering for unknown speakers in conversational audio data
US5983178A (en) 1997-12-10 1999-11-09 Atr Interpreting Telecommunications Research Laboratories Speaker clustering apparatus based on feature quantities of vocal-tract configuration and speech recognition apparatus therewith
DE102007030209A1 (en) * 2007-06-27 2009-01-08 Siemens Audiologische Technik Gmbh smoothing process

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
G-J JANG ET AL: "Single-channel signal separation using time-domain basis functions", IEEE SIGNAL PROCESSING LETTERS, IEEE SERVICE CENTER, PISCATAWAY, NJ, US, vol. 10, no. 6, 1 June 2003 (2003-06-01), pages 168 - 171, XP011433600, ISSN: 1070-9908, DOI: 10.1109/LSP.2003.811630 *
JI MING ET AL: "A Corpus-Based Approach to Speech Enhancement From Nonstationary Noise", IEEE TRANSACTIONS ON AUDIO, SPEECH AND LANGUAGE PROCESSING, IEEE SERVICE CENTER, NEW YORK, NY, USA, vol. 19, no. 4, 1 May 2011 (2011-05-01), pages 822 - 836, XP011352002, ISSN: 1558-7916, DOI: 10.1109/TASL.2010.2064312 *

Cited By (16)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11238881B2 (en) 2013-08-28 2022-02-01 Accusonus, Inc. Weight matrix initialization method to improve signal decomposition
US9812150B2 (en) 2013-08-28 2017-11-07 Accusonus, Inc. Methods and systems for improved signal decomposition
US10366705B2 (en) 2013-08-28 2019-07-30 Accusonus, Inc. Method and system of signal decomposition using extended time-frequency transformations
US11581005B2 (en) 2013-08-28 2023-02-14 Meta Platforms Technologies, Llc Methods and systems for improved signal decomposition
US9918174B2 (en) 2014-03-13 2018-03-13 Accusonus, Inc. Wireless exchange of data between devices in live events
US9584940B2 (en) 2014-03-13 2017-02-28 Accusonus, Inc. Wireless exchange of data between devices in live events
EP2940687A1 (en) * 2014-04-30 2015-11-04 Accusonus, Inc. Methods and systems for processing and mixing signals using signal decomposition
US11610593B2 (en) 2014-04-30 2023-03-21 Meta Platforms Technologies, Llc Methods and systems for processing and mixing signals using signal decomposition
US10468036B2 (en) 2014-04-30 2019-11-05 Accusonus, Inc. Methods and systems for processing and mixing signals using signal decomposition
US10839309B2 (en) * 2015-06-04 2020-11-17 Accusonus, Inc. Data training in multi-sensor setups
US10667069B2 (en) 2016-08-31 2020-05-26 Dolby Laboratories Licensing Corporation Source separation for reverberant environment
US10904688B2 (en) 2016-08-31 2021-01-26 Dolby Laboratories Licensing Corporation Source separation for reverberant environment
CN110544482A (en) * 2019-09-09 2019-12-06 极限元(杭州)智能科技股份有限公司 single-channel voice separation system
CN110544482B (en) * 2019-09-09 2021-11-12 北京中科智极科技有限公司 Single-channel voice separation system
CN111798866A (en) * 2020-07-13 2020-10-20 商汤集团有限公司 Method and device for training audio processing network and reconstructing stereo
CN111798866B (en) * 2020-07-13 2024-07-19 商汤集团有限公司 Training and stereo reconstruction method and device for audio processing network

Also Published As

Publication number Publication date
GB201114737D0 (en) 2011-10-12

Similar Documents

Publication Publication Date Title
EP3719798B1 (en) Voiceprint recognition method and device based on memorability bottleneck feature
US9892731B2 (en) Methods for speech enhancement and speech recognition using neural networks
US9536525B2 (en) Speaker indexing device and speaker indexing method
WO2013030134A1 (en) Method and apparatus for acoustic source separation
JP6437581B2 (en) Speaker-adaptive speech recognition
Hori et al. The MERL/SRI system for the 3rd CHiME challenge using beamforming, robust feature extraction, and advanced speech recognition
JP7342915B2 (en) Audio processing device, audio processing method, and program
Friedland et al. The ICSI RT-09 speaker diarization system
Aggarwal et al. Performance evaluation of sequentially combined heterogeneous feature streams for Hindi speech recognition system
Madikeri et al. Integrating online i-vector extractor with information bottleneck based speaker diarization system
Seo et al. A maximum a posterior-based reconstruction approach to speech bandwidth expansion in noise
Ismail et al. Mfcc-vq approach for qalqalahtajweed rule checking
US11929058B2 (en) Systems and methods for adapting human speaker embeddings in speech synthesis
Shahnawazuddin et al. Enhancing noise and pitch robustness of children's ASR
Devi et al. Automatic speech emotion and speaker recognition based on hybrid gmm and ffbnn
Mandel et al. Audio super-resolution using concatenative resynthesis
KR101023211B1 (en) Microphone array based speech recognition system and target speech extraction method of the system
Barai et al. An ASR system using MFCC and VQ/GMM with emphasis on environmental dependency
Liu et al. Robust speech enhancement techniques for ASR in non-stationary noise and dynamic environments.
Wu et al. Speaker-invariant feature-mapping for distant speech recognition via adversarial teacher-student learning
Chaudhari et al. Effect of varying MFCC filters for speaker recognition
Hurmalainen Robust speech recognition with spectrogram factorisation
Yang et al. A preliminary study of emotion recognition employing adaptive Gaussian mixture models with the maximum a posteriori principle
Müller et al. Noise robust speaker-independent speech recognition with invariant-integration features using power-bias subtraction
Padmini et al. Identification of correlation between blood relations using speech signal

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 12762233

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 12762233

Country of ref document: EP

Kind code of ref document: A1