EP2543035A1

EP2543035A1 - Method for determining fundamental-frequency courses of a plurality of signal sources

Info

Publication number: EP2543035A1
Application number: EP11708975A
Authority: EP
Inventors: Michael Wohlmayr; Michael Stark; Franz Pernkopf
Original assignee: Technische Universitaet Graz
Current assignee: Technische Universitaet Graz
Priority date: 2010-03-01
Filing date: 2011-02-22
Publication date: 2013-01-09
Anticipated expiration: 2031-02-22
Also published as: EP2543035B1; US20130151245A1; WO2011106809A1; AT509512B1; AT509512A1

Abstract

The invention relates to a method for determining fundamental-frequency courses of a plurality of signal sources from a one-channel audio recording of a mixed signal, comprising the following steps: a) determining the spectrogram properties of the pitch states of individual signal sources using training data; b) determining the probabilities of the fundamental-frequency combinations of the signal sources contained in the mixed signal by combining the properties determined in a) by means of an interaction model; and c) tracking the fundamental-frequency courses of the individual signal sources.

Description

METHOD FOR DETERMINING FUNDAMENTAL FREQUENCY COURSES OF MULTIPLE SIGNAL SOURCES

The invention relates to a method for determining fundamental frequency profiles of a plurality of signal sources from a single-channel audio recording of a mixed signal.

Methods for tracking or separating single-channel speech signals by means of the perceived fundamental frequency (the English term "pitch" is used below to mean the perceived fundamental frequency) are used in a number of algorithms and applications in speech and audio processing, for example in single-channel blind source separation (SCSS) (D. Morgan et al., "Cochannel Speaker Separation by Harmonic Enhancement and Suppression", IEEE Transactions on Speech and Audio Processing, Vol. 5, pp. 407-424, 1997), Computational Auditory Scene Analysis (CASA) (DeLiang Wang, "On Ideal Binary Mask As the Computational Goal of Auditory Scene Analysis", in P. Divenyi [Ed.], Speech Separation by Humans and Machines, Kluwer Academic, 2004) and speech compression (R. Salami et al., "A toll quality 8 kb/s speech codec for the personal communications system (PCS)", IEEE Transactions on Vehicular Technology, Vol. 43, pp. 808-816, 1994). Typical applications of such methods are, for example, conference situations in which several voices are sometimes audible at once during a talk, which causes the recognition rate of automatic speech recognition to drop sharply. An application in hearing aids is also possible.

The fundamental frequency is a fundamental quantity in the analysis, recognition, coding, compression and representation of speech. Speech signals can be described as a superposition of sinusoidal oscillations. For voiced sounds such as vowels, the frequency of each of these oscillations is either the fundamental frequency or a multiple of it, the so-called harmonics. Speech signals can therefore be assigned to specific signal sources by identifying the fundamental frequency of the signal.

While a number of proven methods for estimating and tracking the fundamental frequency are already in use for a single speaker in low-noise recordings, problems remain when processing low-quality (i.e. noisy) recordings of several people talking at the same time. Mingyang Wu et al. propose a solution for robust multiple fundamental frequency tracking in multi-speaker recordings in "A Multipitch Tracking Algorithm for Noisy Speech" (IEEE Transactions on Speech and Audio Processing, Volume 11, Issue 3, pp. 229-241, May 2003). The solution is based on the unitary model of fundamental frequency perception, for which various improvements are proposed in order to obtain a probabilistic representation of the periodicities of the signal. Tracking the probabilities of the periodicities by means of a Hidden Markov Model (HMM) yields semi-continuous fundamental frequency courses. The disadvantages of this solution are, firstly, the high computational effort and the computing resources it requires and, secondly, the fact that the fundamental frequencies cannot be properly assigned to the corresponding signal sources or speakers. The reason is that no speaker-specific information is included or available in this system that would allow measured pitch values to be linked to speakers.

It is therefore an object of the invention to provide a method for multiple fundamental frequency tracking, which allows a reliable assignment of the determined fundamental frequencies to signal sources or speakers and at the same time has a low storage and computational intensity.

This object is achieved by a method of the type mentioned according to the invention by the following steps:

a) determining the spectrogram properties of the pitch states of individual signal sources using training data;

b) determining the probabilities of the possible fundamental frequency combinations of the signal sources contained in the mixed signal by combining the properties determined in a) by means of an interaction model;

c) tracking the fundamental frequency courses of the individual signal sources.

Thanks to the invention, multiple fundamental frequencies can be tracked with high accuracy, and fundamental frequency courses can be better assigned to the respective signal sources or speakers. The training phase a), which uses speaker-specific information, and the choice of a suitable interaction model in b) significantly reduce the computational effort, so that the method can be performed quickly and with few resources. What is trained is not mixed spectra with the respective individual speaker components (in the simplest case, two speakers and a corresponding fundamental frequency pair) but the individual speaker components themselves, which minimizes the computational effort and the number of training phases to be carried out. Since the pitch states of each signal source are taken from a bounded frequency range (e.g. 80 to 500 Hz), combining the states in step b) results in a limited number of fundamental frequency combinations, which are referred to as the "possible" fundamental frequency combinations. In the following, the term spectrum stands for the magnitude spectrum; depending on the choice of the interaction model in b), the short-term magnitude spectrum or the short-term logarithmic magnitude spectrum (log spectrum) is used.

The number of pitch states to be trained results from the observed frequency range and its subdivision (see below). For voice recordings, such a frequency range is 80 to 500 Hz, for example.

From speech models of individual speakers, a probability model of all pitch combinations possible in the abovementioned frequency range can be obtained for a desired speaker pair (that is, for example, for a recording on which two speakers can be heard) with the aid of the interaction model used in b). Assuming two speakers with A states each, this means that an A x A matrix with the probabilities of all possible combinations is determined. For the individual speakers, speech models that describe a multiplicity of speakers can also be used, for example models based on gender-specific characteristics (speaker-independent or gender-dependent models).

A number of algorithms can be used for the tracking in c). For example, the temporal sequence of the estimated pitch values can be modeled by a Hidden Markov Model (HMM) or by a Factorial Hidden Markov Model (FHMM), and the Max-Sum algorithm, the Junction Tree algorithm or the Sum-Product algorithm can be applied to these graphical models. In a variant of the invention, it is also possible to consider and evaluate the pitch values estimated on isolated time windows independently, without applying one of the above-mentioned tracking algorithms.

For the description of the spectrogram properties a general, parametric or nonparametric statistical model can be used. Favorably, in a) the spectrogram properties are determined by means of a Gaussian Mixture Model (GMM).

Advantageously, the number of components of a GMM is determined by applying the Minimum Description Length (MDL) criterion. The MDL criterion is used to select a model from a set of possible models. In the present case, for example, the models differ only in the number of Gaussian components used. As an alternative to the MDL criterion, the Akaike Information Criterion (AIC) can be used, for example.
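The model selection described above can be illustrated by a minimal sketch. It assumes the common two-part form of the MDL score, namely the negative maximized log-likelihood plus half the number of free parameters times the logarithm of the number of samples. The function names, the parameter count 3M-1 (which holds for a univariate mixture) and the log-likelihood values are hypothetical and chosen only for illustration.

```python
import math


def mdl_score(log_likelihood, num_params, num_samples):
    """Two-part MDL score: data cost plus a penalty for model complexity."""
    return -log_likelihood + 0.5 * num_params * math.log(num_samples)


def select_num_components(log_likelihoods, num_samples, params_per_model):
    """Return the candidate component count with the lowest MDL score.

    log_likelihoods: dict mapping a component count M to the maximized
                     log-likelihood of a GMM with M components.
    params_per_model: function M -> number of free parameters (assumption).
    """
    scores = {
        m: mdl_score(ll, params_per_model(m), num_samples)
        for m, ll in log_likelihoods.items()
    }
    return min(scores, key=scores.get), scores


# Made-up log-likelihoods of GMMs with 1 to 4 components on 1000 samples.
example_ll = {1: -2500.0, 2: -2300.0, 3: -2295.0, 4: -2293.0}
best_m, scores = select_num_components(example_ll, 1000, lambda m: 3 * m - 1)
print(best_m, scores)
```

In this made-up example the gain in likelihood beyond two components no longer outweighs the complexity penalty, so two components are selected.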

In b) the interaction model is a linear model or the mixture-maximization (MixMax) interaction model or the ALGONQUIN interaction model.

Favorably, the tracking in c) takes place by means of the Factorial Hidden Markov Model (FHMM).

A number of algorithms can be used to carry out the tracking on an FHMM; in variants of the invention, for example, the sum-product algorithm or the max-sum algorithm is used.

In the following, the invention is explained in more detail with reference to a non-limiting embodiment, which is illustrated in the drawing. The drawing shows schematically:

Fig. 1 shows a factor graph of the fundamental-frequency-dependent generation of the (log) spectrum y of a mixed signal resulting from two individual speaker (log) spectra,

Fig. 2 is an illustration of the FHMM, and

Fig. 3 is a block diagram of the method according to the invention.

The invention relates to a simple and efficient method for modeling and tracking the fundamental frequencies of a plurality of simultaneously emitting signal sources, for example speakers in a conference or meeting situation. In the following, the method according to the invention is presented for two speakers for the sake of clarity; however, it can be applied to any number of speakers. The speech signals are single-channel, i.e. recorded with only one recording means, e.g. a microphone.

The short-term spectrum of a speech signal given a fundamental frequency of speech can be described using probability distributions such as the Gaussian normal distribution. A single normal distribution, given by the parameters mean μ and variance σ², is usually not sufficient. For the modeling of general, complex probability distributions, one usually uses mixture distributions such as the Gaussian Mixture Model (GMM). A GMM is composed additively of several individual Gaussian normal distributions. A mixture of M (univariate) Gaussian distributions can be described by 3M-1 parameters: mean, variance and weighting factor for each of the M Gaussian components (the weighting factor of the M-th Gaussian component is redundant, hence the "-1"). For the modeling of observed data points by a GMM, a special case of the Expectation Maximization (EM) algorithm is often used, as described below.

The course of the pitch states of a speaker can be described approximately by a Markov chain. The Markov property of this state sequence means that the subsequent state depends only on the current state and not on previous states.

In the analysis of a speech signal of two speakers talking simultaneously, only the resulting spectrum $y^{(t)}$ of the mixture of the two individual speech signals is available, but not the pitch states $x_1^{(t)}$ and $x_2^{(t)}$ of the individual speakers. The subscript of the pitch states denotes speakers 1 and 2, while the superscript time index runs over $t = 1, \dots, T$. These individual pitch states are hidden variables. For the evaluation, a Hidden Markov Model (HMM) is used, for example, in which the hidden variables or states are inferred from the observable states (in this case from the resulting spectrum $y^{(t)}$ of the mixture).

In the described embodiment, each hidden variable has $|X| = 170$ states with fundamental frequencies from the interval of 80 to 500 Hz. Of course, more or fewer states from other fundamental frequency intervals can also be used.

The state "1" means "no pitch" (unvoiced or no voice activity), while the state values "2" to "170" denote different fundamental frequencies between the above values. In particular, the pitch value $f_0$ for the states $x > 1$ is determined according to the formula

$$f_0 = \frac{f_s}{30 + x}.$$

The sampling rate is $f_s = 16$ kHz. The pitch interval is thus unevenly resolved; low pitch values have a finer resolution than high pitch values: states 168, 169 and 170 have the fundamental frequencies 80.80 Hz (x = 168), 80.40 Hz (x = 169) and 80.00 Hz (x = 170), while states 2, 3 and 4 have the fundamental frequencies 500.00 Hz (x = 2), 484.84 Hz (x = 3) and 470.58 Hz (x = 4).
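The mapping from state index to fundamental frequency can be sketched as follows. The sketch reads the formula as f0 = fs / (30 + x), which reproduces the values quoted above for fs = 16 kHz; the function name and the use of None for the "no pitch" state are illustrative choices, not part of the described method.

```python
FS_HZ = 16_000  # sampling rate f_s of the described embodiment


def state_to_f0(x):
    """Return the fundamental frequency in Hz for pitch state x.

    State 1 is the "no pitch" state and maps to None (illustrative choice);
    states 2..170 map to f_s / (30 + x).
    """
    if x == 1:
        return None
    if not 2 <= x <= 170:
        raise ValueError("state must be in 1..170")
    return FS_HZ / (30 + x)


for x in (2, 3, 4, 168, 169, 170):
    print(x, round(state_to_f0(x), 2))
```

Because the state index appears in the denominator, neighboring states are about 15 Hz apart near 500 Hz but only about 0.4 Hz apart near 80 Hz. Rounded to two decimals the sketch gives, for example, 484.85 Hz for x = 3, where the text quotes the truncated value 484.84 Hz.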

The inventive method comprises in the described embodiment the following steps:

- Training phase: training a speaker-dependent GMM to model the short-term spectrum for each of the 170 states (169 fundamental frequency states and the no-pitch state) of each individual speaker;

- Interaction model: determining a probabilistic representation for the mixture of the two individual speakers using an interaction model, e.g. the MixMax interaction model; depending on the choice of the interaction model, either the short-term magnitude spectrum or the logarithmic short-term magnitude spectrum is modeled in the training phase;

- Tracking: determining the fundamental frequency trajectories of the two individual speakers using a suitable tracking algorithm, e.g. Junction Tree or Sum-Product (in the present embodiment, the application of the Factorial Hidden Markov Model (FHMM) is described).

Training Phase

In the method according to the invention, a supervised scenario is assumed in which the speech signals of the individual speakers are modeled using training data. In principle, all supervised training methods can be used, i.e. both generative and discriminative ones. The spectrogram properties can be described by a general, parametric or non-parametric statistical model $p(s_i | x_i)$. The use of GMMs is therefore a special case.

In the present embodiment, 170 GMMs are trained for each speaker (one GMM per pitch state) using the EM (Expectation Maximization) algorithm. The training data are, for example, sound recordings of individual speakers, i.e. a set of $N_i$ log spectra of the individual speaker $i$, $S_i = \{s_i^{(1)}, \dots, s_i^{(N_i)}\}$, together with the associated pitch values $\{x_i^{(1)}, \dots, x_i^{(N_i)}\}$. These data can be generated automatically from single-speaker recordings with a pitch tracker.

The EM algorithm is an iterative optimization method for estimating unknown parameters from known data such as training data. By alternating between assigning the data to the model components (expectation step) and then adjusting the model parameters (maximization step), it iteratively maximizes the probability of the occurrence of a stochastic process under a given model.

Since the stochastic process, in the present case the spectrum of the speech signal, is given by the training data, it is the model parameters that must be adapted in order to maximize this probability. The prerequisite for finding this maximum is that the likelihood of the model increases after each iteration step and the calculation of a new model. To initialize the learning algorithm, a number of superimposed Gaussian distributions is chosen, together with a GMM with arbitrary parameters (e.g. mean, variance and weighting factors). Through the iterative maximum-likelihood (ML) estimation of EM, one thus obtains a representative model for the single-speaker speech signal, in the present case a speaker-dependent GMM $p(s_i | \Theta_{i,x_i}^{M_{i,x_i}})$. Thus, 170 GMMs must be trained for each speaker, i.e. one GMM for each pitch state $x_i$, corresponding to the number of states defined above.
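The alternation of expectation and maximization steps can be made concrete with a minimal sketch for a univariate GMM. It is a textbook EM loop under simplifying assumptions (one-dimensional data, quantile-based initialization, a small variance floor, synthetic data); the function name `em_gmm_1d` and all numeric values are hypothetical and do not reproduce the training setup of the embodiment.

```python
import numpy as np


def em_gmm_1d(data, num_components, num_iters=50):
    """Fit a univariate GMM with EM.

    Returns weights, means, variances and the log-likelihood per iteration.
    """
    n = data.shape[0]
    # Simple deterministic starting point; any initialization is permitted.
    weights = np.full(num_components, 1.0 / num_components)
    quantiles = (np.arange(num_components) + 0.5) / num_components
    means = np.quantile(data, quantiles)
    variances = np.full(num_components, np.var(data))
    trace = []
    for _ in range(num_iters):
        # Expectation step: soft assignment of each sample to each component.
        diff = data[:, None] - means[None, :]
        log_pdf = -0.5 * (np.log(2 * np.pi * variances) + diff ** 2 / variances)
        log_joint = np.log(weights) + log_pdf
        log_norm = np.logaddexp.reduce(log_joint, axis=1)
        trace.append(float(log_norm.sum()))
        resp = np.exp(log_joint - log_norm[:, None])
        # Maximization step: re-estimate parameters from the soft assignments.
        nk = resp.sum(axis=0)
        weights = nk / n
        means = (resp * data[:, None]).sum(axis=0) / nk
        diff = data[:, None] - means[None, :]
        variances = np.maximum((resp * diff ** 2).sum(axis=0) / nk, 1e-6)
    return weights, means, variances, trace


# Synthetic stand-in for training data: two clusters of 200 samples each.
rng = np.random.default_rng(1)
data = np.concatenate([rng.normal(-2.0, 0.5, 200), rng.normal(3.0, 1.0, 200)])
weights, means, variances, trace = em_gmm_1d(data, num_components=2)
print(weights, means, variances)
```

The recorded log-likelihood values in `trace` do not decrease from one iteration to the next, which is the property the description relies on.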

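The modeling of the state-dependent log spectra of the individual speakers by means of GMMs in the present exemplary embodiment thus takes place in accordance with $p(s_i | x_i) = p(s_i | \Theta_{i,x_i}^{M_{i,x_i}})$, i.e.

$$p(s_i | x_i) = \sum_{m=1}^{M_{i,x_i}} \alpha_{i,x_i}^{m} \, NV(s_i | \theta_{i,x_i}^{m}).$$

$M_{i,x_i} \geq 1$ denotes the number of mixture components (i.e. the normal distributions necessary to represent the spectrum), and $\alpha_{i,x_i}^{m}$ is the weighting factor of each component $m = 1, \dots, M_{i,x_i}$. "NV" denotes the normal distribution.

The weighting factor $\alpha_{i,x_i}^{m}$ must be positive, $\alpha_{i,x_i}^{m} > 0$, and satisfy the normalization condition $\sum_{m=1}^{M_{i,x_i}} \alpha_{i,x_i}^{m} = 1$. The associated GMM is completely determined by the parameters

$$\Theta_{i,x_i}^{M_{i,x_i}} = \{\alpha_{i,x_i}^{m}, \theta_{i,x_i}^{m}\}_{m=1}^{M_{i,x_i}}, \qquad \theta_{i,x_i}^{m} = \{\mu_{i,x_i}^{m}, \Sigma_{i,x_i}^{m}\};$$

$\mu$ stands for the mean value, $\Sigma$ denotes the covariance.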

After the training phase, GMMs are available for all fundamental frequency values of all speakers. In the present exemplary embodiment, this means two speakers, each with 170 states from the frequency interval 80 to 500 Hz. It should be pointed out once again that this is an exemplary embodiment and that the method can also be applied to a larger number of signal sources and to other frequency intervals.

Interaction Model

For analysis, the single-channel speech signals, recorded and sampled at a sampling frequency of, for example, $f_s = 16$ kHz, are considered section by section. In each time segment $t$, the observed (log) spectrum $y^{(t)}$ of the mixed signal, i.e. the mixture of the two individual speaker signals, is modeled with the observation probability $p(y^{(t)} | x_1^{(t)}, x_2^{(t)})$. On the basis of this observation probability, the most probable pitch states of both speakers can be determined at any given time, for example, or the observation probability serves directly as input for the tracking algorithm used in step c). In principle, the (log) spectra of the individual speakers, described by $p(s_1 | x_1)$ and $p(s_2 | x_2)$, combine to form the mixed signal $y$: the magnitude spectra add up approximately, so that for the log magnitude spectra $y \approx \log(\exp(s_1) + \exp(s_2))$. The probability distribution of the mixed signal is thus a function of the two individual signals, $p(y) = f(p(s_1), p(s_2))$. The function $f$ depends on which interaction model is chosen.

Several approaches are possible for this. In the linear model, the individual spectra are added in the magnitude spectrogram according to the form given above, and the mixed signal is thus approximately the sum of the magnitude spectra of the individual speakers. In simple terms, the probability distributions of the two individual speakers, $NV(s_1 | \mu_1, \Sigma_1)$ and $NV(s_2 | \mu_2, \Sigma_2)$, thus combine to form the probability distribution of the mixed signal, $NV(y | \mu_1 + \mu_2, \Sigma_1 + \Sigma_2)$. Normal distributions are mentioned here only for ease of understanding; in the method according to the invention, the probability distributions are GMMs.

In the illustrated embodiment of the method according to the invention, a further interaction model is used: According to the MixMax interaction model, the log spectrogram of two speakers can be approximated by the element-wise maximum of the log spectra of the individual speakers. This makes it possible to quickly obtain a good probability model of the observed mixed signal. As a result, the duration and computational effort of the learning phase are drastically reduced.

For each time segment $t$, $y^{(t)} = \max(s_1^{(t)}, s_2^{(t)})$, where $s_i^{(t)}$ is the log magnitude spectrum of speaker $i$. The log magnitude spectrum $y^{(t)}$ is thus generated by means of a stochastic model, as shown in Fig. 1.

In this model, the two speakers ($i = 1, 2$) each produce a log magnitude spectrum $s_i^{(t)}$ as a function of the fundamental frequency state $x_i^{(t)}$. The observed log magnitude spectrum $y^{(t)}$ of the mixed signal is approximated by the element-wise maximum of the two individual speaker log magnitude spectra. In other words, for each frame of the time signal, the logarithmic magnitude spectrum of the mixed signal is approximated by the element-wise maximum of the two logarithmic single-speaker spectra (samples of the time signal are combined into frames, and the short-term magnitude spectrum of each frame is then calculated from its samples by means of the FFT (fast Fourier transform), discarding the phase information). Instead of the inaccessible speech signals of the individual speakers, the probabilities of the spectra, which could be learned individually beforehand, are considered. For a fixed fundamental frequency value corresponding to a state $x_i^{(t)}$, speaker $i$ generates a log spectrum $s_i^{(t)}$, which is a realization of the distribution described by the single-speaker model $p(s_i^{(t)} | x_i^{(t)})$.

The two log spectra are then combined by the element-wise maximum operator to form the observable log spectrum $y^{(t)}$. Thus $p(y^{(t)} | s_1^{(t)}, s_2^{(t)}) = \delta(y^{(t)} - \max(s_1^{(t)}, s_2^{(t)}))$, where $\delta(\cdot)$ denotes the Dirac delta function.
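How closely the element-wise maximum follows the log spectrum of a mixture can be checked numerically. The following sketch assumes that the magnitude spectra of the two speakers simply add (phase is ignored), so that the log spectrum of the mixture is log(exp(s1) + exp(s2)); the spectra themselves are random stand-ins, not speech data.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical log magnitude spectra of two speakers (one frame, 8 bins).
s1 = rng.normal(0.0, 2.0, size=8)
s2 = rng.normal(0.0, 2.0, size=8)

# Log spectrum of the mixture if the magnitudes add: log(exp(s1) + exp(s2)).
log_sum = np.logaddexp(s1, s2)
# MixMax approximation: element-wise maximum of the two log spectra.
mixmax = np.maximum(s1, s2)

# The approximation error equals log(1 + exp(-|s1 - s2|)) in every bin.
error = log_sum - mixmax
print(error.max(), np.log(2.0))
```

Under this assumption the error is never negative and never exceeds log 2, and it shrinks quickly in bins where one speaker dominates the other.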

When the MixMax interaction model is used, only the GMMs for each state of each speaker have to be determined, i.e. twice the cardinality of the state variables (2 x 170 = 340 GMMs for two speakers). In conventional models, assuming 170 different fundamental frequency states, a total of 170 x 170 = 28,900 different fundamental frequency pairs results for each speaker pair, which entails a significantly increased computational effort.

In addition to the linear model and the MixMax interaction model, other models can also be used. An example of this is the Algonquin model, as described, for example, by Brendan J. Frey et al. in "ALGONQUIN - Learning Dynamic Noise Models from Noisy Speech for Robust Speech Recognition" (Advances in Neural Information Processing Systems 14, MIT Press, Cambridge, pp. 1165-1172, January 2002).

As with the MixMax interaction model, the Algonquin model models the log magnitude spectrum of the mixture of two speakers. While $y^{(t)} = \max(s_1^{(t)}, s_2^{(t)})$ applies in the MixMax interaction model, the Algonquin model has the following form: $y^{(t)} = s_1^{(t)} + \log(1 + \exp(s_2^{(t)} - s_1^{(t)}))$. From this, in turn, the probability distribution of the mixed signal can be derived from the probability distributions of the individual speaker signals.

As already mentioned, only the MixMax interaction model is treated in the illustrated embodiment of the method according to the invention.

Tracking

The task of tracking basically consists in finding the sequence of hidden states $x^*$ that maximizes the conditional probability distribution, $x^* = \arg\max_x p(x | y)$. For tracking the pitch courses over time, an FHMM is used in the described embodiment of the method according to the invention. The FHMM makes it possible to track the states of multiple parallel Markov chains, with the available observations being regarded as a common effect of all Markov chains. The results described under the heading "Interaction Model" are used for this.

In an FHMM, therefore, several Markov chains are considered in parallel, as is the case in the described embodiment, for example, where two speakers speak simultaneously. The resulting situation is shown in Fig. 2.

As mentioned above, the hidden state variables of the individual speakers are denoted by $x_k^{(t)}$, where $k$ denotes the Markov chains (and thus the speakers) and the time index $t$ runs from 1 to $T$. The Markov chains 1, 2 are shown extending horizontally in Fig. 2. It is assumed that all hidden state variables have the cardinality $|X|$, i.e. 170 states in the exemplary embodiment described. The observed random variable is denoted by $y^{(t)}$.

The dependence of the hidden variables between two successive time segments is defined by the transition probability $p(x_k^{(t)} | x_k^{(t-1)})$. The dependence of the observed random variable $y^{(t)}$ on the hidden variables of the same time segment is defined by the observation probability $p(y^{(t)} | x_1^{(t)}, x_2^{(t)})$, which, as already mentioned above, can be established by means of an interaction model. The initial probability of the hidden variables in each chain is given as $p(x_k^{(1)})$.

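With the whole sequence of variables written as $x = \bigcup_{t=1}^{T} \{x_1^{(t)}, x_2^{(t)}\}$ and $y = \bigcup_{t=1}^{T} \{y^{(t)}\}$, the following expression results for the joint distribution of all variables:

$$p(x, y) = p(y | x)\, p(x) = \prod_{k=1}^{2} \left[ p(x_k^{(1)}) \prod_{t=2}^{T} p(x_k^{(t)} | x_k^{(t-1)}) \right] \prod_{t=1}^{T} p(y^{(t)} | x_1^{(t)}, x_2^{(t)}).$$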
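For two chains, the most probable joint state sequence under this factorization can be found with the max-sum (Viterbi) recursion on pairs of states. The following is a minimal sketch under stated assumptions: a tiny model with A = 3 states and T = 4 frames, random stand-in probabilities, and hypothetical names (`fhmm_viterbi`, `log_normalize`). It exploits the fact that the chains evolve independently, so the maximization over predecessors is carried out one chain at a time.

```python
import numpy as np


def fhmm_viterbi(log_prior, log_trans, log_obs):
    """Most probable joint state path of a two-chain FHMM (max-sum).

    log_prior: (2, A)    log initial probabilities of chains k = 1, 2
    log_trans: (2, A, A) log transition probabilities, indexed [k, prev, next]
    log_obs:   (T, A, A) log observation probabilities, indexed [t, x1, x2]
    Returns two integer arrays of length T with 0-based states.
    """
    T, A, _ = log_obs.shape
    delta = log_prior[0][:, None] + log_prior[1][None, :] + log_obs[0]
    back = np.zeros((T, A, A, 2), dtype=int)
    for t in range(1, T):
        # Maximize over the previous state of chain 1: axes (prev1, next1, prev2).
        cand1 = delta[:, None, :] + log_trans[0][:, :, None]
        arg1 = cand1.argmax(axis=0)
        tmp = cand1.max(axis=0)
        # Maximize over the previous state of chain 2: axes (next1, prev2, next2).
        cand2 = tmp[:, :, None] + log_trans[1][None, :, :]
        arg2 = cand2.argmax(axis=1)
        delta = cand2.max(axis=1) + log_obs[t]
        back[t, :, :, 1] = arg2
        back[t, :, :, 0] = np.take_along_axis(arg1, arg2, axis=1)
    # Backtracking from the best final pair of states.
    a, b = np.unravel_index(delta.argmax(), delta.shape)
    path1 = np.zeros(T, dtype=int)
    path2 = np.zeros(T, dtype=int)
    path1[-1], path2[-1] = a, b
    for t in range(T - 1, 0, -1):
        a, b = back[t, a, b]
        path1[t - 1], path2[t - 1] = a, b
    return path1, path2


def log_normalize(values, axis):
    """Turn arbitrary scores into log probabilities along the given axis."""
    return values - np.logaddexp.reduce(values, axis=axis, keepdims=True)


rng = np.random.default_rng(0)
A, T = 3, 4
log_prior = log_normalize(rng.normal(size=(2, A)), axis=1)
log_trans = log_normalize(rng.normal(size=(2, A, A)), axis=2)
log_obs = rng.normal(size=(T, A, A))  # stand-in for log p(y | x1, x2)
path1, path2 = fhmm_viterbi(log_prior, log_trans, log_obs)
print(path1, path2)
```

Maximizing over one chain at a time costs on the order of A³ operations per frame instead of A⁴ for a naive search over all pairs of predecessor pairs.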

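The observation probability $p(y^{(t)} | x_1^{(t)}, x_2^{(t)})$ is generally obtained by marginalization over the unknown (log) spectra of the individual speakers:

$$p(y^{(t)} | x_1^{(t)}, x_2^{(t)}) = \iint p(y^{(t)} | s_1^{(t)}, s_2^{(t)})\, p(s_1^{(t)} | x_1^{(t)})\, p(s_2^{(t)} | x_2^{(t)})\, ds_1^{(t)}\, ds_2^{(t)} \qquad (1)$$

where $p(y^{(t)} | s_1^{(t)}, s_2^{(t)})$ represents the interaction model.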

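When speaker-specific GMMs are used, the $s_i$ are marginalized out and the MixMax model is applied, (1) takes the following form (the time index $(t)$ is omitted):

$$p(y | x_1, x_2) = \sum_{m=1}^{M_{1,x_1}} \sum_{n=1}^{M_{2,x_2}} \alpha_{1,x_1}^{m} \alpha_{2,x_2}^{n} \prod_{d=1}^{D} \left[ NV(y_d | \theta_{1,x_1}^{m,d})\, \Phi(y_d | \theta_{2,x_2}^{n,d}) + \Phi(y_d | \theta_{1,x_1}^{m,d})\, NV(y_d | \theta_{2,x_2}^{n,d}) \right]$$

where $y_d$ denotes the d-th of the $D$ elements of the log spectrum $y$, $\theta^{m,d}$ denotes the d-th element of the associated mean and variance, and $\Phi(y | \theta) = \int_{-\infty}^{y} NV(x | \theta)\, dx$ represents the univariate cumulative normal distribution.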
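The MixMax observation probability can be evaluated directly from the GMM parameters. The sketch below assumes diagonal covariances, so that each frequency bin contributes a univariate density and a univariate cumulative distribution: the density of the maximum of two independent Gaussian variables is the density of one times the cumulative distribution of the other, summed over both roles. The function names and the small example GMMs are hypothetical.

```python
import math

import numpy as np

_erf = np.vectorize(math.erf)


def normal_pdf(y, mean, var):
    """Univariate normal density, evaluated element-wise."""
    return np.exp(-0.5 * (y - mean) ** 2 / var) / np.sqrt(2 * np.pi * var)


def normal_cdf(y, mean, var):
    """Univariate cumulative normal distribution, evaluated element-wise."""
    return 0.5 * (1.0 + _erf((y - mean) / np.sqrt(2 * var)))


def mixmax_likelihood(y, gmm1, gmm2):
    """p(y | x1, x2) for diagonal-covariance GMMs under the MixMax model.

    y: (D,) observed log spectrum.
    gmm1, gmm2: tuples (weights (M,), means (M, D), variances (M, D)) of the
    state-dependent GMMs of speaker 1 and speaker 2.
    """
    w1, mu1, var1 = gmm1
    w2, mu2, var2 = gmm2
    pdf1, cdf1 = normal_pdf(y, mu1, var1), normal_cdf(y, mu1, var1)
    pdf2, cdf2 = normal_pdf(y, mu2, var2), normal_cdf(y, mu2, var2)
    # Per bin: one speaker produces the value y_d while the other lies below it.
    per_bin = (pdf1[:, None, :] * cdf2[None, :, :]
               + cdf1[:, None, :] * pdf2[None, :, :])
    return float(np.sum(w1[:, None] * w2[None, :] * per_bin.prod(axis=2)))


# Hypothetical two-component GMMs over a log spectrum with D = 3 bins.
gmm_speaker1 = (np.array([0.3, 0.7]),
                np.array([[0.0, 1.0, 2.0], [1.0, 0.0, -1.0]]),
                np.ones((2, 3)))
gmm_speaker2 = (np.array([0.5, 0.5]),
                np.array([[2.0, 2.0, 0.0], [-1.0, 0.5, 1.0]]),
                np.full((2, 3), 0.5))
y = np.array([1.5, 1.0, 0.5])
print(mixmax_likelihood(y, gmm_speaker1, gmm_speaker2))
```

For realistic spectrum sizes the product over bins underflows quickly, so a practical implementation would typically carry out the same computation in the log domain.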

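Likewise, for (1) using the linear interaction model, the following representation is obtained:

$$p(y | x_1, x_2) = \sum_{m=1}^{M_{1,x_1}} \sum_{n=1}^{M_{2,x_2}} \alpha_{1,x_1}^{m} \alpha_{2,x_2}^{n} \, NV(y | \mu_{1,x_1}^{m} + \mu_{2,x_2}^{n}, \Sigma_{1,x_1}^{m} + \Sigma_{2,x_2}^{n})$$

where $y$ is the spectrum of the mixed signal.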

Fig. 3 shows a schematic representation of the sequence of the method according to the invention in the form of a block diagram.

A speech signal, or a composite signal of a plurality of individual signals, is recorded with one channel, for example with a microphone. This process step is designated by 100 in the block diagram.

In an independent method step, which is carried out in advance of the application of the method, for example, the speech signals of the individual speakers are modeled using training data in a training phase 101. Using the EM (Expectation Maximization) algorithm, one speaker-dependent GMM is trained for each of the 170 pitch states. The training phase is carried out for all possible states; in the described embodiment, these are 170 states between 80 and 500 Hz for each of the two speakers. In other words, a pitch-dependent spectrogram model is trained for each speaker by means of GMMs, the MDL criterion being applied to find the optimal number of Gaussian components. In a further step 102, the GMMs or the associated parameters are stored, for example in a database.

103: In order to obtain a probabilistic representation of the mixed signal of two or more speakers, or of the individual signal components of the mixed signal, an interaction model, preferably the MixMax interaction model, is used. Subsequently, the FHMM is applied as part of the tracking 104 of the fundamental frequency courses. Using the FHMM, it is possible to track the states of several hidden Markov processes that run concurrently, with the available observations being regarded as effects of the individual Markov processes.

Claims

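1. A method for determining fundamental frequency courses of a plurality of signal sources from a single-channel audio recording of a mixed signal, comprising the steps of: a) determining the spectrogram properties of the pitch states of individual signal sources using training data; b) determining the probabilities of the possible fundamental frequency combinations of the signal sources contained in the mixed signal by combining the properties determined in a) by means of an interaction model; c) tracking the fundamental frequency courses of the individual signal sources.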

2. The method according to claim 1, characterized in that in a) the spectrogram properties are determined by means of a Gaussian Mixture Model (GMM).

3. The method according to claim 2, characterized in that the Minimum Description Length criterion is further applied to determine the number of components of the GMM.

4. The method according to any one of claims 1 to 3, characterized in that in b) a linear model or the MixMax interaction model or the ALGONQUIN interaction model is used as the interaction model.

5. The method according to any one of claims 1 to 4, characterized in that the tracking in c) takes place by means of the Factorial Hidden Markov Model (FHMM).

6. The method according to claim 5, characterized in that the sum-product algorithm or the max-sum algorithm is used to solve the FHMM.