CN102054480B - Method for separating monaural overlapping speeches based on fractional Fourier transform (FrFT) - Google Patents

Method for separating monaural overlapping speeches based on fractional Fourier transform (FrFT) Download PDF

Info

Publication number
CN102054480B
CN102054480B CN2009102359018A CN200910235901A
Authority
CN
China
Prior art keywords
fundamental frequency
frame
signal
harmonic wave
aliasing
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Expired - Fee Related
Application number
CN2009102359018A
Other languages
Chinese (zh)
Other versions
CN102054480A (en)
Inventor
茹婷婷
谢湘
匡镜明
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Institute of Technology BIT
Original Assignee
Beijing Institute of Technology BIT
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Institute of Technology BIT filed Critical Beijing Institute of Technology BIT
Priority to CN2009102359018A priority Critical patent/CN102054480B/en
Publication of CN102054480A publication Critical patent/CN102054480A/en
Application granted
Publication of CN102054480B publication Critical patent/CN102054480B/en

Abstract

The invention relates to a method for separating monaural overlapped speech based on the fractional Fourier transform (FrFT), belonging to the technical field of audio signal processing. The method comprises the following steps: first, the overlapped speech signal is preprocessed to remove its silent segments and locate the voiced frames; next, FrFT-based pitch detection is performed on the voiced-frame signals to separate the fundamental frequencies of the overlapped speech; finally, the fundamental frequencies are combined with a sinusoidal model of the speech signal to synthesize speech, yielding each separated speech signal. The advantages of the method are that the fundamental frequencies of several overlapped voices can be separated and extracted effectively, so that the overlapped speech is effectively separated; and because the pitch frequencies are extracted with the FrFT rather than the traditional fast Fourier transform (FFT), the smearing of the harmonic spectrum is reduced and more accurate fundamental frequencies of the original signals are obtained. The method is especially suitable for separating monaural overlapped speech containing the voices of two speakers.

Description

A monaural overlapped-speech separation method based on the fractional Fourier transform
Technical field
The present invention relates to a method that uses the fractional Fourier transform to separate monaural overlapped speech, and belongs to the technical field of audio signal processing.
Background technology
An important problem in the field of speech and audio signal processing is how to isolate the speech a listener is interested in from an overlapped (mixed) speech signal. Overlapped-speech separation has significant theoretical and practical value in speech communication, acoustic target detection, speech enhancement, and related areas. However, because the source signals that make up the overlapped speech overlap completely in both the time domain and the frequency domain, common speech-enhancement methods have difficulty separating the speech of interest (called the target speech) from the interfering speech.
The fractional Fourier transform (Fractional Fourier Transform, FrFT) has excellent properties for analyzing certain non-stationary signals and has in recent years become a tool attracting wide attention in the signal-processing community. For speech, which is a non-stationary signal, applications of the FrFT and similar transforms currently concentrate on the following aspects: speech analysis, where it offers higher time-frequency resolution than traditional Fourier methods; pitch estimation, where it yields more accurate estimates than classical methods; speech enhancement; speech recognition; and speaker identification.
Research on overlapped-speech separation falls mainly into two categories: auditory scene analysis (Auditory Scene Analysis, ASA) and blind source separation (Blind Source Separation, BSS). Auditory scene analysis is studied in two ways. One starts from human auditory physiology and psychology and investigates the rules people follow when recognizing speech, i.e., auditory scene analysis proper. The other builds models from the findings on human auditory perception, analyzes the models mathematically, and implements them on a computer; this is the subject of computational auditory scene analysis (Computational Auditory Scene Analysis, CASA). Blind source separation refers to estimating the source-signal components solely from the observed signal and some prior knowledge of the sources (such as probability densities), when the source signals and the transmission-channel characteristics are unknown. The independent component analysis method for blind source separation was first proposed by P. Comon; it is a technique built on neural-network and statistical foundations, and remains a very active research frontier.
The existing overlapped-speech separation methods mainly have the following deficiencies:
(1) Research on auditory scene analysis and computational auditory scene analysis is still at an early stage. In computational auditory scene analysis in particular, the models built so far can only be used to verify theories of auditory scene analysis that are not yet well established, namely the mechanisms by which the human brain processes audio signals.
Research on blind source separation is very active, but the problem is still far from solved; it involves the stability and phase-ambiguity problems of multichannel convolutive mixing and blind deconvolution systems, in particular the blind deconvolution problem when the number of sources is unknown and the case of noisy mixtures.
(2) Separating and extracting the fundamental frequencies of the overlapped voices is the key to overlapped-speech separation in auditory scene analysis, but the existing pitch separation-and-extraction methods consider only voiced-voiced overlap and not unvoiced-voiced overlap. The reason is that in the unvoiced frames of a speech signal the excitation is aperiodic, so estimating a fundamental frequency for unvoiced frames is meaningless. Moreover, the fundamental frequencies estimated for unvoiced frames are typically highly random and lack continuity, yet the fundamental frequencies extracted from overlapped speech are assigned to speakers on the basis of pitch continuity. Consequently, fundamental frequencies estimated for unvoiced frames disturb the pitch-assignment decision and in turn degrade the smoothing of the pitch contours.
Summary of the invention
The objective of the present invention is to solve the problem of isolating target speech from a monaural overlapped speech signal and, overcoming the defects of the prior art, to propose a new monaural overlapped-speech separation method based on the fractional Fourier transform.
The technical scheme adopted by the present invention is as follows:
A monaural overlapped-speech separation method based on the fractional Fourier transform comprises the following steps:
Step 1: preprocess the overlapped speech signal, remove its silent segments, and locate the voiced frames.
First, endpoint detection is applied to the overlapped speech signal to remove its silent segments; the remaining overlapped segments are taken as the object of processing.
Then, the remaining overlapped segments are divided into frames, a voiced/unvoiced decision is made, and the voiced frames are marked.
Step 2: based on the fractional Fourier transform, perform pitch detection on the voiced-frame signals obtained in step 1 and separate the pitch contours of the overlapped speech, that is, the fundamental frequency of each source signal. The process is as follows:
First, the order of the FrFT is computed from the frame-to-frame continuity of the signal. Then the voiced-frame signal is transformed with the FrFT, the harmonic-sum spectrum is computed, and the fundamental frequency of one speaker, i.e., of one source signal, is extracted with a dynamic-programming method.
After one speaker's fundamental frequency has been found, the spectral components of that speaker's fundamental and its harmonics are subtracted from the harmonic-sum spectrum, and dynamic programming is applied once more to obtain the other speaker's fundamental frequency, i.e., the fundamental frequency of the other source signal.
Repeating this process yields the fundamental frequency of every source signal.
Step 3: since a speech signal can be represented as a superposition of a set of sinusoids, each pitch contour obtained in step 2 is combined with a sinusoidal model of the speech signal to synthesize speech, yielding each separated speech signal.
The positive effects and advantages of the present invention are:
1. The method can effectively separate and extract the fundamental frequencies of several overlapped voices, and thereby achieve effective separation of the overlapped speech.
2. The fundamental frequency is extracted with the FrFT instead of the traditional fast Fourier transform (FFT), which reduces the smearing of the harmonic spectrum.
3. Because every frame of the signal has its own intrinsic chirp (frequency-modulation) rate, the FrFT allows a suitable order to be chosen to match that rate, so the fundamental frequency of the original signal is obtained more accurately.
The present invention is especially suitable for separating monaural overlapped speech that contains the voices of two speakers.
Description of drawings
Fig. 1 is the flow chart of the implementation of the method of the invention.
Fig. 2 is the flow chart of FrFT-based pitch detection of the overlapped speech signal in the method of the invention.
Embodiment
A preferred embodiment of the present invention is described below with reference to the accompanying drawings.
A monaural overlapped-speech separation method based on the fractional Fourier transform, whose implementation flow is shown in Fig. 1, comprises the following steps:
Step 1: preprocess the overlapped speech signal, remove its silent segments, and locate the voiced frames.
First, endpoint detection is applied to the overlapped speech signal to remove its silent segments; the remaining overlapped segments are taken as the object of processing. Endpoint detection can combine short-time energy with the zero-crossing rate.
Then the remaining overlapped segments are divided into frames with a frame length of 20 ms and a frame shift of 10 ms, a voiced/unvoiced decision is made, and the voiced frames are marked. The voiced/unvoiced decision for an overlapped speech signal differs slightly from that for a single voice: when two voices overlap there are three possible cases — both voiced, one unvoiced and one voiced, or both unvoiced. The decision is made in two steps: first judge whether both overlapped signals are unvoiced; if so, the decision ends; if not, judge whether the mixture is unvoiced-voiced or voiced-voiced. In the unvoiced-voiced case only the voiced frames receive subsequent processing and the unvoiced frames are left untouched; frames in which both signals are unvoiced are likewise not processed.
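As a concrete illustration of this preprocessing step, the 20 ms / 10 ms framing and a short-time-energy/zero-crossing-rate frame classification can be sketched as follows (a minimal sketch: the thresholds and the three-way silence/voiced/unvoiced rule are illustrative assumptions, not the patent's exact decision logic for overlapped frames):

```python
import numpy as np

def frame_signal(x, fs, frame_ms=20, hop_ms=10):
    """Split a signal into overlapping frames (20 ms frames, 10 ms shift)."""
    flen = int(fs * frame_ms / 1000)
    hop = int(fs * hop_ms / 1000)
    n = 1 + max(0, (len(x) - flen) // hop)
    return np.stack([x[i * hop : i * hop + flen] for i in range(n)])

def classify_frames(frames, energy_thr, zcr_thr):
    """Label each frame: low short-time energy -> silence; otherwise a low
    zero-crossing rate suggests voiced, a high one unvoiced.  Both
    thresholds are illustrative assumptions."""
    energy = np.sum(frames.astype(float) ** 2, axis=1)
    zcr = np.mean(np.abs(np.diff(np.sign(frames), axis=1)) > 0, axis=1)
    return np.where(energy < energy_thr, "silence",
                    np.where(zcr < zcr_thr, "voiced", "unvoiced"))
```

Only the frames labelled voiced would be passed on to the FrFT-based pitch detection of step 2.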
Step 2: perform pitch detection on the voiced frames obtained in step 1 using the fractional Fourier transform, and separate the pitch contours of the overlapped speech, i.e., the fundamental frequency of each source signal. The implementation flow is shown in Fig. 2.
First, the order of the FrFT is computed from the frame-to-frame continuity of the signal. Since the goal is to solve for the fundamental frequency of the speech signal, and the search exploits the continuity of pitch across frames, the FrFT order α_i is closely related to the fundamental frequencies of the neighboring frames and is expressed as:

α_i = 1 − |(p_i − p_{i−1}) / (p_i + p_{i+1})|    (1.1)

where p_{i−1}, p_i and p_{i+1} are the estimated fundamental frequencies of the previous, current and next frame, respectively; they can be obtained with the short-time Fourier transform.
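Equation 1.1 translates directly into code; a one-line sketch, assuming the three coarse STFT-based pitch estimates are already available:

```python
def frft_order(p_prev, p_cur, p_next):
    """FrFT order alpha_i of Eq. 1.1, computed from the coarse pitch
    estimates p_{i-1}, p_i, p_{i+1} of the previous, current and next
    frame (obtained beforehand with an ordinary short-time Fourier
    analysis).  A smooth pitch track gives an order close to 1, i.e.
    close to the ordinary Fourier transform."""
    return 1.0 - abs((p_cur - p_prev) / (p_cur + p_next))
```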
Then the voiced-frame signal obtained in step 1 is transformed with the FrFT, the harmonic-sum spectrum is computed, and one of the pitch contours, i.e., one speaker's fundamental frequency, is extracted with a dynamic-programming method. The detailed process is as follows:
(1) Apply an N-point (for example, 1024-point) fractional Fourier transform to the voiced-frame signal x(n) to obtain its magnitude spectrum X(α, k):

X(α, k) = FrFT_N{x(n)}    (1.2)

Transform the magnitude spectrum X(α, k) to the logarithmic domain to obtain the log-magnitude spectrum SLog(α, k):

SLog(α, k) = log₁₀(|X(α, k)|²)    (1.3)

Sum the log-magnitude spectrum SLog(α, k) over all harmonics in the frame to obtain the harmonic-sum spectrum ρ(α, f):

ρ(α, f) = (1/H) Σ_{h=1}^{H} SLog(α, hf)    (1.4)

In Eq. 1.4, H is the number of harmonics within the sampling bandwidth, h is the harmonic index, f is the candidate fundamental frequency of the frame, and α is the order of the frame.
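The harmonic summation of Eqs. 1.3–1.4 can be sketched as below. The sketch substitutes an ordinary FFT magnitude spectrum for the order-α FrFT spectrum (implementing a discrete FrFT is outside its scope; for α = 1 the two coincide) — the harmonic-summation machinery is identical in either case. The grid of candidate fundamentals is an assumption:

```python
import numpy as np

def harmonic_log_sum_spectrum(x, fs, f_grid, H=5, n_fft=1024):
    """rho(f) = (1/H) * sum_{h=1..H} log10(|X(h*f)|^2) over the candidate
    fundamentals in f_grid (Eqs. 1.3-1.4).  |X| is a magnitude spectrum;
    the patent evaluates it on the order-alpha FrFT spectrum, here an
    ordinary FFT is used as a stand-in."""
    X = np.abs(np.fft.rfft(x, n_fft))
    log_spec = np.log10(X ** 2 + 1e-12)      # small floor avoids log(0)
    bin_hz = fs / n_fft
    rho = np.zeros(len(f_grid))
    for j, f in enumerate(f_grid):
        # nearest spectral bin of each harmonic h*f below Nyquist
        bins = np.round(np.arange(1, H + 1) * f / bin_hz).astype(int)
        bins = bins[bins < len(log_spec)]
        rho[j] = log_spec[bins].mean()
    return rho
```

A true fundamental shows up as a clear maximum of rho over the candidate grid, since all of its harmonics land on high-energy bins simultaneously.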
(2) Considering that two voices overlap, extract from the harmonic-sum spectrum ρ(α, f) the M candidate peaks that may contain a fundamental component. With computational cost in mind, M is taken greater than or equal to 3; for M ≥ 3 the result is essentially unchanged.
The dynamic-programming method requires an objective function; the value of the objective is computed for every path, and the path with the maximum value is the desired pitch contour. To prevent halving and doubling errors in the estimation of the pitch period, the objective function c(α, f_g) is set to:

c(α, f_g) = k(f_g) · (ρ(α, f_g) − ρ(α, f_g/2))    (1.5)

In Eq. 1.5, f_g is the estimated fundamental frequency of the frame and k(f_g) is a function that decreases with f_g. The decreasing weight k(f_g) avoids doubling errors, and the term ρ(α, f_g/2) avoids halving errors. Writing (α_i, f_gi) as μ_i, the score function S_i(μ_i) of a path is set to:

S_i(μ_i) = S_{i−1}(μ*_{i−1}) + c(μ_i)    (1.6)

μ*_{i−1} = argmax_{μ_{i−1}} [S_{i−1}(μ_{i−1}) + c(μ_i)]    (1.7)

In Eqs. 1.6 and 1.7, i denotes the frame index, and μ*_{i−1} is the parameter chosen when selecting the appropriate order and obtaining the fundamental frequency of frame i−1. Because the fundamental frequency of normal speech lies between 50 Hz and 400 Hz, the search is restricted to this range; among the candidate peaks of every frame the f value that maximizes the score function S_i(μ_i) can be found and is taken as the fundamental frequency of one of the speakers in that frame. After all frames have been searched, the per-frame values are linked into a pitch contour, giving one speaker's pitch contour (that speaker's fundamental frequency).
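The path search of Eqs. 1.5–1.7 can be sketched as follows. The specific decreasing weight k(f) = 1/f and the reduction of the path recursion to a per-frame best-candidate choice (valid here because the score of Eq. 1.6 is additive over frames) are simplifying assumptions for illustration:

```python
import numpy as np

def objective(rho, f_grid, f):
    """c(f) = k(f) * (rho(f) - rho(f/2)) of Eq. 1.5.  The decreasing
    weight k(f) = 1/f (an assumed choice) penalises octave-doubling
    errors; the subtracted rho(f/2) term penalises octave-halving."""
    k = 1.0 / f
    r = np.interp(f, f_grid, rho)
    r_half = np.interp(f / 2.0, f_grid, rho)
    return k * (r - r_half)

def track_pitch(cands, scores):
    """Accumulate the best-scoring path over frames (Eqs. 1.6-1.7).
    cands[i] lists the candidate fundamentals of frame i (in the
    50-400 Hz search range), scores[i] their objective values."""
    path, total = [], 0.0
    for c, s in zip(cands, scores):
        j = int(np.argmax(s))       # best candidate of this frame
        path.append(c[j])
        total += s[j]
    return path, total
```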
After one speaker's fundamental frequency has been found, the spectral components of that speaker's fundamental and its harmonics are subtracted from the harmonic-sum spectrum ρ(α, f), and the dynamic-programming method is applied once more to obtain the other speaker's pitch contour (that speaker's fundamental frequency), thereby separating the fundamental frequencies of the overlapped speech.
The spectral components corresponding to the harmonics are obtained as follows:
To subtract the harmonic spectral components from the harmonic-sum spectrum, the number of harmonics H_i must first be known, since it determines how many spectral components are to be subtracted. The number of harmonics of the i-th frame is obtained from Eq. 1.8:

H_i = (f_s/2) / f_i    (1.8)

In Eq. 1.8, f_i is the fundamental frequency of the i-th frame and f_s is the sampling rate. The harmonic frequencies f′ are then related to the fundamental frequency f by:

f′ = h · f,  h = 2, 3, 4, …, H    (1.9)

In Eq. 1.9, H is the number of harmonics. Once the harmonic frequencies f′ are obtained, the corresponding spectral components are known.
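Equations 1.8–1.9 amount to a few lines of code (rounding the harmonic count down to an integer is an assumption implicit in the formula):

```python
def harmonic_count(f0, fs):
    """H_i of Eq. 1.8: number of harmonics of f0 below the Nyquist
    frequency fs/2, rounded down to an integer."""
    return int((fs / 2) // f0)

def harmonic_frequencies(f0, fs):
    """Harmonic frequencies f' = h*f0 for h = 2..H (Eq. 1.9); these are
    the spectral components removed from the harmonic-sum spectrum once
    the first speaker's fundamental has been found."""
    H = harmonic_count(f0, fs)
    return [h * f0 for h in range(2, H + 1)]
```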
Step 3: since a speech signal can be represented as a superposition of a set of sinusoids, each fundamental-frequency contour f_i obtained in step 2 is combined with a sinusoidal model of the speech signal to synthesize speech, yielding each separated speech signal.
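Step 3 can be sketched with a minimal sinusoidal-model synthesizer. Constant unit amplitudes and a fixed number of harmonics are illustrative assumptions; a full implementation would also estimate per-harmonic amplitudes and phases from the analysis stage:

```python
import numpy as np

def synthesize_voiced(f0_track, fs, hop_ms=10, n_harm=5):
    """Resynthesize one separated voice from its frame-wise fundamental
    frequencies f0_track as a sum of harmonic sinusoids, keeping the
    phase of each harmonic continuous across frame boundaries."""
    hop = int(fs * hop_ms / 1000)
    phase = np.zeros(n_harm)          # running phase of each harmonic
    n = np.arange(hop)
    out = []
    for f0 in f0_track:
        frame = np.zeros(hop)
        for h in range(1, n_harm + 1):
            w = 2.0 * np.pi * h * f0 / fs          # radians per sample
            frame += np.sin(phase[h - 1] + w * n)
            # advance the running phase so the next frame joins smoothly
            phase[h - 1] = (phase[h - 1] + w * hop) % (2.0 * np.pi)
        out.append(frame)
    return np.concatenate(out)
```

With a constant pitch track the output is exactly a stationary harmonic tone; with a varying track the per-frame phase update keeps the waveform free of discontinuities at frame boundaries.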

Claims (4)

1. A monaural overlapped-speech separation method based on the fractional Fourier transform, characterized by comprising the following steps:
Step 1: preprocessing the overlapped speech signal, removing its silent segments, and locating the voiced frames;
Step 2: based on the fractional Fourier transform, performing pitch detection on the voiced-frame signals obtained in step 1 and separating the pitch contours of the overlapped speech, that is, the fundamental frequency of each source signal, the process being as follows:
first, computing the order of the FrFT from the frame-to-frame continuity of the signal; then transforming the voiced-frame signal with the FrFT, computing the harmonic-sum spectrum, and extracting one speaker's fundamental frequency, i.e., the fundamental frequency of one source signal, with a dynamic-programming method;
after one speaker's fundamental frequency has been found, subtracting the spectral components of that speaker's fundamental and its harmonics from the harmonic-sum spectrum, and applying dynamic programming once more to obtain the other speaker's fundamental frequency, i.e., the fundamental frequency of the other source signal;
repeating this process to obtain the fundamental frequency of each source signal;
Step 3: combining each pitch contour obtained in step 2 with a sinusoidal model of the speech signal to synthesize speech, yielding each separated speech signal.
2. The monaural overlapped-speech separation method based on the fractional Fourier transform as claimed in claim 1, characterized in that in step 1, after the silent segments have been removed, the remaining overlapped segments are divided into frames as follows:
the frame length is 20 ms and the frame shift is 10 ms; a voiced/unvoiced decision is then made and the voiced frames are marked; the voiced/unvoiced decision for the overlapped speech is made in two steps: first judging whether both overlapped signals are unvoiced and, if so, ending the decision; if not, judging whether the mixture is unvoiced-voiced or voiced-voiced; in the unvoiced-voiced case only the voiced frames receive subsequent processing and the unvoiced frames are not processed; frames in which both signals are unvoiced are likewise not processed.
3. The monaural overlapped-speech separation method based on the fractional Fourier transform as claimed in claim 1 or 2, characterized in that in step 2, when the order of the FrFT is computed, the order α_i of the FrFT is expressed in terms of the fundamental frequencies of the neighboring frames as:

α_i = 1 − |(p_i − p_{i−1}) / (p_i + p_{i+1})|    (1.1)

where p_{i−1}, p_i and p_{i+1} are the estimated fundamental frequencies of the previous, current and next frame, respectively.
4. The monaural overlapped-speech separation method based on the fractional Fourier transform as claimed in claim 1 or 2, characterized in that, after the order of the FrFT has been computed, the voiced-frame signal obtained in step 1 is transformed with the FrFT, the harmonic-sum spectrum is computed, and a pitch contour, i.e., a fundamental frequency, is extracted with a dynamic-programming method; the detailed process is as follows:
(1) applying an N-point fractional Fourier transform to the voiced-frame signal x(n) to obtain its magnitude spectrum X(α, k):

X(α, k) = FrFT_N{x(n)}    (1.2)

transforming the magnitude spectrum X(α, k) to the logarithmic domain to obtain the log-magnitude spectrum SLog(α, k):

SLog(α, k) = log₁₀(|X(α, k)|²)    (1.3)

summing the log-magnitude spectrum SLog(α, k) over all harmonics in the frame to obtain the harmonic-sum spectrum ρ(α, f):

ρ(α, f) = (1/H) Σ_{h=1}^{H} SLog(α, hf)    (1.4)

in Eq. 1.4, H being the number of harmonics within the sampling bandwidth, h the harmonic index, f the candidate fundamental frequency of the frame, and α the order of the frame;
(2) extracting from the harmonic-sum spectrum ρ(α, f) the M candidate peaks that may contain a fundamental component, with M greater than or equal to 3;

the dynamic-programming method requires an objective function; the value of the objective is computed for every path, and the path with the maximum value is the desired fundamental frequency; the objective function c(α, f_g) is set to:

c(α, f_g) = k(f_g) · (ρ(α, f_g) − ρ(α, f_g/2))    (1.5)

in Eq. 1.5, f_g being the estimated fundamental frequency of the frame and k(f_g) a function that decreases with f_g; writing (α_i, f_gi) as μ_i, the score function S_i(μ_i) of a path is set to:

S_i(μ_i) = S_{i−1}(μ*_{i−1}) + c(μ_i)    (1.6)

μ*_{i−1} = argmax_{μ_{i−1}} [S_{i−1}(μ_{i−1}) + c(μ_i)]    (1.7)

in Eqs. 1.6 and 1.7, i denotes the frame index, and μ*_{i−1} is the parameter chosen when selecting the appropriate order and obtaining the fundamental frequency of frame i−1; because the fundamental frequency of normal speech lies between 50 Hz and 400 Hz, the search is restricted to this range; among the candidate peaks of every frame the value f_gi that maximizes the score function S_i(μ_i) can be found and is taken as the fundamental frequency of one of the speakers in that frame; after all frames have been searched, the per-frame values are linked into a pitch contour, giving one speaker's fundamental frequency;
after one speaker's fundamental frequency has been found, the spectral components of that speaker's fundamental and its harmonics are subtracted from the harmonic-sum spectrum ρ(α, f), and the dynamic-programming method is applied once more to obtain the other speaker's fundamental frequency, thereby separating the pitch contours of the overlapped speech;
the spectral components corresponding to the harmonics are obtained as follows:
to subtract the harmonic spectral components from the harmonic-sum spectrum, the number of harmonics H_i must first be known, since it determines how many spectral components are to be subtracted; the number of harmonics of the i-th frame is obtained from Eq. 1.8:

H_i = (f_s/2) / f_i    (1.8)

in Eq. 1.8, f_i being the fundamental frequency of the i-th frame and f_s the sampling rate; the harmonic frequencies f′ are then related to the fundamental frequency f by:

f′ = h · f,  h = 2, 3, 4, …, H    (1.9)

in Eq. 1.9, H being the number of harmonics; once the harmonic frequencies f′ have been obtained, the corresponding spectral components are known.
CN2009102359018A 2009-10-29 2009-10-29 Method for separating monaural overlapping speeches based on fractional Fourier transform (FrFT) Expired - Fee Related CN102054480B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN2009102359018A CN102054480B (en) 2009-10-29 2009-10-29 Method for separating monaural overlapping speeches based on fractional Fourier transform (FrFT)

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN2009102359018A CN102054480B (en) 2009-10-29 2009-10-29 Method for separating monaural overlapping speeches based on fractional Fourier transform (FrFT)

Publications (2)

Publication Number Publication Date
CN102054480A CN102054480A (en) 2011-05-11
CN102054480B true CN102054480B (en) 2012-05-30

Family

ID=43958735

Family Applications (1)

Application Number Title Priority Date Filing Date
CN2009102359018A Expired - Fee Related CN102054480B (en) 2009-10-29 2009-10-29 Method for separating monaural overlapping speeches based on fractional Fourier transform (FrFT)

Country Status (1)

Country Link
CN (1) CN102054480B (en)

Families Citing this family (16)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103854644B (en) * 2012-12-05 2016-09-28 中国传媒大学 The automatic dubbing method of monophonic multitone music signal and device
CN103117061B (en) * 2013-02-05 2016-01-20 广东欧珀移动通信有限公司 A kind of voice-based animals recognition method and device
CN104078051B (en) * 2013-03-29 2018-09-25 南京中兴软件有限责任公司 A kind of voice extracting method, system and voice audio frequency playing method and device
EP2980801A1 (en) 2014-07-28 2016-02-03 Fraunhofer-Gesellschaft zur Förderung der angewandten Forschung e.V. Method for estimating noise in an audio signal, noise estimator, audio encoder, audio decoder, and system for transmitting audio signals
CN106571150B (en) * 2015-10-12 2021-04-16 阿里巴巴集团控股有限公司 Method and system for recognizing human voice in music
CN106611604B (en) * 2015-10-23 2020-04-14 中国科学院声学研究所 Automatic voice superposition detection method based on deep neural network
CN105590633A (en) * 2015-11-16 2016-05-18 福建省百利亨信息科技有限公司 Method and device for generation of labeled melody for song scoring
CN106847267B (en) * 2015-12-04 2020-04-14 中国科学院声学研究所 Method for detecting overlapped voice in continuous voice stream
CN105551501B (en) * 2016-01-22 2019-03-15 大连民族大学 Harmonic signal fundamental frequency estimation algorithm and device
CN107657962B (en) * 2017-08-14 2020-06-12 广东工业大学 Method and system for identifying and separating throat sound and gas sound of voice signal
CN109065025A (en) * 2018-07-30 2018-12-21 珠海格力电器股份有限公司 A kind of computer storage medium and a kind of processing method and processing device of audio
CN109346109B (en) * 2018-12-05 2020-02-07 百度在线网络技术(北京)有限公司 Fundamental frequency extraction method and device
CN111125423A (en) * 2019-11-29 2020-05-08 维沃移动通信有限公司 Denoising method and mobile terminal
CN111613243B (en) * 2020-04-26 2023-04-18 云知声智能科技股份有限公司 Voice detection method and device
CN113362840B (en) * 2021-06-02 2022-03-29 浙江大学 General voice information recovery device and method based on undersampled data of built-in sensor
WO2023092368A1 (en) * 2021-11-25 2023-06-01 广州酷狗计算机科技有限公司 Audio separation method and apparatus, and device, storage medium and program product

Also Published As

Publication number Publication date
CN102054480A (en) 2011-05-11


Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant
C17 Cessation of patent right
CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20120530

Termination date: 20121029