CN102760435A - Frequency-domain blind deconvolution method for voice signal - Google Patents

Frequency-domain blind deconvolution method for voice signal

Info

Publication number
CN102760435A
CN102760435A (application numbers CN2012102278402A, CN201210227840A)
Authority
CN
China
Prior art keywords: omega, signal, domain, voice signal, sum
Prior art date
Legal status: Pending (the legal status is an assumption and is not a legal conclusion)
Application number
CN2012102278402A
Other languages
Chinese (zh)
Inventor
丁志中
黄玉雷
戴礼荣
陈小平
Current Assignee
Hefei University of Technology
Original Assignee
Hefei University of Technology
Priority date
Filing date
Publication date
Application filed by Hefei University of Technology filed Critical Hefei University of Technology
Priority to CN2012102278402A priority Critical patent/CN102760435A/en
Publication of CN102760435A publication Critical patent/CN102760435A/en
Pending legal-status Critical Current

Landscapes

  • Complex Calculations (AREA)

Abstract

The invention discloses a frequency-domain blind deconvolution method for a voice signal, comprising the following steps: converting the time-domain convolutively mixed voice signal to the frequency domain and then performing blind separation. Exploiting the short-time stationarity of speech, the time-domain convolutive mixture is transformed via the windowed Fourier transform into a frequency-domain linear instantaneous mixture model; after pre-processing such as filtering and whitening in the frequency domain, segment-wise blind separation of the voice signal is achieved by approximate joint diagonalization of correlation matrices at different time delays; and after the ambiguities of blind separation are resolved, the separated segments are recombined in the time domain via the inverse Fourier transform. The disclosed method achieves good separation on 2×2 real-recorded mixed voice signals and can effectively improve the voice-recognition accuracy of a human-computer interaction system in an environment with interfering speech from other people.

Description

A frequency-domain blind deconvolution method for voice signals
Technical field
The invention belongs to the field of voice-signal extraction and recognition within multimedia information processing, and specifically relates to a frequency-domain blind deconvolution method for voice signals, applicable to improving the interactive recognition rate in human-machine interaction scenarios.
Background technology
Automatic speech recognition has developed for more than 60 years, and in quiet or noise-free environments its recognition rate exceeds 95%. In practical environments, however, and especially when two or more speakers talk simultaneously, the recognition rate drops sharply, which greatly limits the application of the technology in human-machine interaction (HMI). The human auditory system can extract the information it is interested in from a noisy environment, but a robot in an HMI environment hardly has this ability. Blind signal separation is a technique that estimates the original signals solely from the mixtures acquired by the receiving sensors, when both the original signals and the transmission channels are unknown.
Blind separation in the HMI environment belongs to the category of blind deconvolution. For convolutive mixtures, or mixed voice signals recorded in real environments, academia mainly uses two approaches: time-domain blind deconvolution and frequency-domain blind deconvolution. Time-domain blind deconvolution extends the scalar mixing matrix of the ICA framework for linear instantaneous mixtures to a filter mixing matrix for convolutive mixtures, with corresponding modifications to the objective function and the iterative algorithm. The basic idea of frequency-domain blind deconvolution is to transform the time-domain convolutive mixture into frequency-domain instantaneous mixtures via the short-time Fourier transform, and then separate the frequency-domain mixtures with the relatively mature blind separation algorithms for instantaneous mixtures: at each frequency bin an instantaneous blind separation algorithm is applied, the permutation and amplitude ambiguities of the output signals are resolved, and the separated time-domain signals are obtained through the inverse Fourier transform.
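The STFT/inverse-STFT framing that this frequency-domain approach rests on can be sketched as follows. This is an illustrative NumPy sketch, not code from the patent; the Hann window and the frame length and hop used here are assumed values (the embodiment below uses 16 ms frames with a 2 ms shift):

```python
import numpy as np

def stft(x, frame_len=256, hop=32):
    """Windowed short-time Fourier transform in the style of Eq. (2):
    one FFT per windowed frame starting at t_s = k*hop."""
    w = np.hanning(frame_len)
    n_frames = 1 + (len(x) - frame_len) // hop
    return np.stack([np.fft.fft(x[k*hop:k*hop+frame_len] * w)
                     for k in range(n_frames)])

def istft(X, hop=32):
    """Overlap-add inverse in the style of Eqs. (11)-(12): sum the
    inverse FFTs of all frames, then divide by the summed window W(t)."""
    n_frames, frame_len = X.shape
    w = np.hanning(frame_len)
    out = np.zeros((n_frames - 1) * hop + frame_len)
    W = np.zeros_like(out)                      # W(t) of Eq. (12)
    for k in range(n_frames):
        out[k*hop:k*hop+frame_len] += np.fft.ifft(X[k]).real
        W[k*hop:k*hop+frame_len] += w
    W[W == 0] = 1.0                             # guard the frame edges
    return out / W
```

Because each inverse FFT returns the windowed segment x(t)w(t − t_s), dividing the overlap-added sum by W(t) reconstructs x(t) wherever W(t) > 0; in the full method, per-bin demixing matrices would be applied to the spectra between the two transforms.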
The drawback of time-domain blind deconvolution is its excessive computational load: especially when the mixing filters are complex, solving for each filter order depends on solving for all the other orders. For example, in the diagonal-constant separation-matrix algorithm proposed by Chan, the algorithm separates the source mixtures quickly when the mixing-filter order is 5 or below, but when the order is 6 or above the separation speed drops markedly and the separation quality degrades. In the frequency-domain algorithm, by contrast, the separation at each frequency bin is independent, so the mixing-filter order affects the computational load far less than in time-domain algorithms.
Existing blind deconvolution methods are few, both at home and abroad, and they have the following shortcomings:
1) Most algorithms are derived under restrictive conditions; the separation quality is unsatisfactory, the cross-interference between separated signals is large, and robustness is low.
2) In real-environment human-machine interaction, the recognition accuracy is not high.
3) Existing algorithms search slowly and have poor real-time performance, so they cannot be applied well to real-time human-machine interaction scenarios.
Summary of the invention
In view of the above deficiencies of the prior art, the present invention discloses a frequency-domain blind deconvolution method for voice signals. The method performs blind separation by transforming the time-domain convolutive mixture to the frequency domain; the separation quality is good, and the method is applicable to the field of speech recognition.
The present invention solves the technical problem with the following technical scheme:
A frequency-domain blind deconvolution method for voice signals, characterized in that the time-domain convolutively mixed voice is transformed to the frequency domain for blind separation, comprising the following steps:
1) Adaptively divide the original audio file into frames; when the sampling frequency is 16 kHz, the frame length is 16 ms and the frame shift is 2 ms;
2) Apply the Fourier transform to the single-frame data to change the convolutive mixing model into a linear mixing model. The convolutive mixture model can be expressed as
x(t) = H ⊗ s(t)  (⊗ denotes convolution)    (1)
The short-time Fourier transform of the signal can be expressed as
X(ω, t_s) = ∑_t e^{-jωt} x(t) w(t - t_s)    (2)
where X(ω, t_s) is the short-time Fourier transform of x(t) and w(t) is a window function. Assuming the mixing system is time-invariant, Eq. (1) gives
X(ω, t_s) = H(ω) S(ω, t_s)    (3)
where H(ω) and S(ω, t_s) are the Fourier transforms of the mixing filter H(p) and the source signal s(t), respectively; H(ω) can be estimated independently at each frequency bin;
3) Whiten the input signal by eigenvalue decomposition. The covariance matrix of the mixed signal can be decomposed as
R_x(0) = (1/T) ∑_{t=0}^{T-1} x(t) x*(t) = Q Λ Q^{-1}    (4)
where Λ = diag(d_1, d_2, …, d_n) is a diagonal matrix whose elements are the eigenvalues of the covariance matrix R_x(0), Q is the matrix of the corresponding eigenvectors, Q^{-1} is the inverse of Q, and * denotes the conjugate transpose. The whitening matrix V can be expressed as
V = Λ^{-1/2} Q^{-1}    (5)
where Λ^{-1/2} is the square root of the inverse of the eigenvalue matrix Λ;
4) Jointly diagonalize the correlation matrices, i.e. find a rotation matrix U that minimizes
∑_{τ=1}^{r} ∑_{i≠j} |(U R_z(τ) U*)_{ij}|²    (6)
where R_z(τ) is defined as
R_z(τ) = (1/T) ∑_{t=0}^{T-1} z(t) z*(t + τ),  τ = 1, 2, …, r    (7)
The frequency-domain demixing matrix W is then
W = U V    (8)
5) Define the output signal spectra Y_1(ω) and Y_2(ω); the correlation coefficient of their amplitudes a_1(ω) and a_2(ω) is
r(a_1(ω), a_2(ω)) = cov(a_1(ω), a_2(ω)) / √(D(a_1(ω)) D(a_2(ω)))    (9)
where the covariance is the sample covariance over the analysis windows,
cov(a_1(ω), a_2(ω)) = (1/M) ∑_{m=1}^{M} (a_1(ω, m) - ā_1(ω)) (a_2(ω, m) - ā_2(ω))    (10)
and a_1(ω, m) denotes the amplitude of the frequency-ω component of the first signal in the m-th window;
6) Compute the parameters |r(a_1(ω_m), a_1(ω_{m+1}))|, |r(a_2(ω_m), a_2(ω_{m+1}))|, |r(a_1(ω_m), a_2(ω_{m+1}))| and |r(a_2(ω_m), a_1(ω_{m+1}))| to determine the reassembly (permutation) of the signals;
7) Compute the short-time inverse Fourier transform of Eq. (2):
x(t) = (1/2π) (1/W(t)) ∑_{t_s} ∑_ω e^{jω(t - t_s)} X(ω, t_s)    (11)
where
W(t) = ∑_{t_s} w(t - t_s)    (12)
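Steps 3) and 4) above can be illustrated with the following sketch. This is illustrative NumPy code, not the patent's implementation: for simplicity the rotation U is obtained from the eigenvectors of the summed, symmetrized delayed correlation matrices, which minimizes criterion (6) exactly when the R_z(τ) share eigenvectors (as in the noiseless model); a general Jacobi-style approximate joint diagonalizer would be used in practice.

```python
import numpy as np

def whiten(x):
    """Eigen-decomposition whitening, Eqs. (4)-(5): z = V x with
    V = Lambda^{-1/2} Q^{-1}, so that R_z(0) = I."""
    T = x.shape[1]
    Rx = (x @ x.T) / T                     # Eq. (4), real-valued case
    d, Q = np.linalg.eigh(Rx)
    V = np.diag(d ** -0.5) @ Q.T           # Eq. (5); Q orthogonal, Q^{-1} = Q^T
    return V @ x, V

def find_rotation(z, delays=(1, 2, 3)):
    """Step 4: a rotation U that (approximately) joint-diagonalizes the
    delayed correlations R_z(tau) of Eq. (7). The symmetrized R_z(tau)
    are summed and eigendecomposed; the eigenvector basis diagonalizes
    every term when the matrices share eigenvectors."""
    n, T = z.shape
    M = np.zeros((n, n))
    for tau in delays:
        R = (z[:, :-tau] @ z[:, tau:].T) / (T - tau)   # Eq. (7)
        M += R + R.T                                   # symmetrize
    _, U = np.linalg.eigh(M)
    return U.T                                         # rows form the rotation
```

Applying W = find_rotation(z) @ V to the mixture, as in Eq. (8), recovers the sources up to the usual permutation and scaling ambiguities, which steps 5) and 6) then resolve.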
According to the short-time stationarity of the voice signal, the present invention transforms the time-domain convolutive mixture into a frequency-domain linear instantaneous mixture model through the windowed Fourier transform. After pre-processing such as filtering and whitening in the frequency domain, segment-wise blind separation of the voice signal is achieved by approximate joint diagonalization of the correlation matrices at different time delays. After the ambiguities of blind separation are resolved, the separated segments are recombined in the time domain through the inverse Fourier transform.
The beneficial effects of the present invention are:
1) The invention achieves good separation of 2×2 real-recorded mixed voice signals and can effectively improve the voice-recognition accuracy of a human-machine interaction system in an environment with interfering speech from other people.
2) The invention performs blind separation by transforming the time-domain convolutive mixture to the frequency domain; the separation quality is good, and the method is applicable to the field of speech recognition.
Brief description of the drawings
Fig. 1 is the system flowchart of the method of the present invention.
Embodiment
With reference to Fig. 1, a frequency-domain blind deconvolution method for voice signals transforms the time-domain convolutively mixed voice to the frequency domain for blind separation, comprising the following steps:
1) Adaptively divide the original audio file into frames; when the sampling frequency is 16 kHz, the frame length is 16 ms and the frame shift is 2 ms;
2) Apply the Fourier transform to the single-frame data to change the convolutive mixing model into a linear mixing model. The convolutive mixture model can be expressed as
x(t) = H ⊗ s(t)  (⊗ denotes convolution)    (1)
The short-time Fourier transform of the signal can be expressed as
X(ω, t_s) = ∑_t e^{-jωt} x(t) w(t - t_s)    (2)
where X(ω, t_s) is the short-time Fourier transform of x(t) and w(t) is a window function. Assuming the mixing system is time-invariant, Eq. (1) gives
X(ω, t_s) = H(ω) S(ω, t_s)    (3)
where H(ω) and S(ω, t_s) are the Fourier transforms of the mixing filter H(p) and the source signal s(t), respectively; H(ω) can be estimated independently at each frequency bin;
3) Whiten the input signal by eigenvalue decomposition. The covariance matrix of the mixed signal can be decomposed as
R_x(0) = (1/T) ∑_{t=0}^{T-1} x(t) x*(t) = Q Λ Q^{-1}    (4)
where Λ = diag(d_1, d_2, …, d_n) is a diagonal matrix whose elements are the eigenvalues of the covariance matrix R_x(0), Q is the matrix of the corresponding eigenvectors, and Q^{-1} is the inverse of Q. The whitening matrix V can be expressed as
V = Λ^{-1/2} Q^{-1}    (5)
where Λ^{-1/2} is the square root of the inverse of the eigenvalue matrix Λ;
4) Jointly diagonalize the correlation matrices, i.e. find a rotation matrix U that minimizes
∑_{τ=1}^{r} ∑_{i≠j} |(U R_z(τ) U*)_{ij}|²    (6)
where R_z(τ) is defined as
R_z(τ) = (1/T) ∑_{t=0}^{T-1} z(t) z*(t + τ),  τ = 1, 2, …, r    (7)
The frequency-domain demixing matrix W is then
W = U V    (8)
5) Define the output signal spectra Y_1(ω) and Y_2(ω); the correlation coefficient of their amplitudes a_1(ω) and a_2(ω) is
r(a_1(ω), a_2(ω)) = cov(a_1(ω), a_2(ω)) / √(D(a_1(ω)) D(a_2(ω)))    (9)
where the covariance is the sample covariance over the analysis windows,
cov(a_1(ω), a_2(ω)) = (1/M) ∑_{m=1}^{M} (a_1(ω, m) - ā_1(ω)) (a_2(ω, m) - ā_2(ω))    (10)
and a_1(ω, m) denotes the amplitude of the frequency-ω component of the first signal in the m-th window;
6) Compute the parameters |r(a_1(ω_m), a_1(ω_{m+1}))|, |r(a_2(ω_m), a_2(ω_{m+1}))|, |r(a_1(ω_m), a_2(ω_{m+1}))| and |r(a_2(ω_m), a_1(ω_{m+1}))| to determine the reassembly (permutation) of the signals;
7) Compute the short-time inverse Fourier transform of Eq. (2):
x(t) = (1/2π) (1/W(t)) ∑_{t_s} ∑_ω e^{jω(t - t_s)} X(ω, t_s)    (11)
where
W(t) = ∑_{t_s} w(t - t_s)    (12)
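Steps 5) and 6) of the embodiment — resolving the per-bin permutation ambiguity by correlating amplitude envelopes of adjacent frequency bins — can be sketched as follows. This is an illustrative sketch, not the patent's code; the array layout, the greedy bin-by-bin sweep, and the envelope construction in the test are assumptions:

```python
import numpy as np

def amp_corr(a, b):
    """Correlation coefficient of two amplitude sequences over the
    analysis windows m, Eq. (9); Eq. (10) is the sample covariance."""
    a = a - a.mean()
    b = b - b.mean()
    denom = np.sqrt((a @ a) * (b @ b))
    return (a @ b) / denom if denom > 0 else 0.0

def align_permutations(A1, A2):
    """Greedy sweep over adjacent frequency bins: at bin m+1, keep or
    swap the two outputs depending on which pairing gives the larger
    |r(.)| scores of step 6. A1, A2 hold the amplitudes |Y_1|, |Y_2|
    with shape (n_bins, n_windows); returns a per-bin swap flag."""
    swapped = np.zeros(A1.shape[0], dtype=bool)
    for m in range(A1.shape[0] - 1):
        # envelopes of bin m after undoing any swap already detected
        e1, e2 = (A2[m], A1[m]) if swapped[m] else (A1[m], A2[m])
        keep = abs(amp_corr(e1, A1[m + 1])) + abs(amp_corr(e2, A2[m + 1]))
        swap = abs(amp_corr(e1, A2[m + 1])) + abs(amp_corr(e2, A1[m + 1]))
        swapped[m + 1] = swap > keep
    return swapped
```

In the full method, the resulting flags would be applied to the separated spectra Y_1, Y_2 before the inverse transform of step 7); the absolute values make the decision insensitive to the per-bin amplitude ambiguity.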
Table 1 below compares the performance of the present invention with two typical blind deconvolution methods.
Table 1 (provided as an image in the original publication)

Claims (1)

1. A frequency-domain blind deconvolution method for voice signals, characterized in that the time-domain convolutively mixed voice is transformed to the frequency domain for blind separation, comprising the following steps:
1) Adaptively divide the original audio file into frames; when the sampling frequency is 16 kHz, the frame length is 16 ms and the frame shift is 2 ms;
2) Apply the Fourier transform to the single-frame data to change the convolutive mixing model into a linear mixing model. The convolutive mixture model can be expressed as
x(t) = H ⊗ s(t)  (⊗ denotes convolution)    (1)
The short-time Fourier transform of the signal can be expressed as
X(ω, t_s) = ∑_t e^{-jωt} x(t) w(t - t_s)    (2)
where X(ω, t_s) is the short-time Fourier transform of x(t) and w(t) is a window function. Assuming the mixing system is time-invariant, Eq. (1) gives
X(ω, t_s) = H(ω) S(ω, t_s)    (3)
where H(ω) and S(ω, t_s) are the Fourier transforms of the mixing filter H(p) and the source signal s(t), respectively; H(ω) can be estimated independently at each frequency bin;
3) Whiten the input signal by eigenvalue decomposition. The covariance matrix of the mixed signal can be decomposed as
R_x(0) = (1/T) ∑_{t=0}^{T-1} x(t) x*(t) = Q Λ Q^{-1}    (4)
where Λ = diag(d_1, d_2, …, d_n) is a diagonal matrix whose elements are the eigenvalues of the covariance matrix R_x(0), Q is the matrix of the corresponding eigenvectors, and Q^{-1} is the inverse of Q. The whitening matrix V can be expressed as
V = Λ^{-1/2} Q^{-1}    (5)
where Λ^{-1/2} is the square root of the inverse of the eigenvalue matrix Λ;
4) Jointly diagonalize the correlation matrices, i.e. find a rotation matrix U that minimizes
∑_{τ=1}^{r} ∑_{i≠j} |(U R_z(τ) U*)_{ij}|²    (6)
where R_z(τ) is defined as
R_z(τ) = (1/T) ∑_{t=0}^{T-1} z(t) z*(t + τ),  τ = 1, 2, …, r    (7)
The frequency-domain demixing matrix W is then
W = U V    (8)
5) Define the output signal spectra Y_1(ω) and Y_2(ω); the correlation coefficient of their amplitudes a_1(ω) and a_2(ω) is
r(a_1(ω), a_2(ω)) = cov(a_1(ω), a_2(ω)) / √(D(a_1(ω)) D(a_2(ω)))    (9)
where the covariance is the sample covariance over the analysis windows,
cov(a_1(ω), a_2(ω)) = (1/M) ∑_{m=1}^{M} (a_1(ω, m) - ā_1(ω)) (a_2(ω, m) - ā_2(ω))    (10)
and a_1(ω, m) denotes the amplitude of the frequency-ω component of the first signal in the m-th window;
6) Compute the parameters |r(a_1(ω_m), a_1(ω_{m+1}))|, |r(a_2(ω_m), a_2(ω_{m+1}))|, |r(a_1(ω_m), a_2(ω_{m+1}))| and |r(a_2(ω_m), a_1(ω_{m+1}))| to determine the reassembly (permutation) of the signals;
7) Compute the short-time inverse Fourier transform of Eq. (2):
x(t) = (1/2π) (1/W(t)) ∑_{t_s} ∑_ω e^{jω(t - t_s)} X(ω, t_s)    (11)
where
W(t) = ∑_{t_s} w(t - t_s)    (12)
CN2012102278402A 2012-07-03 2012-07-03 Frequency-domain blind deconvolution method for voice signal Pending CN102760435A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN2012102278402A CN102760435A (en) 2012-07-03 2012-07-03 Frequency-domain blind deconvolution method for voice signal

Publications (1)

Publication Number Publication Date
CN102760435A (en) 2012-10-31

Family

ID=47054877



Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104934041A (en) * 2015-05-07 2015-09-23 西安电子科技大学 Convolutive blind signal separation method based on multi-target optimization joint block diagonalization
CN105324762A (en) * 2013-06-25 2016-02-10 歌拉利旺株式会社 Filter coefficient group computation device and filter coefficient group computation method
CN105825866A (en) * 2016-05-24 2016-08-03 天津大学 Real-time convolutive mixed blind signal separation adaptive step length method based on fuzzy system
CN106023984A (en) * 2016-04-28 2016-10-12 成都之达科技有限公司 Speech recognition method based on car networking
CN107563300A (en) * 2017-08-08 2018-01-09 浙江上风高科专风实业有限公司 Noise reduction preconditioning technique based on prewhitening method
CN110265060A (en) * 2019-06-04 2019-09-20 广东工业大学 A kind of speaker's number automatic testing method based on Density Clustering
CN116866116A (en) * 2023-07-13 2023-10-10 中国人民解放军战略支援部队航天工程大学 Time-delay mixed linear blind separation method



Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C02 Deemed withdrawal of patent application after publication (patent law 2001)
WD01 Invention patent application deemed withdrawn after publication

Application publication date: 20121031