CN110706709B - Multi-channel convolution aliasing voice channel estimation method combined with video signal - Google Patents

Multi-channel convolution aliasing voice channel estimation method combined with video signal Download PDF

Info

Publication number
CN110706709B
Authority
CN
China
Prior art keywords
speaker
video
matrix
voice
aliasing
Prior art date
Legal status: Active
Application number
CN201910816592.7A
Other languages
Chinese (zh)
Other versions
CN110706709A (en)
Inventor
杨俊杰
杨祖元
谢胜利
杨超
解元
Current Assignee: Guangdong University of Technology
Original Assignee
Guangdong University of Technology
Priority date
Filing date
Publication date
Application filed by Guangdong University of Technology filed Critical Guangdong University of Technology
Priority to CN201910816592.7A priority Critical patent/CN110706709B/en
Publication of CN110706709A publication Critical patent/CN110706709A/en
Application granted granted Critical
Publication of CN110706709B publication Critical patent/CN110706709B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G10L15/25: Speech recognition using non-acoustical features (position of the lips, movement of the lips or face analysis)
    • G06F18/23: Pattern recognition; analysing; clustering techniques
    • G10L25/03: Speech or voice analysis techniques characterised by the type of extracted parameters
    • G10L25/27: Speech or voice analysis techniques characterised by the analysis technique
    • G10L25/45: Speech or voice analysis techniques characterised by the type of analysis window
    • G10L25/57: Speech or voice analysis techniques specially adapted for comparison or discrimination, for processing of video signals
    • H04L25/0202: Baseband systems; channel estimation

Abstract

The invention discloses a multi-channel convolutive aliased speech channel estimation method combined with video signals, which introduces new mathematical tools and analysis methods and fuses video and audio information to achieve effective estimation of the convolutive speech aliasing channel. The method extracts speaker mouth-shape feature data from mouth-region video signals by non-negative matrix factorization. It then detects the cluster center of the mouth feature data with a density clustering method, identifies the image frames in which the speaker's mouth is silent, and thereby extracts all time windows dominated by a single speaker's voice. According to this locally dominant time-window information, locally dominant covariance matrices are computed from the time-frequency-domain observed speech components, and the dominant eigenvectors are extracted by eigenvalue decomposition, realizing aliased speech channel estimation. Numerical experiments demonstrate the superiority of the method over currently popular single-modality (audio-only) aliased speech channel estimation methods.

Description

Multi-channel convolution aliasing voice channel estimation method combined with video signal
Technical Field
The invention relates to the field of voice signal processing, in particular to a multi-channel convolution aliasing voice channel estimation method combined with a video signal.
Background
The task of Audio Speech Separation (ASS) is to separate the voice of a target speaker from the mixed speech of multiple speakers received by a microphone by means of signal processing. This is a very challenging problem in the field of signal processing. Before complete speech separation can be achieved, obtaining the aliasing channel information is a key step of the separation problem. In practice, processing the speech problem with the aid of a video signal can overcome the interference of background noise and provide more accurate information about the speaker's speaking state, compensating for the shortcomings of single-modality audio processing of mixed speech in noisy, highly reverberant environments.
In actual recording scenarios, the speech signal is affected by room reverberation and background noise, and the recorded speech is usually the aliased synthesis of multiple fading paths, which can be described mathematically as a convolutive aliasing model. Owing to high reverberation and strong background noise, the indoor convolutive speech mixing system is complex and the aliasing channel information is difficult to obtain, which greatly complicates subsequent speech separation. For single-modality audio signals, methods that transform the observed speech into the time-frequency domain for batch processing are popular for solving the aliasing channel estimation problem in reverberant and noisy environments, such as the currently popular PARAFAC-SC algorithm and the Bayes-RisMin algorithm. However, under high reverberation and high noise these existing techniques easily suffer from crosstalk between source signals, and the final aliasing channel estimates are not ideal.
Disclosure of Invention
The invention aims to provide a multi-channel convolutive aliased speech channel estimation method combined with a video signal, which addresses the problem that existing algorithms do not estimate the aliasing channel accurately enough.
To accomplish this task, the invention adopts the following technical scheme:
a method for multi-channel convolution aliasing speech channel estimation in conjunction with a video signal, comprising the steps of:
collecting video data of a plurality of speakers and editing the speakers' mouth-region video images to form a video database; simultaneously recording each speaker's speech signal to construct an audio database; and synthesizing a plurality of multi-channel convolutive aliased speech signals from the audio database;
performing non-negative matrix factorization on the vectorized representation matrix of the speakers' mouth-region video images to obtain an image feature matrix and an image representation matrix respectively; and modeling the multi-channel convolutive aliased speech signal mathematically in the time-frequency domain via the short-time Fourier transform;
performing density clustering column by column on the image representation matrix of each single speaker to search out the maximum-density cluster center, setting a threshold to obtain the index set of data points neighboring the maximum-density cluster center, taking this index set as the speaker's mouth silent-state data set and its complement as the speaker's voicing-state data set; and performing union and intersection operations on the silent-state and voicing-state data sets of the plurality of speakers to detect the locally dominant set of each single speaker;
computing, according to the locally dominant set of each single speaker, the time-frequency-domain second-order covariance matrix sequence over the corresponding time windows, and extracting the dominant eigenvector from each covariance matrix to form the estimated aliasing channel.
Furthermore, collecting the video data of a plurality of speakers and editing the mouth-region video images to form a video database, while recording each speaker's speech signal to construct an audio database, comprises the following steps:
recording frontal speaking videos of a plurality of speakers with a camera, keeping a short pause after each recited sentence, and editing the mouth-region video images to form the video database; and recording each speaker's clean speech signal with a microphone while recording the video, to construct the audio database.
Further, the non-negative matrix factorization of the vectorized representation matrix of the speaker's mouth-region video images into an image feature matrix and an image representation matrix is expressed as:
V_i = W_i H_i
where V_i denotes the vectorized representation matrix of the speaker's mouth-region video images, the image feature matrix is W_i = [w_{i,1}, ..., w_{i,K}] ∈ (R^+)^{P×K}, and the image representation matrix is H_i = [h_{i,1}, ..., h_{i,Q}] ∈ (R^+)^{K×Q}; i denotes the i-th speaker, P is the number of pixels per video frame, K is the number of columns of the image feature matrix, Q is the number of columns of the image representation matrix, R denotes the set of real numbers, and K << Q. Every column of H_i has unit norm, i.e. ||h_{i,q}||_2 = 1, q = 1, ..., Q.
Further, the mathematical modeling of the multi-channel convolutive aliased speech signal in the time-frequency domain by the short-time Fourier transform is expressed as:
x_{f,d} = A_f s_{f,d} + e_{f,d}
where A_f is the aliasing channel at frequency bin f in the complex field, s_{f,d} is the speech source component at time-frequency point (f, d), and e_{f,d} is Gaussian noise.
Further, when density clustering is performed column by column on the image representation matrix of a single speaker, the local density evaluation index ρ_{iq} of the i-th speaker is expressed as:
ρ_{iq} = Σ_{k=1}^{Q} 1(φ_{i,qk} < ε), q = 1, ..., Q
where φ_{i,qk} is defined as the Euclidean distance between columns h_{i,q} and h_{i,k} of the image representation matrix H_i, 1(·) is the indicator function, and ε is a preset Euclidean distance threshold.
Further, setting a threshold to obtain the index set of data points neighboring the maximum-density cluster center comprises:
setting a distance threshold μ, and labelling as Φ_i the index set of all image representation vector data points whose distance to the maximum-density cluster center is below the threshold.
Further, the time-frequency-domain second-order covariance matrix sequence corresponding to the respective time windows is calculated as:
R_{i,f} = (1 / |g(Ψ_i)|) Σ_{d ∈ g(Ψ_i)} x_{f,d} x_{f,d}^H
where g(Ψ_i) is the mapping function that converts the locally dominant set Ψ_i of a single speaker into the corresponding set of speech time-frequency frames.
Further, the dominant eigenvector is the eigenvector corresponding to the largest eigenvalue.
Compared with the prior art, the invention has the following technical characteristics:
the method detects a local dominant time window of a single speaker in a video image by means of video image detection of a speaker mouth region and introducing a mathematical tool (a non-negative matrix decomposition and density clustering method), and meanwhile constructs a time-frequency domain voice local covariance statistical matrix from an audio signal and extracts a dominant feature vector so as to estimate an aliasing channel; series of experiments prove that the algorithm has better estimation performance than other single audio mode algorithms.
Drawings
FIG. 1 is a diagram of a clean speech signal;
FIG. 2 is a diagram of an aliased speech signal;
FIG. 3(a) and (b) are mouth images of speaker 1 and speaker 2, respectively;
FIG. 4 is a schematic diagram of the density clustering effect of the feature data of the mouth image of the speaker 1;
FIG. 5 is a schematic diagram of a single speaker local dominance detection effect based on a mouth representation matrix;
FIG. 6 is a schematic flow chart of the method of the present invention.
Detailed Description
The invention provides a multi-channel convolutive aliased speech channel estimation method combined with video signals. In video, the mouth-region video signals of the N speakers are denoted V_1, ..., V_N, where V_i ∈ R^{P×Q} is the vectorized representation of the i-th speaker's mouth-region video, P is the number of pixels per video frame, Q is the total number of video frames, and i = 1, ..., N. In audio, the convolutive speech aliasing system is x(t) = A * s(t) + e(t), where x(t) ∈ R^M denotes the observed speech signals collected by the M microphones, A ∈ R^{M×N×L} is the L-th-order aliasing channel matrix under reverberation, * denotes convolution, s(t) ∈ R^N is the clean speech signal, and e(t) ∈ R^M is the system noise. The aim of the invention is to estimate the convolutive aliased speech channel A by combining the video and audio signals.
Step 1, collecting video data of a plurality of speakers, and editing video images of mouth regions of the speakers to form a video database; meanwhile, recording the voice signal of each speaker, and constructing an audio database; a plurality of multi-channel convolved aliased speech signals are synthesized using an audio database.
Firstly, recording front speaking videos of a plurality of speakers through a camera, keeping a certain pause when the speakers recite each sentence, and editing video images of a mouth area to form a video database; and recording a pure voice signal of a speaker through a microphone while recording the video to construct an audio database.
In this embodiment, three speech aliasing schemes are synthesized, with M = 2 or 3 microphones and N = 2, 3 or 4 speakers. The recorded speech has a sampling rate of fs = 8000 Hz and each recording is 40 seconds long. The microphone spacing is set to 0.05 m, the spacing between speakers to 0.4 m, and the distance between the microphone center and the speakers to 1.2 m. The reverberation times are set to RT60 = 100 ms, 150 ms, 200 ms and 250 ms, and the room impulse responses are generated with the image-based RIR algorithm (J. Allen and D. Berkley, "Image method for efficiently simulating small-room acoustics," J. Acoust. Soc. Amer., 65(4), 1979). The speakers' videos are recorded with a Samsung I9100 mobile phone at a frame rate of fps = 25, each image being 90 × 110 pixels. The short-time Fourier window length is set to 2048.
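Although the experiments described here were run in Matlab, the convolutive synthesis of the observations can be sketched in Python as follows. This is a minimal illustration assuming the room impulse responses have already been produced by an image-method generator; the function name and the `rirs` layout are illustrative assumptions, not part of the original.

```python
import numpy as np
from scipy.signal import fftconvolve

def synthesize_mixture(sources, rirs, noise_std=0.0):
    """Convolve N clean speech sources with M x N room impulse responses.

    sources : list of N 1-D arrays (clean speech, e.g. fs = 8000 Hz)
    rirs    : nested list, rirs[m][n] is the impulse response from
              speaker n to microphone m (e.g. produced by the image method)
    Returns an M x T array of convolutively aliased observations x(t).
    """
    M, N = len(rirs), len(sources)
    T = max(len(s) for s in sources)
    x = np.zeros((M, T))
    for m in range(M):
        for n in range(N):
            y = fftconvolve(sources[n], rirs[m][n])[:T]
            x[m, :len(y)] += y
    if noise_std > 0.0:                      # optional sensor noise e(t)
        x += noise_std * np.random.randn(*x.shape)
    return x
```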
Step 2, carrying out non-negative matrix decomposition on the vectorization expression matrix of the video image of the mouth region of the speaker to respectively obtain an image characteristic matrix and an image expression matrix so as to extract the characteristics of the video image of the mouth region of the speaker; a multi-channel convolved aliased speech signal is mathematically modeled in the time-frequency domain by a short-time fourier transform.
Because the video image array is large, processing directly in the image domain is computationally expensive and increases the algorithm complexity. The scheme therefore extracts the video image feature information by non-negative matrix factorization, reducing the dimensionality of the mouth-region images.
The vectorized representation matrix V_i of the speaker's mouth-region video images is factorized by non-negative matrix factorization, expressed as:
V_i = W_i H_i
where the image feature matrix is W_i = [w_{i,1}, ..., w_{i,K}] ∈ (R^+)^{P×K} and the image representation matrix is H_i = [h_{i,1}, ..., h_{i,Q}] ∈ (R^+)^{K×Q}; i denotes the i-th speaker, P is the number of pixels per video frame, K is the number of columns of the image feature matrix, Q is the number of columns of the image representation matrix, R denotes the set of real numbers, and K << Q. Every column of H_i has unit norm, i.e. ||h_{i,q}||_2 = 1, q = 1, ..., Q.
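As an illustrative sketch of this factorization (the patent does not prescribe a particular NMF solver), scikit-learn's NMF can be used, followed by the column normalization of H_i described above; the function name and solver settings are assumptions.

```python
import numpy as np
from sklearn.decomposition import NMF

def mouth_nmf_features(V, K):
    """Factorize the P x Q mouth-region video matrix as V ~= W H and
    normalize every column of H to unit length for the later clustering."""
    model = NMF(n_components=K, init='nndsvda', max_iter=500)
    W = model.fit_transform(V)            # P x K image feature matrix W_i
    H = model.components_                 # K x Q image representation matrix H_i
    norms = np.linalg.norm(H, axis=0)
    norms[norms == 0] = 1.0               # guard against all-zero frames
    H = H / norms                         # ||h_{i,q}||_2 = 1, q = 1..Q
    return W, H
```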
Mathematically modeling a multi-channel convolved aliased speech signal x (t) in the time-frequency domain using a short-time Fourier transform:
there are N signals (N2, 3,4), aliasing occurs when M microphones receive (M2, 3), and aliasing speech signal component x in time-frequency points (f, d)f,dExpressed as:
xf,d=Afsf,d+ef,d
wherein A isf=[af,1,...,af,N]Is an alias channel, s, at a frequency point f in the complex fieldf,dIs the speech source component on the time frequency point (f, d), ef,dIs gaussian noise.
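A minimal sketch of taking the M-channel observation to the time-frequency domain, using the 2048-sample window length stated in the embodiment; the use of scipy.signal.stft and the helper name are assumptions made for illustration.

```python
import numpy as np
from scipy.signal import stft

def multichannel_stft(x, fs=8000, nperseg=2048):
    """STFT of the M-channel observation x (shape M x T).

    Returns frequencies f, frame times d and X of shape (F, D, M) so that
    X[f, d, :] is the observation vector x_{f,d} in the model
    x_{f,d} = A_f s_{f,d} + e_{f,d}.
    """
    f, d, Z = stft(x, fs=fs, nperseg=nperseg)   # Z has shape (M, F, D)
    X = np.transpose(Z, (1, 2, 0))              # reorder to (F, D, M)
    return f, d, X
```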
Step 3: density clustering is performed column by column on the image representation matrix H_i of a single speaker i to find the maximum-density cluster center, and a threshold μ is set to obtain the index set Φ_i of data points neighboring that center. Φ_i is taken as the silent-state data set of speaker i's mouth, and its complement, denoted Φ_i^c, is taken as the speaker's voicing-state data set.
Union and intersection operations on the silent-state and voicing-state data sets of the N speakers then detect the locally dominant set of each single speaker, denoted Ψ_1, ..., Ψ_N.
In this step, the local density evaluation index of the i-th speaker, ρ_{iq}, q = 1, ..., Q, is calculated as:
ρ_{iq} = Σ_{k=1}^{Q} 1(φ_{i,qk} < ε)
where φ_{i,qk} is defined as the Euclidean distance between columns h_{i,q} and h_{i,k} of the image representation matrix H_i, 1(·) is the indicator function, and ε is a preset Euclidean distance threshold, chosen for example so that the smallest 6%-8% of the distances {φ_{i,qk}}, q, k = 1, ..., Q (sorted in ascending order) fall below it. The local density indices ρ_{i1}, ..., ρ_{iQ} are extracted for every speaker i = 1, ..., N.
The maximum-density cluster center is searched out and a distance threshold μ is set (in this embodiment μ ≈ 0.3) to obtain the mouth silent-state data set of speaker i: the index set of all image representation vectors whose distance to the maximum-density cluster center is below the threshold is labelled Φ_i (the silent-state data set), and the speaker's voicing-state data set is its complement, denoted Φ_i^c.
The locally dominant set of a single speaker is then detected through the intersection operation:
Ψ_i = Φ_i^c ∩ ( ∩_{j ≠ i} Φ_j ), i = 1, ..., N.
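The following sketch illustrates Step 3 under the reconstruction given above: local density by neighbor counting within a distance threshold, silent-state sets from the neighbors of the maximum-density center, and the intersection rule for the locally dominant sets. The quantile-based threshold and the helper names are assumptions, not the patent's exact procedure.

```python
import numpy as np

def silence_index_set(H, eps_quantile=0.07, mu=0.3):
    """Silent-state frame index set Phi_i of one speaker from the K x Q
    representation matrix H (columns = per-frame representation vectors)."""
    # pairwise Euclidean distances phi_{qk} between representation columns
    D = np.linalg.norm(H[:, :, None] - H[:, None, :], axis=0)
    eps = np.quantile(D, eps_quantile)        # roughly the smallest 6%-8%
    rho = np.sum(D < eps, axis=1)             # local density of each frame
    center = int(np.argmax(rho))              # maximum-density cluster center
    return set(np.where(D[center] < mu)[0])   # neighbors of that center

def dominant_sets(Phi, Q):
    """Psi_i: frames where speaker i is voicing while all others are silent."""
    all_frames = set(range(Q))
    voicing = [all_frames - phi for phi in Phi]
    Psi = []
    for i in range(len(Phi)):
        dom = voicing[i]
        for j, phi in enumerate(Phi):
            if j != i:
                dom &= phi
        Psi.append(dom)
    return Psi
```

For N speakers, calling `silence_index_set` on each H_i and then `dominant_sets` on the resulting list yields the sets Ψ_1, ..., Ψ_N used in Step 4.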
and 4, respectively calculating a time-frequency domain second-order covariance matrix sequence corresponding to a corresponding time window according to a local dominant set of a single speaker, and extracting a dominant eigenvector from each order covariance matrix to form an estimated aliasing channel.
Using the time-frequency-domain aliased speech components x_{f,d} modeled in Step 2, a local second-order covariance matrix is constructed as:
R_{i,f} = (1 / |g(Ψ_i)|) Σ_{d ∈ g(Ψ_i)} x_{f,d} x_{f,d}^H
where g(Ψ_i) is the mapping function that converts the locally dominant set Ψ_i of a single speaker into the corresponding set of speech time-frequency frames.
Eigenvalue decomposition is performed on each local second-order covariance matrix, and the eigenvector corresponding to the largest eigenvalue (the dominant eigenvector) is extracted and denoted â_{f,i}, i = 1, ..., N; these vectors form the estimated aliasing channel Â_f = [â_{f,1}, ..., â_{f,N}], thereby realizing aliasing channel estimation.
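A sketch of Step 4 under the covariance expression reconstructed above: for each frequency bin, the observation outer products are averaged over the frames dominated by speaker i and the principal eigenvector is taken as that speaker's channel column. The mapping g(Ψ_i) from video frames to STFT frames is implemented here as a simple rate conversion, which is an assumption rather than the patent's exact mapping.

```python
import numpy as np

def video_frames_to_stft_frames(psi, fps=25, fs=8000, hop=1024, n_stft=None):
    """Illustrative g(Psi_i): map video-frame indices to STFT frame indices
    by matching the video frame rate to the STFT hop size (an assumption)."""
    frames = set()
    for q in psi:
        t0, t1 = q / fps, (q + 1) / fps
        frames.update(range(int(t0 * fs / hop), int(np.ceil(t1 * fs / hop))))
    if n_stft is not None:
        frames = {d for d in frames if d < n_stft}
    return sorted(frames)

def estimate_channel(X, frames_per_speaker):
    """X: (F, D, M) STFT observations; returns the estimate of shape (F, M, N)."""
    F, D, M = X.shape
    N = len(frames_per_speaker)
    A_hat = np.zeros((F, M, N), dtype=complex)
    for i, frames in enumerate(frames_per_speaker):
        for f in range(F):
            Xi = X[f, frames, :]                        # |g(Psi_i)| x M block
            R = Xi.T @ Xi.conj() / max(len(frames), 1)  # average of x x^H
            w, V = np.linalg.eigh(R)                    # Hermitian eigendecomposition
            A_hat[f, :, i] = V[:, -1]                   # dominant eigenvector a_hat_{f,i}
    return A_hat
```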
The feasibility and superiority of the algorithm are illustrated by three simulation experiments, all implemented on a MacBook Air (Intel Core i5, 1.8 GHz CPU, macOS 10.13.6) in Matlab R2018b. The scheme uses the audio-visual data set provided by David Dov et al. as the test set (David Dov, Ronen Talmon, and Israel Cohen, "Audio-visual voice activity detection using diffusion maps," IEEE/ACM Trans. Audio, Speech, Lang. Process., 23(4), 2015: 732-). From this data set, the mouth-movement videos and corresponding speech data of 4 speakers are selected, and the audio-visual test data set is constructed according to Step 1. The clean speech waveforms are shown in Fig. 1 and the aliased speech waveforms in Fig. 2. The captured mouth-region images of the speakers are shown in Fig. 3, the density-cluster-center detection of Step 3 is shown in Fig. 4, and the detected locally dominant time windows of a single speaker are shown in Fig. 5.
In addition, the scheme evaluates performance by the accuracy of the estimated aliasing channel, measured as the mean-square error (MSE) between the true aliasing channel A_f and its estimate Â_f; the smaller the error value, the higher the estimation accuracy.
Under different reverberation times RT60, the scheme is compared on the convolutive aliased speech channel estimation problem with two popular audio-only convolutive channel estimation methods, Bayes-RisMin and PARAFAC-SC; the aliasing channel estimation results are shown in Table 1 below. Clearly, the convolutive aliasing channel estimation method provided by this scheme performs better.
TABLE 1. Aliasing channel estimation accuracy (MSE) under different reverberation times RT60 [table values appear as an image in the original and are not reproduced here]

Claims (4)

1. A method for multi-channel convolution aliasing speech channel estimation in conjunction with a video signal, comprising the steps of:
collecting video data of a plurality of speakers, and editing video images of mouth regions of the speakers to form a video database; meanwhile, recording the voice signal of each speaker, and constructing an audio database; synthesizing a plurality of multi-channel convolution aliasing voice signals by using an audio database;
carrying out non-negative matrix factorization on the vectorized representation matrix of the speaker's mouth-region video images to obtain an image feature matrix and an image representation matrix respectively, expressed as:
V_i = W_i H_i
wherein V_i denotes the vectorized representation matrix of the speaker's mouth-region video images, the image feature matrix is W_i = [w_{i,1}, ..., w_{i,K}] ∈ (R^+)^{P×K}, and the image representation matrix is H_i = [h_{i,1}, ..., h_{i,Q}] ∈ (R^+)^{K×Q}, wherein i denotes the i-th speaker, P is the number of pixels per video frame, K is the number of columns of the image feature matrix, Q is the number of columns of the image representation matrix, R denotes the set of real numbers, K << Q, and every column of H_i has unit norm, i.e. ||h_{i,q}||_2 = 1, q = 1, ..., Q;
performing mathematical modeling of the multi-channel convolutive aliased speech signal in the time-frequency domain by the short-time Fourier transform, expressed as:
x_{f,d} = A_f s_{f,d} + e_{f,d}
wherein A_f is the aliasing channel at frequency bin f in the complex field, s_{f,d} is the speech source component at time-frequency point (f, d), e_{f,d} is Gaussian noise, and x_{f,d} denotes the aliased speech signal component at time-frequency point (f, d);
performing density clustering column by column on the image representation matrix of each single speaker to search out the maximum-density cluster center, setting a threshold to obtain the index set of data points neighboring the maximum-density cluster center, taking this index set as the speaker's mouth silent-state data set and the complement of this data set as the speaker's voicing-state data set; performing union and intersection operations on the silent-state and voicing-state data sets of the plurality of speakers to detect the locally dominant set of each single speaker;
wherein, when density clustering is performed column by column on the image representation matrix of a single speaker, the local density evaluation index ρ_{iq} of the i-th speaker is expressed as:
ρ_{iq} = Σ_{k=1}^{Q} 1(φ_{i,qk} < ε), q = 1, ..., Q
wherein φ_{i,qk} is defined as the Euclidean distance between columns h_{i,q} and h_{i,k} of the image representation matrix H_i, 1(·) is the indicator function, and ε is a preset Euclidean distance threshold;
respectively calculating, according to the locally dominant set of each single speaker, the time-frequency-domain second-order covariance matrix sequence R_{i,f} corresponding to the respective time windows, expressed as:
R_{i,f} = (1 / |g(Ψ_i)|) Σ_{d ∈ g(Ψ_i)} x_{f,d} x_{f,d}^H
wherein g(Ψ_i) is the mapping function that converts the locally dominant set Ψ_i of a single speaker into the corresponding set of speech time-frequency frames;
and extracting dominant eigenvectors from each order of covariance matrix to form an estimated aliasing channel.
2. The method as claimed in claim 1, wherein the method comprises collecting video data of multiple speakers, and editing video images of the mouth region of the speakers to form a video database; meanwhile, recording the voice signal of each speaker, and constructing an audio database, wherein the method comprises the following steps:
recording the front speaking videos of a plurality of speakers through a camera, keeping a certain pause when the speakers recite each sentence, editing the video images of the mouth area of the speakers, and forming a video database; and recording a pure voice signal of a speaker through a microphone while recording the video to construct an audio database.
3. The method of multi-channel convolution aliasing speech channel estimation in conjunction with a video signal according to claim 1, wherein said setting a threshold to obtain the index set of data points neighboring the maximum-density cluster center comprises:
setting a distance threshold μ, and labelling as Φ_i the index set of all image representation vector data points whose distance to the maximum-density cluster center is below the threshold.
4. The method of claim 1, wherein the dominant eigenvector is the eigenvector corresponding to the largest eigenvalue.
CN201910816592.7A 2019-08-30 2019-08-30 Multi-channel convolution aliasing voice channel estimation method combined with video signal Active CN110706709B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910816592.7A CN110706709B (en) 2019-08-30 2019-08-30 Multi-channel convolution aliasing voice channel estimation method combined with video signal

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910816592.7A CN110706709B (en) 2019-08-30 2019-08-30 Multi-channel convolution aliasing voice channel estimation method combined with video signal

Publications (2)

Publication Number Publication Date
CN110706709A CN110706709A (en) 2020-01-17
CN110706709B CN110706709B (en) 2021-11-19

Family

ID=69193509

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910816592.7A Active CN110706709B (en) 2019-08-30 2019-08-30 Multi-channel convolution aliasing voice channel estimation method combined with video signal

Country Status (1)

Country Link
CN (1) CN110706709B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114567527B (en) * 2022-03-10 2023-04-28 西华大学 Reconfigurable intelligent surface auxiliary superposition guide fusion learning channel estimation method

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20150066486A1 (en) * 2013-08-28 2015-03-05 Accusonus S.A. Methods and systems for improved signal decomposition
CN105335755A (en) * 2015-10-29 2016-02-17 武汉大学 Media segment-based speaking detection method and system
CN109671447A (en) * 2018-11-28 2019-04-23 广东工业大学 A kind of binary channels is deficient to determine Convolution Mixture Signals blind signals separation method

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Yuan Xie et al., "Underdetermined Blind Source Separation for Heart Sound Using Higher-Order Statistics and Sparse Representation," IEEE Access, vol. 7, pp. 87606-87616, 2019. *
Wang Yanfang et al., "An improved FastICA blind source separation method based on non-negative matrix factorization," Journal of Jiangsu University of Science and Technology (Natural Science Edition), vol. 32, no. 2, pp. 232-236, April 2018. *

Also Published As

Publication number Publication date
CN110706709A (en) 2020-01-17

Similar Documents

Publication Publication Date Title
Adavanne et al. A multi-room reverberant dataset for sound event localization and detection
Tan et al. Audio-visual speech separation and dereverberation with a two-stage multimodal network
JP5231139B2 (en) Sound source extraction device
JP2017044916A (en) Sound source identifying apparatus and sound source identifying method
CN111899756B (en) Single-channel voice separation method and device
He et al. Neural network adaptation and data augmentation for multi-speaker direction-of-arrival estimation
CN113470671B (en) Audio-visual voice enhancement method and system fully utilizing vision and voice connection
Kang et al. Multimodal speaker diarization of real-world meetings using d-vectors with spatial features
Sato et al. Multimodal attention fusion for target speaker extraction
Rachavarapu et al. Localize to binauralize: Audio spatialization from visual sound source localization
CN111868823A (en) Sound source separation method, device and equipment
CN110706709B (en) Multi-channel convolution aliasing voice channel estimation method combined with video signal
CN110265060B (en) Speaker number automatic detection method based on density clustering
Papayiannis et al. Detecting Media Sound Presence in Acoustic Scenes.
Zhang et al. Multi-Target Ensemble Learning for Monaural Speech Separation.
JP3949074B2 (en) Objective signal extraction method and apparatus, objective signal extraction program and recording medium thereof
CN117169812A (en) Sound source positioning method based on deep learning and beam forming
Chazan et al. Attention-based neural network for joint diarization and speaker extraction
JP6404780B2 (en) Wiener filter design apparatus, sound enhancement apparatus, acoustic feature quantity selection apparatus, method and program thereof
CN107919136B (en) Digital voice sampling frequency estimation method based on Gaussian mixture model
CN114822584A (en) Transmission device signal separation method based on integral improved generalized cross-correlation
Jafari et al. Underdetermined blind source separation with fuzzy clustering for arbitrarily arranged sensors
Dov et al. Multimodal kernel method for activity detection of sound sources
Krijnders et al. Tone-fit and MFCC scene classification compared to human recognition
Bergh et al. Multi-speaker voice activity detection using a camera-assisted microphone array

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant