CN110706709B - Multi-channel convolution aliasing voice channel estimation method combined with video signal - Google Patents

Multi-channel convolution aliasing voice channel estimation method combined with video signal Download PDF

Info

Publication number
CN110706709B
Authority
CN
China
Prior art keywords
speaker
video
matrix
voice
aliasing
Prior art date
Legal status: Active
Application number
CN201910816592.7A
Other languages
Chinese (zh)
Other versions
CN110706709A (en)
Inventor
杨俊杰
杨祖元
谢胜利
杨超
解元
Current Assignee: Guangdong University of Technology
Original Assignee
Guangdong University of Technology
Priority date
Filing date
Publication date
Application filed by Guangdong University of Technology filed Critical Guangdong University of Technology
Priority to CN201910816592.7A priority Critical patent/CN110706709B/en
Publication of CN110706709A publication Critical patent/CN110706709A/en
Application granted granted Critical
Publication of CN110706709B publication Critical patent/CN110706709B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G10L15/25: Speech recognition using non-acoustical features (position of the lips, movement of the lips or face analysis)
    • G06F18/23: Pattern recognition; analysing; clustering techniques
    • G10L25/03: Speech or voice analysis techniques characterised by the type of extracted parameters
    • G10L25/27: Speech or voice analysis techniques characterised by the analysis technique
    • G10L25/45: Speech or voice analysis techniques characterised by the type of analysis window
    • G10L25/57: Speech or voice analysis techniques specially adapted for comparison or discrimination, for processing of video signals
    • H04L25/0202: Baseband systems; channel estimation

Abstract

The invention discloses a multi-channel convolutive aliased speech channel estimation method combined with video signals, which introduces new mathematical tools and analysis methods and fuses video and audio information to achieve effective estimation of the convolutive speech aliasing channel. The method extracts speaker mouth-shape feature data from mouth-region video signals by non-negative matrix factorization. It then detects the cluster center of the mouth feature data with a density clustering method, identifies the image frames in which the speaker's mouth is silent, and thereby extracts all time windows dominated by a single speaker's voice. According to this locally dominant time-window information, locally dominant covariance matrices are computed from the time-frequency-domain observed speech components, and the dominant eigenvectors are extracted by eigenvalue decomposition, realizing aliased speech channel estimation. Numerical experiments demonstrate the superiority of the method over currently popular single-modality (audio-only) aliased speech channel estimation methods.

Description

Multi-channel convolution aliasing voice channel estimation method combined with video signal
Technical Field
The invention relates to the field of voice signal processing, in particular to a multi-channel convolution aliasing voice channel estimation method combined with a video signal.
Background
The task of Audio Speech Separation (ASS) is to separate the voice of a target speaker from the mixed speech of multiple speakers received by a microphone by means of signal processing. This is a very challenging problem in the field of signal processing. Before complete speech separation can be achieved, obtaining the aliasing channel information is a key step of the separation problem. In practice, processing the speech problem with the aid of a video signal can overcome the interference of background noise and provide more accurate information about the speaker's speaking state, compensating for the shortcomings of single-modality audio processing of mixed speech in noisy, highly reverberant environments.
In actual recording scenarios, the speech signal is affected by room reverberation and background noise, and the recorded speech is usually the aliased synthesis of multiple fading paths, which can be described mathematically as a convolutive aliasing model. Owing to high reverberation and strong background noise, the indoor convolutive speech mixing system is complex and the aliasing channel information is difficult to obtain, which greatly complicates subsequent speech separation. For single-modality audio signals, methods that transform the observed speech into the time-frequency domain for batch processing are popular for solving the aliasing channel estimation problem in reverberant and noisy environments, such as the currently popular PARAFAC-SC algorithm and the Bayes-RisMin algorithm. However, under high reverberation and high noise these existing techniques easily suffer from crosstalk between source signals, and the final aliasing channel estimates are not ideal.
Disclosure of Invention
The invention aims to provide a multi-channel convolutive aliased speech channel estimation method combined with a video signal, which addresses the problem that existing algorithms do not estimate the aliasing channel accurately enough.
To accomplish this task, the invention adopts the following technical scheme:
a method for multi-channel convolution aliasing speech channel estimation in conjunction with a video signal, comprising the steps of:
collecting video data of a plurality of speakers and editing the speakers' mouth-region video images to form a video database; simultaneously recording each speaker's speech signal to construct an audio database; and synthesizing a plurality of multi-channel convolutive aliased speech signals from the audio database;
performing non-negative matrix factorization on the vectorized representation matrix of the speakers' mouth-region video images to obtain an image feature matrix and an image representation matrix respectively; and modeling the multi-channel convolutive aliased speech signal mathematically in the time-frequency domain via the short-time Fourier transform;
performing density clustering column by column on the image representation matrix of each single speaker to search out the maximum-density cluster center, setting a threshold to obtain the index set of data points neighboring the maximum-density cluster center, taking this index set as the speaker's mouth silent-state data set and its complement as the speaker's voicing-state data set; and performing union and intersection operations on the silent-state and voicing-state data sets of the plurality of speakers to detect the locally dominant set of each single speaker;
computing, according to the locally dominant set of each single speaker, the time-frequency-domain second-order covariance matrix sequence over the corresponding time windows, and extracting the dominant eigenvector from each covariance matrix to form the estimated aliasing channel.
Furthermore, collecting the video data of a plurality of speakers and editing the mouth-region video images to form a video database, while recording each speaker's speech signal to construct an audio database, comprises the following steps:
recording frontal speaking videos of a plurality of speakers with a camera, keeping a short pause after each recited sentence, and editing the mouth-region video images to form the video database; and recording each speaker's clean speech signal with a microphone while recording the video, to construct the audio database.
Further, the non-negative matrix factorization of the vectorized representation matrix of the speaker's mouth-region video images into an image feature matrix and an image representation matrix is expressed as:
V_i = W_i H_i
where V_i denotes the vectorized representation matrix of the speaker's mouth-region video images, the image feature matrix is W_i = [w_{i,1}, ..., w_{i,K}] ∈ (R^+)^{P×K}, and the image representation matrix is H_i = [h_{i,1}, ..., h_{i,Q}] ∈ (R^+)^{K×Q}; i denotes the i-th speaker, P is the number of pixels per video frame, K is the number of columns of the image feature matrix, Q is the number of columns of the image representation matrix, R denotes the set of real numbers, and K << Q. Every column of H_i has unit norm, i.e. ||h_{i,q}||_2 = 1, q = 1, ..., Q.
Further, the mathematical modeling of the multi-channel convolutive aliased speech signal in the time-frequency domain by the short-time Fourier transform is expressed as:
x_{f,d} = A_f s_{f,d} + e_{f,d}
where A_f is the aliasing channel at frequency bin f in the complex field, s_{f,d} is the speech source component at time-frequency point (f, d), and e_{f,d} is Gaussian noise.
Further, when density clustering is performed column by column on the image representation matrix of a single speaker, the local density evaluation index ρ_{iq} of the i-th speaker is expressed as:
ρ_{iq} = Σ_{k=1}^{Q} 1(φ_{i,qk} < ε), q = 1, ..., Q
where φ_{i,qk} is defined as the Euclidean distance between columns h_{i,q} and h_{i,k} of the image representation matrix H_i, 1(·) is the indicator function, and ε is a preset Euclidean distance threshold.
Further, setting a threshold to obtain the index set of data points neighboring the maximum-density cluster center comprises:
setting a distance threshold μ, and labelling as Φ_i the index set of all image representation vector data points whose distance to the maximum-density cluster center is below the threshold.
Further, the time-frequency-domain second-order covariance matrix sequence corresponding to the respective time windows is calculated as:
R_{i,f} = (1 / |g(Ψ_i)|) Σ_{d ∈ g(Ψ_i)} x_{f,d} x_{f,d}^H
where g(Ψ_i) is the mapping function that converts the locally dominant set Ψ_i of a single speaker into the corresponding set of speech time-frequency frames.
Further, the dominant eigenvector is the eigenvector corresponding to the largest eigenvalue.
Compared with the prior art, the invention has the following technical characteristics:
the method detects a local dominant time window of a single speaker in a video image by means of video image detection of a speaker mouth region and introducing a mathematical tool (a non-negative matrix decomposition and density clustering method), and meanwhile constructs a time-frequency domain voice local covariance statistical matrix from an audio signal and extracts a dominant feature vector so as to estimate an aliasing channel; series of experiments prove that the algorithm has better estimation performance than other single audio mode algorithms.
Drawings
FIG. 1 is a diagram of a clean speech signal;
FIG. 2 is a diagram of an aliased speech signal;
FIG. 3(a) and (b) are mouth images of speaker 1 and speaker 2, respectively;
FIG. 4 is a schematic diagram of the density clustering effect of the feature data of the mouth image of the speaker 1;
FIG. 5 is a schematic diagram of a single speaker local dominance detection effect based on a mouth representation matrix;
FIG. 6 is a schematic flow chart of the method of the present invention.
Detailed Description
The invention provides a multi-channel convolutive aliased speech channel estimation method combined with video signals. In video, the mouth-region video signals of the N speakers are denoted V_1, ..., V_N, where V_i ∈ R^{P×Q} is the vectorized representation of the i-th speaker's mouth-region video, P is the number of pixels per video frame, Q is the total number of video frames, and i = 1, ..., N. In audio, the convolutive speech aliasing system is x(t) = A * s(t) + e(t), where x(t) ∈ R^M denotes the observed speech signals collected by the M microphones, A ∈ R^{M×N×L} is the L-th-order aliasing channel matrix under reverberation, * denotes convolution, s(t) ∈ R^N is the clean speech signal, and e(t) ∈ R^M is the system noise. The aim of the invention is to estimate the convolutive aliased speech channel A by combining the video and audio signals.
Step 1, collecting video data of a plurality of speakers, and editing video images of mouth regions of the speakers to form a video database; meanwhile, recording the voice signal of each speaker, and constructing an audio database; a plurality of multi-channel convolved aliased speech signals are synthesized using an audio database.
Firstly, recording front speaking videos of a plurality of speakers through a camera, keeping a certain pause when the speakers recite each sentence, and editing video images of a mouth area to form a video database; and recording a pure voice signal of a speaker through a microphone while recording the video to construct an audio database.
In this embodiment, three speech aliasing schemes are synthesized, with M = 2 or 3 microphones and N = 2, 3 or 4 speakers. The recorded speech has a sampling rate of fs = 8000 Hz and each recording is 40 seconds long. The microphone spacing is set to 0.05 m, the spacing between speakers to 0.4 m, and the distance between the microphone center and the speakers to 1.2 m. The reverberation times are set to RT60 = 100 ms, 150 ms, 200 ms and 250 ms, and the room impulse responses are generated with the image-based RIR algorithm (J. Allen and D. Berkley, "Image method for efficiently simulating small-room acoustics," J. Acoust. Soc. Amer., 65(4), 1979). The speakers' videos are recorded with a Samsung I9100 mobile phone at a frame rate of fps = 25, each image being 90 × 110 pixels. The short-time Fourier window length is set to 2048.
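Although the experiments described here were run in Matlab, the convolutive synthesis of the observations can be sketched in Python as follows. This is a minimal illustration assuming the room impulse responses have already been produced by an image-method generator; the function name and the `rirs` layout are illustrative assumptions, not part of the original.

```python
import numpy as np
from scipy.signal import fftconvolve

def synthesize_mixture(sources, rirs, noise_std=0.0):
    """Convolve N clean speech sources with M x N room impulse responses.

    sources : list of N 1-D arrays (clean speech, e.g. fs = 8000 Hz)
    rirs    : nested list, rirs[m][n] is the impulse response from
              speaker n to microphone m (e.g. produced by the image method)
    Returns an M x T array of convolutively aliased observations x(t).
    """
    M, N = len(rirs), len(sources)
    T = max(len(s) for s in sources)
    x = np.zeros((M, T))
    for m in range(M):
        for n in range(N):
            y = fftconvolve(sources[n], rirs[m][n])[:T]
            x[m, :len(y)] += y
    if noise_std > 0.0:                      # optional sensor noise e(t)
        x += noise_std * np.random.randn(*x.shape)
    return x
```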
Step 2, carrying out non-negative matrix decomposition on the vectorization expression matrix of the video image of the mouth region of the speaker to respectively obtain an image characteristic matrix and an image expression matrix so as to extract the characteristics of the video image of the mouth region of the speaker; a multi-channel convolved aliased speech signal is mathematically modeled in the time-frequency domain by a short-time fourier transform.
Because the video image array is large, processing directly in the image domain is computationally expensive and increases the algorithm complexity. The scheme therefore extracts the video image feature information by non-negative matrix factorization, reducing the dimensionality of the mouth-region images.
The vectorized representation matrix V_i of the speaker's mouth-region video images is factorized by non-negative matrix factorization, expressed as:
V_i = W_i H_i
where the image feature matrix is W_i = [w_{i,1}, ..., w_{i,K}] ∈ (R^+)^{P×K} and the image representation matrix is H_i = [h_{i,1}, ..., h_{i,Q}] ∈ (R^+)^{K×Q}; i denotes the i-th speaker, P is the number of pixels per video frame, K is the number of columns of the image feature matrix, Q is the number of columns of the image representation matrix, R denotes the set of real numbers, and K << Q. Every column of H_i has unit norm, i.e. ||h_{i,q}||_2 = 1, q = 1, ..., Q.
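As an illustrative sketch of this factorization (the patent does not prescribe a particular NMF solver), scikit-learn's NMF can be used, followed by the column normalization of H_i described above; the function name and solver settings are assumptions.

```python
import numpy as np
from sklearn.decomposition import NMF

def mouth_nmf_features(V, K):
    """Factorize the P x Q mouth-region video matrix as V ~= W H and
    normalize every column of H to unit length for the later clustering."""
    model = NMF(n_components=K, init='nndsvda', max_iter=500)
    W = model.fit_transform(V)            # P x K image feature matrix W_i
    H = model.components_                 # K x Q image representation matrix H_i
    norms = np.linalg.norm(H, axis=0)
    norms[norms == 0] = 1.0               # guard against all-zero frames
    H = H / norms                         # ||h_{i,q}||_2 = 1, q = 1..Q
    return W, H
```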
Mathematically modeling a multi-channel convolved aliased speech signal x (t) in the time-frequency domain using a short-time Fourier transform:
there are N signals (N2, 3,4), aliasing occurs when M microphones receive (M2, 3), and aliasing speech signal component x in time-frequency points (f, d)f,dExpressed as:
xf,d=Afsf,d+ef,d
wherein A isf=[af,1,...,af,N]Is an alias channel, s, at a frequency point f in the complex fieldf,dIs the speech source component on the time frequency point (f, d), ef,dIs gaussian noise.
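A minimal sketch of taking the M-channel observation to the time-frequency domain, using the 2048-sample window length stated in the embodiment; the use of scipy.signal.stft and the helper name are assumptions made for illustration.

```python
import numpy as np
from scipy.signal import stft

def multichannel_stft(x, fs=8000, nperseg=2048):
    """STFT of the M-channel observation x (shape M x T).

    Returns frequencies f, frame times d and X of shape (F, D, M) so that
    X[f, d, :] is the observation vector x_{f,d} in the model
    x_{f,d} = A_f s_{f,d} + e_{f,d}.
    """
    f, d, Z = stft(x, fs=fs, nperseg=nperseg)   # Z has shape (M, F, D)
    X = np.transpose(Z, (1, 2, 0))              # reorder to (F, D, M)
    return f, d, X
```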
Step 3: density clustering is performed column by column on the image representation matrix H_i of a single speaker i to find the maximum-density cluster center, and a threshold μ is set to obtain the index set Φ_i of data points neighboring that center. Φ_i is taken as the silent-state data set of speaker i's mouth, and its complement, denoted Φ_i^c, is taken as the speaker's voicing-state data set.
Union and intersection operations on the silent-state and voicing-state data sets of the N speakers then detect the locally dominant set of each single speaker, denoted Ψ_1, ..., Ψ_N.
In this step, the local density evaluation index of the i-th speaker, ρ_{iq}, q = 1, ..., Q, is calculated as:
ρ_{iq} = Σ_{k=1}^{Q} 1(φ_{i,qk} < ε)
where φ_{i,qk} is defined as the Euclidean distance between columns h_{i,q} and h_{i,k} of the image representation matrix H_i, 1(·) is the indicator function, and ε is a preset Euclidean distance threshold, chosen for example so that the smallest 6%-8% of the distances {φ_{i,qk}}, q, k = 1, ..., Q (sorted in ascending order) fall below it. The local density indices ρ_{i1}, ..., ρ_{iQ} are extracted for every speaker i = 1, ..., N.
The maximum-density cluster center is searched out and a distance threshold μ is set (in this embodiment μ ≈ 0.3) to obtain the mouth silent-state data set of speaker i: the index set of all image representation vectors whose distance to the maximum-density cluster center is below the threshold is labelled Φ_i (the silent-state data set), and the speaker's voicing-state data set is its complement, denoted Φ_i^c.
The locally dominant set of a single speaker is then detected through the intersection operation:
Ψ_i = Φ_i^c ∩ ( ∩_{j ≠ i} Φ_j ), i = 1, ..., N.
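The following sketch illustrates Step 3 under the reconstruction given above: local density by neighbor counting within a distance threshold, silent-state sets from the neighbors of the maximum-density center, and the intersection rule for the locally dominant sets. The quantile-based threshold and the helper names are assumptions, not the patent's exact procedure.

```python
import numpy as np

def silence_index_set(H, eps_quantile=0.07, mu=0.3):
    """Silent-state frame index set Phi_i of one speaker from the K x Q
    representation matrix H (columns = per-frame representation vectors)."""
    # pairwise Euclidean distances phi_{qk} between representation columns
    D = np.linalg.norm(H[:, :, None] - H[:, None, :], axis=0)
    eps = np.quantile(D, eps_quantile)        # roughly the smallest 6%-8%
    rho = np.sum(D < eps, axis=1)             # local density of each frame
    center = int(np.argmax(rho))              # maximum-density cluster center
    return set(np.where(D[center] < mu)[0])   # neighbors of that center

def dominant_sets(Phi, Q):
    """Psi_i: frames where speaker i is voicing while all others are silent."""
    all_frames = set(range(Q))
    voicing = [all_frames - phi for phi in Phi]
    Psi = []
    for i in range(len(Phi)):
        dom = voicing[i]
        for j, phi in enumerate(Phi):
            if j != i:
                dom &= phi
        Psi.append(dom)
    return Psi
```

For N speakers, calling `silence_index_set` on each H_i and then `dominant_sets` on the resulting list yields the sets Ψ_1, ..., Ψ_N used in Step 4.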
and 4, respectively calculating a time-frequency domain second-order covariance matrix sequence corresponding to a corresponding time window according to a local dominant set of a single speaker, and extracting a dominant eigenvector from each order covariance matrix to form an estimated aliasing channel.
Using the time-frequency-domain aliased speech components x_{f,d} modeled in Step 2, a local second-order covariance matrix is constructed as:
R_{i,f} = (1 / |g(Ψ_i)|) Σ_{d ∈ g(Ψ_i)} x_{f,d} x_{f,d}^H
where g(Ψ_i) is the mapping function that converts the locally dominant set Ψ_i of a single speaker into the corresponding set of speech time-frequency frames.
Eigenvalue decomposition is performed on each local second-order covariance matrix, and the eigenvector corresponding to the largest eigenvalue (the dominant eigenvector) is extracted and denoted â_{f,i}, i = 1, ..., N; these vectors form the estimated aliasing channel Â_f = [â_{f,1}, ..., â_{f,N}], thereby realizing aliasing channel estimation.
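A sketch of Step 4 under the covariance expression reconstructed above: for each frequency bin, the observation outer products are averaged over the frames dominated by speaker i and the principal eigenvector is taken as that speaker's channel column. The mapping g(Ψ_i) from video frames to STFT frames is implemented here as a simple rate conversion, which is an assumption rather than the patent's exact mapping.

```python
import numpy as np

def video_frames_to_stft_frames(psi, fps=25, fs=8000, hop=1024, n_stft=None):
    """Illustrative g(Psi_i): map video-frame indices to STFT frame indices
    by matching the video frame rate to the STFT hop size (an assumption)."""
    frames = set()
    for q in psi:
        t0, t1 = q / fps, (q + 1) / fps
        frames.update(range(int(t0 * fs / hop), int(np.ceil(t1 * fs / hop))))
    if n_stft is not None:
        frames = {d for d in frames if d < n_stft}
    return sorted(frames)

def estimate_channel(X, frames_per_speaker):
    """X: (F, D, M) STFT observations; returns the estimate of shape (F, M, N)."""
    F, D, M = X.shape
    N = len(frames_per_speaker)
    A_hat = np.zeros((F, M, N), dtype=complex)
    for i, frames in enumerate(frames_per_speaker):
        for f in range(F):
            Xi = X[f, frames, :]                        # |g(Psi_i)| x M block
            R = Xi.T @ Xi.conj() / max(len(frames), 1)  # average of x x^H
            w, V = np.linalg.eigh(R)                    # Hermitian eigendecomposition
            A_hat[f, :, i] = V[:, -1]                   # dominant eigenvector a_hat_{f,i}
    return A_hat
```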
The feasibility and superiority of the algorithm are illustrated by three simulation experiments, all implemented on a MacBook Air (Intel Core i5, 1.8 GHz CPU, macOS 10.13.6) in Matlab R2018b. The scheme uses the audio-visual data set provided by David Dov et al. as the test set (David Dov, Ronen Talmon, and Israel Cohen, "Audio-visual voice activity detection using diffusion maps," IEEE/ACM Trans. Audio, Speech, Lang. Process., 23(4), 2015: 732-). From this data set, the mouth-movement videos and corresponding speech data of 4 speakers are selected, and the audio-visual test data set is constructed according to Step 1. The clean speech waveforms are shown in Fig. 1 and the aliased speech waveforms in Fig. 2. The captured mouth-region images of the speakers are shown in Fig. 3, the density-cluster-center detection of Step 3 is shown in Fig. 4, and the detected locally dominant time windows of a single speaker are shown in Fig. 5.
In addition, the scheme evaluates performance by the accuracy of the estimated aliasing channel, measured as the mean-square error (MSE) between the true aliasing channel A_f and its estimate Â_f; the smaller the error value, the higher the estimation accuracy.
Under different reverberation times RT60, the scheme is compared on the convolutive aliased speech channel estimation problem with two popular audio-only convolutive channel estimation methods, Bayes-RisMin and PARAFAC-SC; the aliasing channel estimation results are shown in Table 1 below. Clearly, the convolutive aliasing channel estimation method provided by this scheme performs better.
TABLE 1. Aliasing channel estimation accuracy (MSE) under different reverberation times RT60 [table values appear as an image in the original and are not reproduced here]

Claims (4)

1. A method for multi-channel convolution aliasing speech channel estimation in conjunction with a video signal, comprising the steps of:
collecting video data of a plurality of speakers, and editing video images of mouth regions of the speakers to form a video database; meanwhile, recording the voice signal of each speaker, and constructing an audio database; synthesizing a plurality of multi-channel convolution aliasing voice signals by using an audio database;
carrying out non-negative matrix factorization on the vectorized representation matrix of the speaker's mouth-region video images to obtain an image feature matrix and an image representation matrix respectively, expressed as:
V_i = W_i H_i
wherein V_i denotes the vectorized representation matrix of the speaker's mouth-region video images, the image feature matrix is W_i = [w_{i,1}, ..., w_{i,K}] ∈ (R^+)^{P×K}, and the image representation matrix is H_i = [h_{i,1}, ..., h_{i,Q}] ∈ (R^+)^{K×Q}, wherein i denotes the i-th speaker, P is the number of pixels per video frame, K is the number of columns of the image feature matrix, Q is the number of columns of the image representation matrix, R denotes the set of real numbers, K << Q, and every column of H_i has unit norm, i.e. ||h_{i,q}||_2 = 1, q = 1, ..., Q;
performing mathematical modeling of the multi-channel convolutive aliased speech signal in the time-frequency domain by the short-time Fourier transform, expressed as:
x_{f,d} = A_f s_{f,d} + e_{f,d}
wherein A_f is the aliasing channel at frequency bin f in the complex field, s_{f,d} is the speech source component at time-frequency point (f, d), e_{f,d} is Gaussian noise, and x_{f,d} denotes the aliased speech signal component at time-frequency point (f, d);
performing density clustering column by column on the image representation matrix of each single speaker to search out the maximum-density cluster center, setting a threshold to obtain the index set of data points neighboring the maximum-density cluster center, taking this index set as the speaker's mouth silent-state data set and the complement of this data set as the speaker's voicing-state data set; performing union and intersection operations on the silent-state and voicing-state data sets of the plurality of speakers to detect the locally dominant set of each single speaker;
wherein, when density clustering is performed column by column on the image representation matrix of a single speaker, the local density evaluation index ρ_{iq} of the i-th speaker is expressed as:
ρ_{iq} = Σ_{k=1}^{Q} 1(φ_{i,qk} < ε), q = 1, ..., Q
wherein φ_{i,qk} is defined as the Euclidean distance between columns h_{i,q} and h_{i,k} of the image representation matrix H_i, 1(·) is the indicator function, and ε is a preset Euclidean distance threshold;
respectively calculating, according to the locally dominant set of each single speaker, the time-frequency-domain second-order covariance matrix sequence R_{i,f} corresponding to the respective time windows, expressed as:
R_{i,f} = (1 / |g(Ψ_i)|) Σ_{d ∈ g(Ψ_i)} x_{f,d} x_{f,d}^H
wherein g(Ψ_i) is the mapping function that converts the locally dominant set Ψ_i of a single speaker into the corresponding set of speech time-frequency frames;
and extracting dominant eigenvectors from each order of covariance matrix to form an estimated aliasing channel.
2. The method as claimed in claim 1, wherein the method comprises collecting video data of multiple speakers, and editing video images of the mouth region of the speakers to form a video database; meanwhile, recording the voice signal of each speaker, and constructing an audio database, wherein the method comprises the following steps:
recording the front speaking videos of a plurality of speakers through a camera, keeping a certain pause when the speakers recite each sentence, editing the video images of the mouth area of the speakers, and forming a video database; and recording a pure voice signal of a speaker through a microphone while recording the video to construct an audio database.
3. The method of multi-channel convolution aliasing speech channel estimation in conjunction with a video signal according to claim 1, wherein said setting a threshold to obtain the index set of data points neighboring the maximum-density cluster center comprises:
setting a distance threshold μ, and labelling as Φ_i the index set of all image representation vector data points whose distance to the maximum-density cluster center is below the threshold.
4. The method of claim 1, wherein the dominant eigenvector is the eigenvector corresponding to the largest eigenvalue.
CN201910816592.7A 2019-08-30 2019-08-30 Multi-channel convolution aliasing voice channel estimation method combined with video signal Active CN110706709B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910816592.7A CN110706709B (en) 2019-08-30 2019-08-30 Multi-channel convolution aliasing voice channel estimation method combined with video signal

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910816592.7A CN110706709B (en) 2019-08-30 2019-08-30 Multi-channel convolution aliasing voice channel estimation method combined with video signal

Publications (2)

Publication Number Publication Date
CN110706709A CN110706709A (en) 2020-01-17
CN110706709B CN110706709B (en) 2021-11-19

Family

ID=69193509

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910816592.7A Active CN110706709B (en) 2019-08-30 2019-08-30 Multi-channel convolution aliasing voice channel estimation method combined with video signal

Country Status (1)

Country Link
CN (1) CN110706709B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114567527B (en) * 2022-03-10 2023-04-28 西华大学 Reconfigurable intelligent surface auxiliary superposition guide fusion learning channel estimation method

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20150066486A1 (en) * 2013-08-28 2015-03-05 Accusonus S.A. Methods and systems for improved signal decomposition
CN105335755A (en) * 2015-10-29 2016-02-17 武汉大学 Media segment-based speaking detection method and system
CN109671447A (en) * 2018-11-28 2019-04-23 广东工业大学 A kind of binary channels is deficient to determine Convolution Mixture Signals blind signals separation method

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Yuan Xie et al., "Underdetermined Blind Source Separation for Heart Sound Using Higher-Order Statistics and Sparse Representation," IEEE Access, vol. 7, pp. 87606-87616, 2019. *
Wang Yanfang et al., "An improved FastICA blind source separation method based on non-negative matrix factorization," Journal of Jiangsu University of Science and Technology (Natural Science Edition), vol. 32, no. 2, pp. 232-236, April 2018. *

Also Published As

Publication number Publication date
CN110706709A (en) 2020-01-17

Similar Documents

Publication Publication Date Title
Adavanne et al. A multi-room reverberant dataset for sound event localization and detection
Tan et al. Audio-visual speech separation and dereverberation with a two-stage multimodal network
JP5231139B2 (en) Sound source extraction device
JP2017044916A (en) Sound source identifying apparatus and sound source identifying method
CN111899756B (en) Single-channel voice separation method and device
He et al. Neural network adaptation and data augmentation for multi-speaker direction-of-arrival estimation
CN113470671B (en) Audio-visual voice enhancement method and system fully utilizing vision and voice connection
Kang et al. Multimodal speaker diarization of real-world meetings using d-vectors with spatial features
Sato et al. Multimodal attention fusion for target speaker extraction
Rachavarapu et al. Localize to binauralize: Audio spatialization from visual sound source localization
CN111868823A (en) Sound source separation method, device and equipment
CN110706709B (en) Multi-channel convolution aliasing voice channel estimation method combined with video signal
CN110265060B (en) Speaker number automatic detection method based on density clustering
Papayiannis et al. Detecting Media Sound Presence in Acoustic Scenes.
Zhang et al. Multi-Target Ensemble Learning for Monaural Speech Separation.
JP3949074B2 (en) Objective signal extraction method and apparatus, objective signal extraction program and recording medium thereof
CN117169812A (en) Sound source positioning method based on deep learning and beam forming
Chazan et al. Attention-based neural network for joint diarization and speaker extraction
JP6404780B2 (en) Wiener filter design apparatus, sound enhancement apparatus, acoustic feature quantity selection apparatus, method and program thereof
CN107919136B (en) Digital voice sampling frequency estimation method based on Gaussian mixture model
CN114822584A (en) Transmission device signal separation method based on integral improved generalized cross-correlation
Jafari et al. Underdetermined blind source separation with fuzzy clustering for arbitrarily arranged sensors
Dov et al. Multimodal kernel method for activity detection of sound sources
Krijnders et al. Tone-fit and MFCC scene classification compared to human recognition
Bergh et al. Multi-speaker voice activity detection using a camera-assisted microphone array

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant