CN116469404A - Audio-visual cross-mode fusion voice separation method - Google Patents

Audio-visual cross-mode fusion voice separation method

Info

Publication number
CN116469404A
CN116469404A (Application CN202310430709.4A)
Authority
CN
China
Prior art keywords
audio
visual
fusion
separation
features
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310430709.4A
Other languages
Chinese (zh)
Inventor
兰朝凤
赵世龙
蒋朋威
郭锐
郭小霞
韩玉兰
韩闯
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Harbin University of Science and Technology
Original Assignee
Harbin University of Science and Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Harbin University of Science and Technology filed Critical Harbin University of Science and Technology
Priority to CN202310430709.4A priority Critical patent/CN116469404A/en
Publication of CN116469404A publication Critical patent/CN116469404A/en
Pending legal-status Critical Current

Classifications

    • G10L 21/0208 Speech enhancement: noise filtering
    • G06V 10/82 Image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • G06V 40/171 Human faces: local features and components; facial parts; occluding parts, e.g. glasses; geometrical relationships
    • G10L 21/0272 Speech enhancement: voice signal separating
    • G10L 21/0308 Voice signal separating characterised by the type of parameter measurement, e.g. correlation techniques, zero crossing techniques or predictive techniques
    • G10L 25/30 Speech or voice analysis techniques characterised by the analysis technique using neural networks
    • G10L 2021/02087 Noise filtering, the noise being separate speech, e.g. cocktail party

Landscapes

  • Engineering & Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Multimedia (AREA)
  • Physics & Mathematics (AREA)
  • Human Computer Interaction (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Signal Processing (AREA)
  • Acoustics & Sound (AREA)
  • Computational Linguistics (AREA)
  • Theoretical Computer Science (AREA)
  • Quality & Reliability (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Oral & Maxillofacial Surgery (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Computing Systems (AREA)
  • Databases & Information Systems (AREA)
  • Medical Informatics (AREA)
  • Software Systems (AREA)
  • Stereophonic System (AREA)

Abstract

Existing audio-visual speech separation models mostly perform a simple concatenation of the video and audio features, so the visual information is not fully exploited and the separation effect is not ideal. The invention fully considers the interrelation between visual and audio features and, using a multi-head attention mechanism combined with the convolutional time-domain separation model (Conv-TasNet) and the Dual-Path Recurrent Neural Network (DPRNN), proposes a time-domain audio-visual cross-modal fusion speech separation (Conv-AVSS) model. The model obtains audio features and lip features through an audio encoder and a visual encoder, performs cross-modal fusion of the audio-visual features with a multi-head attention mechanism, and obtains the separated speech of the different speakers through a DPRNN separation network. Experimental tests were performed on the VoxCeleb2 dataset using the PESQ, STOI and SDR evaluation metrics. The results show that, when separating mixtures of two, three or four speakers, the proposed method improves SDR over conventional separation networks by more than 1.87 dB, and by up to 2.29 dB, which demonstrates the effectiveness of the method.

Description

Audio-visual cross-mode fusion voice separation method
Technical Field
The invention relates to an audio-visual cross-mode fusion voice separation method, and belongs to the field of voice separation.
Background
Speech is one of the most convenient and accurate means of communicating information and expressing emotion, and it plays an important role in the development of human society. Advances in speech processing technology have driven progress in speech-based human-computer interaction and improved the ability of humans to interact with intelligent terminals. Speech processing includes speech separation, speech enhancement, speech recognition, natural language understanding and so on. Speech separation is the front-end of speech processing, and its result affects the quality of the subsequent interaction stages, so speech separation has attracted increasing attention from researchers.
Speech separation originates from the "cocktail party problem": in a complex, noisy environment, a person can still hear the sound of interest. Conventional speech separation techniques are mainly based on signal processing and statistical methods; common single-channel methods include independent component analysis (Independent Component Analysis, ICA), non-negative matrix factorization (Non-negative Matrix Factorization, NMF) and computational auditory scene analysis (Computational Auditory Scene Analysis, CASA). Conventional methods are difficult to optimize, require long training times and need prior information about the speech, which limits further improvement of separation performance. With the rapid development of deep learning, data-driven mining of deep information has advanced the field of speech separation, for example Deep Clustering (DPCL), permutation invariant training (Permutation Invariant Training, PIT) and utterance-level permutation invariant training (Utterance-level Permutation Invariant Training, uPIT). However, deep-learning-based audio-only separation takes only audio information as input; in more complex real scenes the interference increases and the separation performance is easily degraded.
In crowded restaurants and noisy bars, people can attend only to the sound of interest and ignore external interference. This ability to perceive speech in complex scenes relies not only on the human auditory system but also benefits from the visual system; together they form the human multi-sensory perception of complex environments. Psychological studies have shown that a speaker's facial expressions or lip movements can affect how the brain processes sound, and visual information plays an important role in conversation and communication, so observing the speaker's lip movements can help people understand the speaker in noisy environments. Inspired by this, studies on multimodal active speaker detection, audio-visual speech separation, audio-visual synchronization and related topics based on audio-visual fusion have been proposed in succession, and audio-visual fusion speech separation has become a new research hotspot.
For multi-speaker speech separation, the large number of speakers means that the image information involves heavy computation and high model complexity, which easily leads to over-fitting or under-fitting; moreover, in audio-visual speech separation the visual information only plays an auxiliary role, so the focus of multi-speaker separation research remains the audio signal. If the audio signal can be exploited to the greatest extent, the separation effect will be greatly improved, which has motivated a series of end-to-end speech separation methods. The input and output of end-to-end speech separation are time-domain speech signals, so no short-time Fourier transform (Short Time Fourier Transform, STFT) is needed to convert the time-domain signal to the frequency domain; the phase information of the audio signal can therefore be used, improving the separation effect. End-to-end methods were first applied to audio-only separation: Luo et al. successively proposed the time-domain audio separation network (Time-domain Audio Separation Network, TasNet), the convolutional time-domain audio separation network (Convolutional Time-domain Audio Separation Network, Conv-TasNet) and the dual-path recurrent neural network (Dual-path Recurrent Neural Network, DPRNN). With the development of multimodal audio-visual speech separation, researchers have combined end-to-end methods with audio-visual separation to realize end-to-end time-domain audio-visual speech separation.
Wu et al. proposed a time-domain audio-visual speech separation model in which the audio part uses the Conv-TasNet structure with an encoder to obtain audio features, and the video part uses a residual network (Residual Neural Network, ResNet) to extract visual features, with a CNN extracting the lip images; because the lip images contain visual information unrelated to the audio, the computation is somewhat large. Other researchers use the difference between the mixed speech and the network output signal to compute the multiple separated speech signals. Xu Liang et al. proposed a multi-feature fusion audio-visual speech separation model in which the visual part extracts multiple features to obtain more visual features containing speech information and the audio-visual fusion part fuses them multiple times; however, the TCN used by its separation network is limited by the convolutional receptive field when facing very long speech sequences. Gao et al. proposed a multitask modeling strategy that uses an Inflated 3D ConvNet (I3D) to obtain lip-movement optical-flow information and establishes the matching between faces and voices by learning cross-modal embeddings, effectively solving the audio-visual inconsistency problem through the correlation between faces and voices. Xiong et al. applied this multitask modeling to audio-visual fusion, extracting lip features with the lightweight ShuffleNet v2 and proposing audio-visual speech separation based on a joint feature representation with cross-modal attention built on the self-attention mechanism, improving the utilization of visual information. Zhang et al. proposed an audio-visual speech separation network with adversarially disentangled visual representations, which extracts speech-related visual features from the visual input and uses them to assist speech separation; this effectively reduces the amount of image data required, but in the audio-visual fusion part the feature concatenation at the convolutional layer does not fully exploit the visual features. Wu et al. also proposed a low-quality time-domain audio-visual speech separation model, which uses an attention mechanism to select visual features related to the audio features from low-quality video and combines them with multimodal fusion based on the Conv-TasNet model, obtaining better separation results when training with low-quality data.
Although the above time-domain audio-visual speech separation methods achieve good separation performance, they either fuse the audio-visual features in an overly simple way or cannot extract the complete sequence when facing long speech sequences in the fusion or separation network.
Disclosure of Invention
Aiming at the problems that audio-visual fusion is overly simple and that the complete sequence cannot be extracted for long speech sequences, the invention provides an audio-visual cross-modal fusion speech separation method based on a dual-path recurrent network and Conv-TasNet.
The invention discloses an audio-visual cross-mode fusion voice separation method, which comprises the following steps:
s1, obtaining audio features and lip features of a video by using an audio encoder and a visual encoder;
s2, performing cross-modal fusion on the audio features and the visual features by adopting a multi-head attention mechanism to obtain audio-visual fusion features;
s3, processing the audio-visual fusion features by using a DPRNN separation network to obtain the separated voices of different speakers.
Preferably, the S1 includes:
s11, the visual encoder is composed of a lip embedding extractor and a temporal convolution block, wherein the lip embedding extractor is composed of a 3D convolution layer and an 18-layer residual network. Through the lip embedding extractor and the temporal convolution block, the visual encoder generates a lip feature vector f_v of dimension k_v, where v represents the lip image;
s12, the audio encoder consists of a one-dimensional convolution, which is used instead of the STFT to generate an audio feature vector f_a of dimension k_a, where a denotes the input audio.
Preferably, in step S1, the temporal convolution block consists of a temporal convolution, BN, a ReLU activation function and downsampling, and takes the 256-dimensional feature vector l_v as input, where l_v represents the lip image. The ReLU activation function and BN processing suppress the problems of gradient explosion and gradient vanishing, and downsampling reduces the dimension of the feature vector l_v. The lip feature vector of the input video image processed by the visual encoder is

f_v = F(Conv1D(v, L_v, S_v))

where Conv1D(·) denotes the convolution operation, v denotes the lip image, L_v denotes the convolution kernel size, S_v denotes the convolution stride, and F(·) denotes the ReLU function.
Preferably, in step S2, the multi-speaker cross-modal fusion module based on the multi-head attention mechanism first splices the visual features of the different speakers output by the visual encoder, then performs cross-modal fusion of the spliced visual features with the audio features, and finally outputs an audio-visual fusion feature f_av of dimension k_av, where av denotes audio-visual fusion.
Preferably, in step S3, DPRNN is adopted as the separation network. The DPRNN separation network first segments the input audio-visual feature f_av into audio-visual fusion blocks, feeds the blocks into a BiLSTM network for inter-block processing, and then overlap-adds the processed blocks to output a prediction mask M_i for each speaker, i = (1, 2, …, n), where n is the number of speakers; the prediction mask M_i has the same dimension as the audio feature vector f_a. Finally, each mask M_i is multiplied with the audio encoder output f_a and input to the decoder, and the predicted speaker audio is restored by the decoder.
The invention fully considers the interrelation between visual and audio features and, using a multi-head attention mechanism combined with the convolutional time-domain separation model Conv-TasNet and the dual-path recurrent neural network DPRNN, provides a time-domain audio-visual cross-modal fusion speech separation model, Conv-AVSS. The audio features and the lip features of the video are obtained through the audio encoder and the visual encoder, cross-modal fusion of the audio and visual features is performed with a multi-head attention mechanism to obtain the audio-visual fusion features, and the audio-visual fusion features are passed through a DPRNN separation network to obtain the separated voices of the different speakers. Experimental tests were performed on the VoxCeleb2 dataset using the PESQ, STOI and SDR evaluation metrics. The results show that, when separating the mixed speech of two, three or four speakers, the SDR improvement over conventional separation networks is more than 1.87 dB and up to 2.29 dB. The invention can therefore exploit the phase information of the audio signal, better utilize the correlation between visual and audio information, extract more accurate audio-visual features and obtain a better separation effect.
Drawings
FIG. 1 is a frame diagram of a Conv-TasNet speech separation architecture;
FIG. 2 is a diagram of the Conv-TasNet-based time-domain audio-visual cross-modal fusion speech separation model;
FIG. 3 is a block diagram of a visual encoder;
FIG. 4 is a block diagram of an audio encoder;
FIG. 5 is an overall block diagram of a cross-modality fusion module;
FIG. 6 is a cross-modal attention fusion strategy graph;
FIG. 7 is a diagram of a DPRNN split network architecture;
FIG. 8 is a block diagram of the PESQ algorithm;
Detailed Description
The following description of the embodiments of the present invention will be made clearly and completely with reference to the accompanying drawings, in which it is apparent that the embodiments described are only some embodiments of the present invention, but not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
It should be noted that, without conflict, the embodiments of the present invention and features of the embodiments may be combined with each other.
The invention is further described below with reference to the drawings and specific examples, which are not intended to be limiting. In the audio-visual cross-modal fusion speech separation method based on the dual-path recurrent network and Conv-TasNet, the audio features and the lip features of the video are first obtained through the audio encoder and the visual encoder; next, cross-modal fusion of the audio and visual features is performed with a multi-head attention mechanism to obtain the audio-visual fusion features; finally, the audio-visual fusion features are processed by a DPRNN separation network to obtain the separated voices of the different speakers. The overall system block diagram is shown in fig. 2. The implementation of cross-modal audio-visual speech separation in this embodiment comprises the following steps:
s1, obtaining audio features and lip features of video by using an audio encoder and a visual encoder, wherein the method comprises the following steps:
s11, the visual encoder is composed of a lip embedding extractor and a temporal convolution block, wherein the lip embedding extractor is composed of a 3D convolution layer and an 18-layer residual network. Through the lip embedding extractor and the temporal convolution block, the visual encoder generates a lip feature vector f_v of dimension k_v, where v represents the lip image;
s12, the audio encoder consists of a one-dimensional convolution, which is used instead of the STFT to generate an audio feature vector f_a of dimension k_a, where a denotes the input audio.
This implementation constructs the speech separation model on the basis of the pure speech separation Conv-TasNet network architecture. The Conv-TasNet network mainly consists of an encoder, a temporal convolution separation network and a decoder, as shown in fig. 1.
In fig. 1, Conv-TasNet uses an encoder instead of the STFT to obtain the audio features; since the encoder input is the mixed speech waveform directly, no time-frequency conversion is required, so the phase information of the audio signal can be used. The temporal convolution separation network computes the mask of each speaker from the audio features output by the encoder; the masks are multiplied with the encoder output, and the separated speech is then obtained through the decoder, whose role is similar to that of the ISTFT.
The implementation improves Conv-TasNet network, adds a visual encoder, combines a cross-modal fusion method based on an attention mechanism and a DPRNN separation network, and provides a time domain audio-visual cross-modal fusion voice separation Conv-AVSS model, wherein the structure of the Conv-AVSS model is shown in figure 2.
The speech separation model of fig. 2 consists essentially of four parts: a visual encoder, an audio encoder/decoder, a multi-speaker cross-modal fusion module, and a separation network. The visual encoder consists of a lip embedding extractor and a temporal convolution block; the lip embedding extractor consists of a 3D convolutional layer and an 18-layer residual network, and the temporal convolution block consists of a temporal convolution, a ReLU activation function and BN. Through the lip embedding extractor and the temporal convolution block, the visual encoder generates a lip feature vector f_v of dimension k_v, where v denotes the lip image. The audio encoder consists of a one-dimensional convolution, which replaces the STFT and generates an audio feature vector f_a of dimension k_a, where a denotes the input audio.
In order to fully consider the correlation among the different modalities and realize a joint representation across modalities, this implementation provides a cross-modal fusion module based on an attention mechanism. The multi-speaker cross-modal fusion module first splices the visual features of the different speakers output by the visual encoder, then performs cross-modal fusion of the spliced visual features with the audio features, and finally outputs an audio-visual fusion feature f_av of dimension k_av, where av denotes audio-visual fusion.
The separation network adopts DPRNN, which optimizes the RNN within a deep model so that long sequences can still be processed efficiently. The DPRNN separation network first segments the input audio-visual feature f_av into audio-visual fusion blocks, feeds the blocks into a BiLSTM network for inter-block processing, and then overlap-adds the processed blocks to output a prediction mask M_i for each speaker, i = (1, 2, …, n), where n is the number of speakers; the prediction mask M_i has the same dimension as the audio feature vector f_a. Finally, each mask M_i is multiplied with the audio encoder output f_a and input to the decoder, which restores the predicted audio of each speaker.
Since the lip image contains speech information and context information, this implementation designs a visual encoder to extract the speaker's lip visual features; its internal structure is shown in fig. 3.
In fig. 3, the visual encoder consists of a lip embedding extractor, composed of 3D convolutional layers and an 18-layer ResNet, and a temporal convolution block; using a CNN allows the lip features to be better extracted from the input mixed visual information. Meanwhile, to avoid the network degradation problem caused by increasing the number of network layers, a ResNet is used. The ResNet consists of 17 convolutional layers and 1 fully connected layer; the network input is the video frames and the output is a 256-dimensional feature vector l_v, where l_v represents the lip image.
The temporal convolution block consists of a temporal convolution, BN, a ReLU activation function and downsampling, and takes the 256-dimensional feature vector l_v as input. The ReLU activation function and BN processing suppress gradient explosion and gradient vanishing, and downsampling reduces the dimension of the feature vector l_v; the temporal convolution has a kernel size of 3, 512 channels and a stride of 1. The lip feature vector of the input video image processed by the visual encoder is:

f_v = F(Conv1D(v, L_v, S_v))

where Conv1D(·) denotes the convolution operation, v denotes the lip image, L_v denotes the convolution kernel size, S_v denotes the convolution stride, and F(·) denotes the ReLU function.
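As an illustration of the temporal convolution block just described, the following minimal PyTorch sketch applies a temporal Conv1d (kernel size 3, 512 channels, stride 1, as stated above) with BN and ReLU to the 256-dimensional lip embedding l_v. The class name, the output dimension and the use of a 1x1 convolution for downsampling are assumptions for illustration, not the patent's actual implementation.

```python
import torch
import torch.nn as nn

class TemporalConvBlock(nn.Module):
    """Temporal convolution block of the visual encoder (sketch).

    Takes the 256-dim lip embedding l_v from the 3D-conv + ResNet-18 front end
    and applies temporal Conv1d -> BN -> ReLU -> downsampling, as described in
    the text. The downsampling operator and output size are assumptions.
    """
    def __init__(self, in_dim=256, hidden=512, out_dim=256, kernel=3, stride=1):
        super().__init__()
        self.conv = nn.Conv1d(in_dim, hidden, kernel_size=kernel,
                              stride=stride, padding=kernel // 2)
        self.bn = nn.BatchNorm1d(hidden)
        self.relu = nn.ReLU()
        # "Downsampling" is realized here as a 1x1 convolution reducing channels.
        self.down = nn.Conv1d(hidden, out_dim, kernel_size=1)

    def forward(self, l_v):                 # l_v: (batch, 256, T_frames)
        x = self.relu(self.bn(self.conv(l_v)))
        return self.down(x)                 # f_v: (batch, out_dim, T_frames)

# usage sketch: 3 s of video at 25 fps
lip_embedding = torch.randn(2, 256, 75)
f_v = TemporalConvBlock()(lip_embedding)
```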
Because extracting the audio features with the STFT ignores the phase information, and the correlation between time-frequency domain information and visual information is small, an audio encoder is designed to extract the audio features from the input mixed speech signal. The audio encoder applies a one-dimensional convolution directly to the mixed speech; its structure is shown in fig. 4.
In fig. 4, the audio encoder performs a one-dimensional convolution on the input mixed speech with a kernel size of 40 and a stride of 20, converting the mixed speech a_n into a representation of dimension k_a. Expressed as a matrix multiplication, this is:

W = F(a_n U^T)

where W denotes the convolution result, U denotes the encoder basis functions, and F(·) denotes the ReLU function.
After the one-dimensional convolution, a rectified linear unit (ReLU) function is applied to ensure that the convolved matrix W is non-negative. The input mixed speech processed by the audio encoder gives:

f_a = F(Conv1D(a_n, L_a, S_a))

where Conv1D(·) denotes the convolution operation, a_n denotes the input mixed audio, L_a denotes the convolution kernel size, and S_a denotes the convolution stride.
The decoder uses a one-dimensional transposed convolution to reconstruct the waveform from the masked representation, which can be expressed as a matrix multiplication:

â_n = D U_d

where â_n denotes the reconstructed speech signal, D denotes the masked representation, and U_d denotes the decoder basis functions.
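As a reference, the encoder and decoder described by the matrix equations above map onto a Conv1d with kernel size 40 and stride 20 followed by a ReLU, and a matching ConvTranspose1d. The following is a minimal PyTorch sketch, assuming an encoder dimension k_a = 256 and illustrative class names; it is not the patent's actual implementation.

```python
import torch
import torch.nn as nn

class AudioEncoder(nn.Module):
    """1-D conv encoder replacing the STFT: W = F(a_n U^T) (sketch)."""
    def __init__(self, k_a=256, kernel=40, stride=20):
        super().__init__()
        self.conv = nn.Conv1d(1, k_a, kernel_size=kernel, stride=stride, bias=False)
        self.relu = nn.ReLU()                 # keeps the encoded representation non-negative

    def forward(self, mixture):               # mixture: (batch, 1, samples)
        return self.relu(self.conv(mixture))  # f_a: (batch, k_a, frames)

class AudioDecoder(nn.Module):
    """Transposed 1-D conv reconstructing the waveform from the masked representation."""
    def __init__(self, k_a=256, kernel=40, stride=20):
        super().__init__()
        self.deconv = nn.ConvTranspose1d(k_a, 1, kernel_size=kernel, stride=stride, bias=False)

    def forward(self, masked_repr):            # masked_repr: (batch, k_a, frames)
        return self.deconv(masked_repr)        # (batch, 1, samples)

# usage sketch: encode, apply a predicted mask M_i, decode
enc, dec = AudioEncoder(), AudioDecoder()
mixture = torch.randn(2, 1, 48000)             # 3 s at 16 kHz (assumed sample rate)
f_a = enc(mixture)
mask_i = torch.sigmoid(torch.randn_like(f_a))  # stands in for the separation network output
speaker_i = dec(mask_i * f_a)
```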
S2, performing cross-modal fusion on the audio features and the visual features by adopting a multi-head attention mechanism to obtain audio-visual fusion features:
In order to fully consider the correlation among the modalities and realize a joint representation across the different modalities, this implementation adopts a multi-head attention mechanism on the basis of a cross-modal fusion strategy and provides a cross-modal fusion module based on the attention mechanism; its overall structure is shown in fig. 5.
In fig. 5, the multi-speaker cross-modal fusion module first splices the lip features f_v of the different speakers, then downsamples the spliced lip features to reduce their dimension and obtain the visual feature f_v′; the visual feature f_v′ and the audio feature f_a are then fused across modalities.
The attention mechanism can capture local and global relations with few parameters and low model complexity. Therefore, this implementation uses the attention mechanism to obtain, from the visual features, the parts related to the audio features, thereby reducing the interference of irrelevant visual information and improving the utilization of the visual information. The attention mechanism is expressed as:

Attention(Q, K, V) = softmax(QK^T / √d_k) V

where Q, K and V denote the query, key and value, respectively, and d_k denotes the dimension of K.
Inspired by the multi-head attention of the Transformer, the cross-modal fusion module adopts a cross-modal attention fusion (Cross-Modal Attention, CMA) strategy. A learnable parameter λ is added to the formula so that the attention weights can be adjusted adaptively, together with a residual connection I(f_m), which accelerates the convergence of the model. The resulting self-attention cross-modal fusion (SCMA) mechanism can be expressed as:

SCMA(Q_vm, K_vm, V_a) = λ · softmax(Q_vm K_vm^T / √d) V_a + I(f_m)

where the visual feature f_vm is passed through a two-dimensional convolution to obtain Q_vm and K_vm, the audio feature f_a is passed through a two-dimensional convolution to obtain V_a, d is the dimension of Q_vm, K_vm and V_a, and the output is the audio-visual fusion feature. The specific fusion process is shown in fig. 6(a).
Multi-head attention uses multiple subspaces to make the model attend to more of the visual information, which further strengthens the fitting ability of the model and makes full use of the interrelation between the different modalities. On the basis of SCMA, multi-head attention cross-modal fusion (Multiple Head Cross-Modal Attention, HCMA) is therefore adopted, using multiple subspaces so that the model attends to information from different aspects, as shown in fig. 6(b). HCMA repeats the SCMA process three times, combines the outputs, and finally outputs the audio-visual fusion feature. HCMA can thus be calculated according to the following formulas:
Q_vmi = Q_vm W_i^Q,  K_vmi = K_vm W_i^K,  V_ai = V_a W_i^V,  i = 1, 2, 3

head_i = SCMA(Q_vmi, K_vmi, V_ai),  i = 1, 2, 3

HCMA(Q_vm, K_vm, V_a) = Concat(head_1, head_2, head_3)
where i denotes the index of the attention head, W_i^Q, W_i^K and W_i^V denote trainable weight matrices, Q_vmi, K_vmi and V_ai denote Q_vm, K_vm and V_a in the different subspaces, and head_i denotes the fusion result of a single attention head.
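The SCMA/HCMA computation above can be sketched in PyTorch as follows. Linear projections stand in for the two-dimensional convolutions, the residual connection I(f_m) is taken on the audio path, and the fusion dimension of 256 is assumed; this is only an illustrative sketch of the described mechanism, not the patent's implementation, and it assumes the visual features have been aligned to the audio frame rate beforehand.

```python
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

class SCMA(nn.Module):
    """Single-head cross-modal attention (sketch of the SCMA described above).

    Q and K are projected from the (spliced) visual feature, V from the audio
    feature; a learnable scalar lambda scales the attention output before a
    residual connection (here on the audio path, an assumption).
    """
    def __init__(self, dim=256):
        super().__init__()
        self.q, self.k, self.v = nn.Linear(dim, dim), nn.Linear(dim, dim), nn.Linear(dim, dim)
        self.lam = nn.Parameter(torch.zeros(1))      # learnable lambda, starts at 0

    def forward(self, f_v, f_a):                     # (batch, T, dim) each, same T
        Q, K, V = self.q(f_v), self.k(f_v), self.v(f_a)
        attn = F.softmax(Q @ K.transpose(-2, -1) / math.sqrt(Q.size(-1)), dim=-1)
        return self.lam * (attn @ V) + f_a           # residual connection I(f_m)

class HCMA(nn.Module):
    """Three SCMA heads in parallel, outputs concatenated (multi-head CMA sketch)."""
    def __init__(self, dim=256, heads=3):
        super().__init__()
        self.heads = nn.ModuleList(SCMA(dim) for _ in range(heads))
        self.proj = nn.Linear(heads * dim, dim)      # map back to the fusion dimension (assumption)

    def forward(self, f_v, f_a):
        fused = torch.cat([h(f_v, f_a) for h in self.heads], dim=-1)
        return self.proj(fused)                      # f_av: (batch, T, dim)

# usage sketch
f_v, f_a = torch.randn(2, 2399, 256), torch.randn(2, 2399, 256)
f_av = HCMA()(f_v, f_a)
```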
S3, processing the audio-visual fusion features with a DPRNN separation network to obtain the separated voices of different speakers:
the DPRNN network optimizes RNN in a deep model, can divide longer audio into small blocks in the process of separating audio signals, and iteratively applies intra-block and inter-block operations to enable the long sequences to be processed with high efficiency, so that the implementation is based on the DPRNN network and combines the study results of Wu et al to provide a Conv-AVSS voice separation model. The DPRNN network architecture is shown in fig. 7.
In fig. 7, DPRNN is divided into three stages: segmentation, block processing and overlap-add. The input audio-visual feature f_av is first segmented: f_av is divided into overlapping blocks, the first and last blocks are zero-padded so that all blocks have equal length, and the blocks are then concatenated to form a 3D tensor f′_av.
The 3D tensor f′_av is fed into stacked DPRNN blocks, each of which converts the input 3D tensor into another tensor f″_av of the same shape. Each DPRNN block contains two sub-modules, a local (intra-block) operation and a global (inter-block) operation; the intra-block and inter-block results are normalized with a linear fully connected layer and layer normalization (Layer Normalization, LN), which guarantees that f′_av and f″_av have the same dimensions;
after passing through the dual-path RNN, the tensor yields a predicted tensor on which an overlap-add operation is performed. After overlap-add, the prediction mask M_i of each speaker is obtained, i = (1, 2, …, n), where n is the number of speakers; the prediction mask M_i has the same dimension as the audio feature vector f_a. The output mask values are multiplied with the audio features, and the speech of the multiple speakers is obtained through the decoder.
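A minimal sketch of the segmentation, stacked DPRNN blocks and overlap-add described above is given below; the chunk length K = 100, the LSTM hidden size, the number of stacked blocks and the sigmoid mask head are assumptions for illustration, not the patent's actual settings.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def segment(x, K=100):
    """Split (B, N, T) into half-overlapping chunks -> (B, N, K, S); zero-pad the ends."""
    B, N, T = x.shape
    P = K // 2
    pad = (P - (T - K) % P) % P                    # pad so the last chunk is full
    x = F.pad(x, (P, P + pad))                     # zero padding at both ends, as in the text
    chunks = x.unfold(2, K, P)                     # (B, N, S, K)
    return chunks.transpose(2, 3).contiguous(), pad

def overlap_add(chunks, T, pad):
    """Inverse of segment: (B, N, K, S) -> (B, N, T) by summing the overlaps."""
    B, N, K, S = chunks.shape
    P = K // 2
    out = F.fold(chunks.reshape(B, N * K, S),
                 output_size=(1, T + 2 * P + pad), kernel_size=(1, K), stride=(1, P))
    return out.reshape(B, N, -1)[:, :, P:P + T]

class DPRNNBlock(nn.Module):
    """One DPRNN block: intra-chunk then inter-chunk BiLSTM, each with FC + LayerNorm and a residual."""
    def __init__(self, N=256, H=128):
        super().__init__()
        self.intra = nn.LSTM(N, H, batch_first=True, bidirectional=True)
        self.inter = nn.LSTM(N, H, batch_first=True, bidirectional=True)
        self.fc_intra, self.fc_inter = nn.Linear(2 * H, N), nn.Linear(2 * H, N)
        self.ln_intra, self.ln_inter = nn.LayerNorm(N), nn.LayerNorm(N)

    def forward(self, x):                          # x: (B, N, K, S)
        B, N, K, S = x.shape
        t = x.permute(0, 3, 2, 1).reshape(B * S, K, N)            # local (intra-block)
        t = self.ln_intra(self.fc_intra(self.intra(t)[0]))
        x = x + t.reshape(B, S, K, N).permute(0, 3, 2, 1)
        t = x.permute(0, 2, 3, 1).reshape(B * K, S, N)            # global (inter-block)
        t = self.ln_inter(self.fc_inter(self.inter(t)[0]))
        return x + t.reshape(B, K, S, N).permute(0, 3, 1, 2)

# usage sketch: f_av (B, N, T) -> one mask M_i per speaker, same shape as f_a
B, N, T, n_spk = 2, 256, 2399, 2
f_av = torch.randn(B, N, T)
chunks, pad = segment(f_av)
for blk in [DPRNNBlock(N) for _ in range(4)]:      # stacked DPRNN blocks
    chunks = blk(chunks)
masks = torch.sigmoid(nn.Conv1d(N, n_spk * N, 1)(overlap_add(chunks, T, pad)))
masks = masks.reshape(B, n_spk, N, T)              # sigmoid mask head is an assumption
```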
Experiment:
1. Experimental environment
The cross-modal fusion Conv-AVSS network proposed in this implementation is realized with the PyTorch toolkit. The lip data and audio data are processed and the training data are preprocessed. A weight decay of 10^-2 is used; during training the batch size is set to 10, training runs for a total of 500 epochs, and the initial learning rate is set to 1×10^-4. If the loss does not decrease for 5 consecutive epochs, the learning rate is reduced to 1/10 of its previous value. The experimental equipment uses an Intel(R) Core(TM) i7-9700 CPU @ 3.00 GHz with 32 GB of memory, a 64-bit Windows 10 operating system and a GeForce RTX 2080Ti GPU; the experiments are run on the GPU.
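The training schedule described above (weight decay 10^-2, batch size 10, 500 epochs, initial learning rate 1×10^-4, learning rate reduced to 1/10 after 5 epochs without improvement) maps naturally onto a PyTorch optimizer and scheduler, sketched below. The choice of the Adam optimizer and the stand-in model, dataset and loss are assumptions, since the text does not specify them.

```python
import torch
import torch.nn.functional as F
from torch.optim import Adam
from torch.optim.lr_scheduler import ReduceLROnPlateau
from torch.utils.data import DataLoader

# model and train_set are placeholders for the Conv-AVSS network and the
# preprocessed VoxCeleb2 mixtures described in the text.
model = torch.nn.Linear(10, 10)                     # stand-in module
train_set = [(torch.randn(10), torch.randn(10))]    # stand-in dataset

loader = DataLoader(train_set, batch_size=10, shuffle=True)
optimizer = Adam(model.parameters(), lr=1e-4, weight_decay=1e-2)
# reduce the learning rate to 1/10 if the loss has not improved for 5 epochs
scheduler = ReduceLROnPlateau(optimizer, mode="min", factor=0.1, patience=5)

for epoch in range(500):
    epoch_loss = 0.0
    for x, y in loader:
        optimizer.zero_grad()
        loss = F.mse_loss(model(x), y)              # a separation loss (e.g. SI-SNR) in practice
        loss.backward()
        optimizer.step()
        epoch_loss += loss.item()
    scheduler.step(epoch_loss)
```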
2. Voice data set
The speech dataset is VoxCeleb2, an audio-visual dataset created by Chung et al. of Oxford University from collected YouTube video data; it contains one million video clips drawn from the videos of more than 6000 speakers worldwide. The VoxCeleb2 dataset covers 140 different ethnicities and languages and is relatively balanced in accent, speaker age and speaker gender. The dataset mainly consists of speeches and interview videos in which each clip contains the image of only one person; the clip durations range from 4 s to 20 s, and the videos have been processed with face recognition and face tracking so that the speaker's face is within the frame and the lips are in the middle of the picture.
40000 video clips were downloaded from the VoxCeleb2 dataset. The 40000 clips were first cropped with FFmpeg so that each clip is 3 seconds long, then randomly split into 4 equal parts of 10000 clips each, serving as the data sources for speaker 1, speaker 2, speaker 3 and speaker 4, respectively. Finally, the video clips of each speaker were numbered.
For the two-speaker case, the clips of speaker 1 and speaker 2 with corresponding numbers are mixed to obtain 10000 mixed utterances; 9000 of them are randomly selected as the training set of the model and the remaining 1000 are used as the test set.
For the three-speaker case, the clips of speaker 1, speaker 2 and speaker 3 with corresponding numbers are mixed to obtain 10000 mixed utterances; 9000 are randomly selected as the training set and the remaining 1000 are used as the test set.
For the four-speaker case, the clips of speaker 1, speaker 2, speaker 3 and speaker 4 with corresponding numbers are mixed to obtain 10000 mixed utterances; 9000 are randomly selected as the training set and the remaining 1000 are used as the test set.
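The index-wise mixing described above can be sketched as follows; torchaudio is assumed for audio I/O, the clips are assumed to be equal-length (3 s) after the FFmpeg cropping, and the file naming is illustrative.

```python
import torch
import torchaudio

def mix_by_index(paths_per_speaker, out_dir):
    """Mix the i-th clip of every speaker into the i-th mixture (sketch).

    paths_per_speaker: list of lists of wav paths, one list per speaker,
    e.g. [["spk1_00000.wav", ...], ["spk2_00000.wav", ...], ...].
    Clips are assumed to have equal length and sample rate.
    """
    n_clips = min(len(p) for p in paths_per_speaker)
    for i in range(n_clips):
        waves = []
        for spk_paths in paths_per_speaker:
            wav, sr = torchaudio.load(spk_paths[i])
            waves.append(wav)
        mixture = torch.stack(waves).sum(dim=0)     # simple additive mixture
        torchaudio.save(f"{out_dir}/mix_{i:05d}.wav", mixture, sr)
```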
3. Experimental results
To analyze the performance of the cross-modal fusion Conv-AVSS network, taking the separation of two speakers as an example, the speech separation effect is evaluated with SDR, PESQ and STOI, and the results are shown in Table 1. In the table, TCN + feature splicing denotes the AV baseline in which the separation network uses a TCN and audio-visual fusion uses the feature splicing method; DPRNN + feature splicing denotes the structure in which the separation network uses DPRNN and audio-visual fusion uses feature splicing; TCN + SCMA denotes a TCN separation network with self-attention cross-modal fusion; TCN + HCMA denotes a TCN separation network with multi-head attention cross-modal fusion; DPRNN + SCMA denotes a DPRNN separation network with self-attention cross-modal fusion; and DPRNN + HCMA denotes a DPRNN separation network with multi-head attention cross-modal fusion.
TABLE 1 ablation experiments of Conv-AVSS model
As shown in Table 1, the SDR of DPRNN + feature splicing is 9.53 dB, an improvement of 0.38 dB over the AV baseline without the DPRNN separation network, which indicates that the DPRNN separation network models the signal better and effectively improves audio-visual speech separation performance. The SDR of DPRNN + SCMA and DPRNN + HCMA is 10.31 dB and 11.02 dB, respectively, improvements of 0.78 dB and 1.49 dB over DPRNN + feature splicing, showing that cross-modal attention exploits the interrelation between the different modalities better than feature splicing and yields more accurate audio-visual features.

Claims (5)

1. An audio-visual cross-modal fusion voice separation method, comprising:
s1, obtaining audio features and lip features of a video by using an audio encoder and a visual encoder;
s2, performing cross-modal fusion on the audio features and the visual features by adopting a multi-head attention mechanism to obtain audio-visual fusion features;
s3, processing the audio-visual fusion features by using a DPRNN separation network to obtain the separated voices of different speakers.
2. The audio-visual cross-modal fusion speech separation method according to claim 1, wherein the S1 includes:
s11, the visual encoder consists of a lip embedding extractor and a temporal convolution block, wherein the lip embedding extractor consists of a 3D convolution layer and an 18-layer residual network; through the lip embedding extractor and the temporal convolution block, the visual encoder generates a lip feature vector f_v of dimension k_v, where v represents the lip image;
s12, the audio encoder consists of a one-dimensional convolution, which is used instead of the STFT to generate an audio feature vector f_a of dimension k_a, where a denotes the input audio.
3. The audio-visual cross-modal fusion speech separation method according to claim 2, wherein in S11: the temporal convolution block consists of a temporal convolution, BN, a ReLU activation function and downsampling, and takes the 256-dimensional feature vector l_v as input, where l_v represents the lip image; the ReLU activation function and BN processing suppress the problems of gradient explosion and gradient vanishing, and downsampling reduces the dimension of the feature vector l_v; the lip feature vector of the input video image processed by the visual encoder is

f_v = F(Conv1D(v, L_v, S_v))

where Conv1D(·) denotes the convolution operation, v denotes the lip image, L_v denotes the convolution kernel size, S_v denotes the convolution stride, and F(·) denotes the ReLU function.
4. The audio-visual cross-modal fusion voice separation method according to claim 1, wherein in S2: the multi-speaker cross-modal fusion module based on the multi-head attention mechanism first splices the visual features of the different speakers output by the visual encoder, then performs cross-modal fusion of the spliced visual features with the audio features, and finally outputs an audio-visual fusion feature f_av of dimension k_av, where av denotes audio-visual fusion.
5. The audio-visual cross-modal fusion voice separation method according to claim 1, wherein in S3: DPRNN is adopted as the separation network; the DPRNN separation network first segments the input audio-visual feature f_av into audio-visual fusion blocks, feeds the blocks into a BiLSTM network for inter-block processing, and then overlap-adds the processed blocks to output a prediction mask M_i for each speaker, i = (1, 2, …, n), n being the number of speakers, the prediction mask M_i having the same dimension as the audio feature vector f_a; finally each mask M_i is multiplied with the audio encoder output f_a and input to the decoder, and the predicted speaker audio is restored by the decoder.
CN202310430709.4A 2023-04-20 2023-04-20 Audio-visual cross-mode fusion voice separation method Pending CN116469404A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310430709.4A CN116469404A (en) 2023-04-20 2023-04-20 Audio-visual cross-mode fusion voice separation method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310430709.4A CN116469404A (en) 2023-04-20 2023-04-20 Audio-visual cross-mode fusion voice separation method

Publications (1)

Publication Number Publication Date
CN116469404A true CN116469404A (en) 2023-07-21

Family

ID=87183862

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310430709.4A Pending CN116469404A (en) 2023-04-20 2023-04-20 Audio-visual cross-mode fusion voice separation method

Country Status (1)

Country Link
CN (1) CN116469404A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117877504A (en) * 2024-03-11 2024-04-12 中国海洋大学 Combined voice enhancement method and model building method thereof
CN117877504B (en) * 2024-03-11 2024-05-24 中国海洋大学 Combined voice enhancement method and model building method thereof


Similar Documents

Publication Publication Date Title
JP7337953B2 (en) Speech recognition method and device, neural network training method and device, and computer program
CN111930992B (en) Neural network training method and device and electronic equipment
Luo et al. Investigation on Joint Representation Learning for Robust Feature Extraction in Speech Emotion Recognition.
CN110797002B (en) Speech synthesis method, speech synthesis device, electronic equipment and storage medium
WO2014062521A1 (en) Emotion recognition using auditory attention cues extracted from users voice
Pandey et al. Liptype: A silent speech recognizer augmented with an independent repair model
Li et al. Listen, Watch and Understand at the Cocktail Party: Audio-Visual-Contextual Speech Separation.
Li et al. Deep audio-visual speech separation with attention mechanism
CN112151030A (en) Multi-mode-based complex scene voice recognition method and device
EP3392882A1 (en) Method for processing an input audio signal and corresponding electronic device, non-transitory computer readable program product and computer readable storage medium
Qu et al. Multimodal target speech separation with voice and face references
Xiong et al. Look&listen: Multi-modal correlation learning for active speaker detection and speech enhancement
Yu et al. A two-stage complex network using cycle-consistent generative adversarial networks for speech enhancement
Wang et al. Fastlts: Non-autoregressive end-to-end unconstrained lip-to-speech synthesis
Kadyrov et al. Speaker recognition from spectrogram images
WO2022062800A1 (en) Speech separation method, electronic device, chip and computer-readable storage medium
Li et al. VCSE: Time-domain visual-contextual speaker extraction network
Malik et al. A preliminary study on augmenting speech emotion recognition using a diffusion model
CN116417008A (en) Cross-mode audio-video fusion voice separation method
Dumpala et al. A Cycle-GAN approach to model natural perturbations in speech for ASR applications
CN115691539A (en) Two-stage voice separation method and system based on visual guidance
CN116469404A (en) Audio-visual cross-mode fusion voice separation method
CN113327631B (en) Emotion recognition model training method, emotion recognition method and emotion recognition device
CN115881156A (en) Multi-scale-based multi-modal time domain voice separation method
CN111883105B (en) Training method and system for context information prediction model of video scene

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination