CN117854535B - Cross-attention-based audio-visual voice enhancement method and model building method thereof - Google Patents

Cross-attention-based audio-visual voice enhancement method and model building method thereof

Info

Publication number: CN117854535B
Application number: CN202410263766.2A
Authority: CN (China)
Prior art keywords: audio, module, output, input, decoder
Legal status: Active (granted)
Other versions: CN117854535A
Other languages: Chinese (zh)
Inventors: 付民, 肖涵予, 于靖雯, 夏多舜, 孙梦楠, 郑冰
Current and original assignee: Ocean University of China
Application filed by Ocean University of China; priority to CN202410263766.2A; publication of CN117854535A; application granted; publication of CN117854535B; legal status: Active.

Landscapes

  • Image Analysis (AREA)
  • Stereophonic System (AREA)

Abstract

The invention provides a cross-attention-based audio-visual speech enhancement method and a method for building its model, belonging to the technical field of speech recognition models. First, raw data consisting of videos of multiple speakers and the corresponding audio are acquired and preprocessed, and a data set is built from the Mel features of the speech and the face frames. An audio preprocessing module is constructed based on 1D convolution; a facial feature processing module is constructed based on ResNet-18 + CBAM and a Transformer encoder network; and the audio-visual features are fused with cross attention and a Transformer decoder to build a new audio-visual speech enhancement model, which is used to enhance mixed audio once construction is complete. Compared with methods using a single audio stream or other audio-visual feature fusion methods, the proposed audio-visual speech enhancement model achieves a clear performance improvement.

Description

Cross-attention-based audio-visual voice enhancement method and model building method thereof
Technical Field
The invention belongs to the technical field of speech recognition models, and particularly relates to an audiovisual speech enhancement method based on cross attention and a model building method thereof.
Background
In general, a listener with normal hearing can concentrate on a specific acoustic stimulus, the target voice or voice of interest, while filtering out other sounds. This well-known phenomenon is called the cocktail party effect because it resembles what happens at a cocktail party, and it has drawn attention to the problem of speech enhancement. The purpose of speech enhancement is to eliminate the noise components in a signal while preserving the clean speech signal, improving speech quality and intelligibility. With the development of digital signal processing technology, speech enhancement technology has also developed and improved greatly: filtering, enhancement, dereverberation and other digital processing further improve the quality and clarity of the speech signal. Speech enhancement based on digital signal processing can be divided into two broad categories: traditional digital speech enhancement methods and neural-network-based speech enhancement methods.
Conventional digital speech enhancement methods are usually based on signal processing in the time domain or the frequency domain; common methods include spectral subtraction, Wiener filtering, and subspace methods. They are only suitable for simple noise scenes, whereas real noise scenes are often complex. In recent years, because deep learning generalizes well, can automatically learn features from large amounts of data, and can handle different speech enhancement scenes and tasks, its application in the speech enhancement field has been growing, and many well-performing speech enhancement models have been proposed.
However, speech perception is multi-modal in nature, particularly audiovisual: in addition to the acoustic speech signal reaching the listener's ears, the position and movement of some of the articulators that contribute to speech production (e.g., tongue, teeth, lips, chin, and facial expressions) may also be visible to the receiver. Studies in neuroscience and speech perception have shown that the visual aspect of speech has a potentially strong impact on a person's ability to focus auditory attention on a particular stimulus. In 2018 Google proposed a deep-learning-based joint audio-visual speech separation/enhancement model whose enhancement performance improved markedly over audio-only methods. However, such methods remain insufficient in how they fuse audio-visual information, and how to combine audio and video features effectively to improve the speech enhancement effect is still worth exploring.
Disclosure of Invention
In view of the above problems, a first aspect of the present invention provides a method for building an audio-visual speech enhancement model based on cross attention, including the following steps:
Step 1, obtaining raw data consisting of videos of multiple speakers and the corresponding audio;
Step 2, preprocessing the raw data obtained in step 1: processing each video into frame-by-frame images, randomly selecting one speaker's data and noise data from the raw data, mixing their audio in a certain proportion, performing a Mel transform on the mixed speech to obtain its Mel feature map, constructing a data set together with the face frames corresponding to the speaker's data, and dividing the data set into a training set, a validation set and a test set;
Step 3, constructing the cross-attention-based audio-visual speech enhancement model: constructing a visual feature processing module based on a ResNet network structure and the CBAM attention mechanism; constructing an audio feature processing module based on 1D convolution and the Gaussian error linear unit GELU; obtaining the K and V matrices of the visual features with a Transformer encoder; in the Transformer decoder, replacing the second self-attention layer of the original Transformer decoder with a cross-attention layer, which takes the audio features as the Q matrix and the output of the encoder as the K and V matrices; taking the Mel feature map of the mixed speech and the video face frames as inputs, the model outputs a predicted audio Mel feature map, and the final predicted audio is obtained by applying an inverse Mel-spectrum transform to this feature map;
Step 4, training, testing and evaluating the constructed audio-visual speech enhancement model with the preprocessed data set to obtain the final audio-visual speech enhancement model.
Preferably, the specific preprocessing process in step 2 is as follows:
First, each video is cropped at 25 frames per second to obtain images arranged along the time dimension; the existing OpenCV-based MTCNN face detector is applied to each image to extract the face thumbnail of the target speaker, and a Facenet pre-trained model, obtained by training on a large number of face images, is used to extract the facial features of each face thumbnail. Then one speaker's data and noise data are randomly selected from the raw data, their audio is mixed, a short-time Fourier transform is applied to the mixed speech to obtain its spectrogram, and a data set is constructed together with the facial features corresponding to the speaker's data.
Preferably, the visual feature processing module consists of a modified ResNet residual network and the convolutional block attention module CBAM;
The modified ResNet residual network comprises 1 conv5 convolution layer and 4 conv_res layers. The conv5 layer consists of a 5×5 convolution kernel with stride 1, a batch normalization layer BN and a ReLU activation function; each conv_res layer consists of two identical convolution blocks, each comprising a 1×7 convolution kernel with stride 1, a BN layer and a ReLU activation function. The input-output relation of the convolution block can be expressed as:
y = ReLU(x + BN(conv_res(ReLU(BN(conv_res(x))))))
where x is the input of the convolution block, y is its output, and conv_res is the 1×7 convolution operation; the output of the modified ResNet residual network is used as the input of the CBAM module;
The CBAM module consists of a channel attention module and a spatial attention module; it is placed after the modified ResNet residual network and is used to efficiently extract the facial key regions most relevant to the audio while ignoring secondary regions outside the face;
The output of the CBAM module is taken as the preliminary visual features extracted by the network, which are used as the Transformer encoder input in the model.
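For illustration only, the following PyTorch sketch shows one way the 1×7 convolution block with its residual connection could be implemented; the channel count, the padding and the use of 2D convolutions over a (time, embedding) layout are assumptions not fixed by the text, and the CBAM module is omitted here.

```python
import torch
import torch.nn as nn

class ConvResBlock(nn.Module):
    """One conv_res layer of the modified ResNet: two 1x7 conv + BN sub-blocks,
    with the layer input added back before the final ReLU, i.e.
    y = ReLU(x + BN(conv_res(ReLU(BN(conv_res(x))))))."""
    def __init__(self, channels: int):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, kernel_size=(1, 7), stride=1, padding=(0, 3))
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, kernel_size=(1, 7), stride=1, padding=(0, 3))
        self.bn2 = nn.BatchNorm2d(channels)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        out = self.relu(self.bn1(self.conv1(x)))   # first conv sub-block
        out = self.bn2(self.conv2(out))            # second conv sub-block
        return self.relu(x + out)                  # residual addition, then ReLU
```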
Preferably, the audio feature processing module consists of two 1D convolution layers and the Gaussian error linear unit GELU; the output dimension of each 1D convolution layer equals its input dimension, and the Gaussian error linear unit GELU is given by:
Gelu(x) = x · Φ(x) = 0.5 · x · (1 + erf(x / sqrt(2)))
where x is the input of the activation function, Gelu(x) is the output of the activation function, and Φ is the standard Gaussian cumulative distribution function; the audio feature processing module outputs the preliminary audio features extracted by the network, which are used as the Transformer decoder input in the model.
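As a non-authoritative sketch, the audio feature processing module described above (two dimension-preserving 1D convolutions with GELU) might look as follows in PyTorch; the kernel size, the padding and the placement of a GELU after each convolution are assumptions.

```python
import torch
import torch.nn as nn

class AudioFeatureModule(nn.Module):
    """Two 1D convolutions whose output dimension equals their input dimension,
    followed by GELU non-linearities, applied along the time axis of the mel map."""
    def __init__(self, n_mels: int = 80, kernel_size: int = 5):
        super().__init__()
        pad = kernel_size // 2  # "same" padding keeps the time length unchanged
        self.net = nn.Sequential(
            nn.Conv1d(n_mels, n_mels, kernel_size, padding=pad), nn.GELU(),
            nn.Conv1d(n_mels, n_mels, kernel_size, padding=pad), nn.GELU(),
        )

    def forward(self, mel: torch.Tensor) -> torch.Tensor:
        # mel: (batch, time, n_mels) -> Conv1d expects (batch, channels, time)
        return self.net(mel.transpose(1, 2)).transpose(1, 2)
```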
Preferably, the Transformer encoder comprises 6 Transformer encoder modules, each comprising a self-attention layer and an MLP module;
The input of the Transformer encoder is the sum of the output of the visual feature processing module and the sinusoidal position code, which is given by:
PE(pos, 2i) = sin(pos / 10000^(2i/d_model))
PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model))
where PE(pos, 2i) and PE(pos, 2i+1) are the values at position pos and dimensions 2i and 2i+1 of the position-encoding matrix, and d_model is the dimension of the model embedding vector;
In the Transformer encoder module, the Q (Query), K (Key) and V (Value) matrices of the self-attention mechanism are obtained by linearly transforming the video features input to the encoder, and the output of the self-attention mechanism is:
Attention(Q, K, V) = softmax(Q·K^T / sqrt(d_k)) · V
where d_k is the column dimension of the encoder input video features after the linear transformation, and Q, K and V are the matrices obtained by linearly transforming the encoder input video features;
The MLP module comprises two fully connected layers, a Gaussian error linear unit GELU and layer normalization LayerNorm, where the output dimension of the fully connected layers equals the input dimension of the Transformer encoder module;
The self-attention layer and the MLP module of the Transformer encoder module are connected by residual structures, as follows:
X'_att = X_att + Attention(X_att)
X'_mlp = X_mlp + MLP(X_mlp)
where X_mlp is the MLP input, MLP(X_mlp) is the MLP output and X'_mlp is the output of the residual-connected MLP module; X_att is the self-attention layer input, Attention(X_att) is the self-attention layer output and X'_att is the residual-connected self-attention output;
The 6 Transformer encoder modules are likewise connected by a residual before each module, as follows:
X'_enc = X_enc + Encoder(X_enc)
where X_enc is the input of a Transformer encoder module, Encoder(X_enc) is its output and X'_enc is the residual-connected output of the module.
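The residually connected encoder module described above can be sketched as follows in PyTorch; the number of attention heads and the hidden width of the MLP are assumptions, and the sinusoidal position code is assumed to have already been added to the input.

```python
import torch
import torch.nn as nn

class EncoderBlock(nn.Module):
    """One Transformer encoder module: a self-attention layer and an MLP
    (two linear layers, GELU, LayerNorm), each wrapped in a residual connection."""
    def __init__(self, d_model: int, n_heads: int = 8, d_ff: int = 2048):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.mlp = nn.Sequential(
            nn.Linear(d_model, d_ff), nn.GELU(),
            nn.Linear(d_ff, d_model), nn.LayerNorm(d_model),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = x + self.attn(x, x, x, need_weights=False)[0]  # residual self-attention
        return x + self.mlp(x)                              # residual MLP

# the 6 encoder modules are then stacked, again with a residual around each one:
# for block in encoder_blocks: x = x + block(x)
```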
Preferably, the Transformer decoder comprises 6 Transformer decoder modules, each comprising a self-attention layer, a cross-attention layer and an MLP module;
The input of the Transformer decoder is the sum of the output of the audio feature processing module and a learnable position code;
In the Transformer decoder module, the Q (Query), K (Key) and V (Value) matrices of the self-attention mechanism are obtained by linearly transforming the audio features input to the decoder, and the output of the self-attention mechanism is:
Attention(Q_a, K_a, V_a) = softmax(Q_a·K_a^T / sqrt(d_k)) · V_a
where d_k is the column dimension of the decoder input audio features after the linear transformation, Q_a, K_a and V_a are the matrices obtained by linearly transforming the decoder input audio features, and K_a^T is the transpose of K_a;
In the Transformer decoder module, the Q (Query) matrix of the cross-attention mechanism is obtained by linearly transforming the output of the decoder self-attention layer, while the K (Key) and V (Value) matrices are obtained by linearly transforming the output of the Transformer encoder, and the output of the cross-attention mechanism is:
CrossAttention(Q_c, K_e, V_e) = softmax(Q_c·K_e^T / sqrt(d_k)) · V_e
where CrossAttention(Q_c, K_e, V_e) is the output of the cross-attention layer, d_k is the column dimension of the decoder self-attention output after the linear transformation, Q_c is the Q matrix obtained by linearly transforming the output of the decoder self-attention layer, and K_e and V_e are the K and V matrices obtained by linearly transforming the output of the Transformer encoder;
The MLP module comprises two fully connected layers, a Gaussian error linear unit GELU and layer normalization LayerNorm, where the output dimension of the fully connected layers equals the input dimension of the Transformer decoder module;
The self-attention layer, the cross-attention layer and the MLP module of the Transformer decoder module are all connected by residual structures, as follows:
X'_att = X_att + Attention(X_att)
X'_cross = X_cross + CrossAttention(X_cross)
X'_mlp = X_mlp + MLP(X_mlp)
where X_mlp is the decoder MLP input, MLP(X_mlp) is its output and X'_mlp is the output of the residual-connected MLP module; X_att is the decoder self-attention layer input, Attention(X_att) is its output and X'_att is the residual-connected self-attention output; X_cross is the decoder cross-attention layer input, CrossAttention(X_cross) is its output and X'_cross is the residual-connected cross-attention output;
The 6 Transformer decoder modules are likewise connected by a residual before each module, as follows:
X'_dec = X_dec + Decoder(X_dec)
where X_dec is the input of a Transformer decoder module, Decoder(X_dec) is its output and X'_dec is the residual-connected output of the module.
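Similarly, a minimal PyTorch sketch of the decoder module with its cross-attention layer is given below; it assumes the audio and visual features have already been projected to a common width d_model, and the head count and MLP width are assumptions.

```python
import torch
import torch.nn as nn

class DecoderBlock(nn.Module):
    """One Transformer decoder module: self-attention over the audio features,
    cross-attention in which the audio side supplies Q and the encoder (visual)
    output supplies K and V, and a residual MLP."""
    def __init__(self, d_model: int, n_heads: int = 8, d_ff: int = 2048):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.mlp = nn.Sequential(
            nn.Linear(d_model, d_ff), nn.GELU(),
            nn.Linear(d_ff, d_model), nn.LayerNorm(d_model),
        )

    def forward(self, audio: torch.Tensor, visual_enc: torch.Tensor) -> torch.Tensor:
        a = audio + self.self_attn(audio, audio, audio, need_weights=False)[0]
        # Q comes from the audio branch, K and V from the Transformer encoder output
        a = a + self.cross_attn(a, visual_enc, visual_enc, need_weights=False)[0]
        return a + self.mlp(a)
```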
Preferably, the output of the Transformer decoder is processed by an audio output module to obtain the predicted Mel feature map; the model is trained with a mean square error (MSE) loss between the predicted Mel feature map and the Mel feature map of the clean audio, and the final predicted speech is then obtained by an inverse Mel feature transform.
Preferably, in step 2 the time-domain mixed audio is converted into a spectrogram by a short-time Fourier transform; the audio sampling rate is 16 kHz, the audio clip length is 3 s, the STFT frame length is 512 samples, the frame shift is 160 samples, a Hanning window is used, and the number of Mel filter banks is 80.
The second aspect of the present invention also provides a cross-attention-based audio-visual speech enhancement method, comprising the following steps:
Acquiring a video containing a speaker and the corresponding audio;
Processing the acquired video and corresponding audio, and extracting the Mel feature map of the mixed speech and the video face frames respectively;
Inputting the Mel feature map of the mixed speech and the video face frames into the final audio-visual speech enhancement model constructed by the building method according to the first aspect;
Outputting the final predicted audio.
The third aspect of the present invention also provides an audiovisual speech enhancement device, the device comprising at least one processor and at least one memory, the processor and memory being coupled; a computer-implemented program of a final audio-visual speech enhancement model constructed by the construction method according to the first aspect is stored in the memory; the processor, when executing the computer-implemented program stored in the memory, may cause the processor to perform a cross-attention-based audio-visual speech enhancement method.
The fourth aspect of the present invention also provides a computer-readable storage medium storing a computer-executable program for the final audio-visual speech enhancement model constructed by the construction method according to the first aspect, which when executed by a processor, causes the processor to execute an audio-visual speech enhancement method based on cross-attention.
Compared with the prior art, the invention has the following beneficial effects:
The invention provides a cross-attention-based audio-visual speech enhancement model. It fuses audio-visual features with a cross-attention mechanism and builds the model with a Transformer encoder and decoder; the cross-attention algorithm lets the network better exploit the intrinsic connection between visual and audio information, so better speech enhancement performance can be achieved. Compared with traditional concatenation or addition fusion, which are simple, direct and computation-free but lose much useful information inside the model and make the separated audio inaccurate, the fusion method proposed by the invention is clearly superior in effect. In addition, instead of using only the frequency-domain features of the audio signal, the invention applies a Mel feature transform to the mixed speech signal and makes full use of the speech signal information.
Drawings
FIG. 1 is the ground-truth spectrogram in Embodiment 1 of the present invention.
FIG. 2 is the mixed-audio spectrogram in Embodiment 1 of the present invention.
FIG. 3 is the predicted audio spectrogram in Embodiment 1 of the present invention.
FIG. 4 is a block diagram of the audio-visual speech enhancement model according to the present invention.
FIG. 5 is a block diagram of the MLP module in the Transformer encoder module and decoder module according to the present invention.
FIG. 6 is a block diagram of the Transformer encoder module according to the present invention.
FIG. 7 is a block diagram of the Transformer decoder module according to the present invention.
FIG. 8 is a block diagram of the visual feature processing module according to the present invention.
FIG. 9 is a block diagram of the audio feature processing module according to the present invention.
FIG. 10 is a schematic diagram of the audio-visual speech enhancement device according to Embodiment 2 of the present invention.
Detailed Description
Embodiment 1:
This embodiment further describes the present invention through a specific experimental scenario.
This embodiment uses the AVSpeech and VoxCeleb2 datasets. AVSpeech is a public large-scale audio-visual dataset consisting of speech segments without interfering background signals; the segments vary in length from 3 to 10 seconds, and in each segment the only visible face and the only audible voice in the video belong to a single speaker. In total, the dataset contains about 4700 hours of video clips from roughly 150,000 different speakers, covering a wide variety of people, languages and facial poses. The VoxCeleb2 dataset contains 5994 speakers with a total of 1,092,009 segments in the training set and 118 speakers with 36,237 segments in the test set. Each video used for training is 3 s long.
The noise datasets selected are ESC-50, MS-SNSD and VOICe, which cover noise categories such as natural sounds, human non-speech sounds, urban noise and domestic sounds. 500 noise clips are selected and randomly mixed with 2000 speech segments into 200,000 mixed utterances, which are divided into a training set, a test set and a validation set in the ratio 8.5:1:0.5. In an actual scene, the videos of several speakers and the raw data of the corresponding audio can be obtained; each video is processed into frame-by-frame images, while one speaker's data and noise data are randomly selected from the raw data and their audio is mixed in a proportion determined by the scene requirements. The formula controlling the signal-to-noise ratio is as follows:
y = s + α·n    (1)
where s is the clean speech, n is the noise, y is the noise-added speech, and α is the coefficient that controls the signal-to-noise ratio; the larger α is, the lower the signal-to-noise ratio.
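Assuming the mixing form y = s + α·n reconstructed in formula (1) above, the noise-adding step could be sketched as follows; the function name and the length alignment are illustrative choices.

```python
import numpy as np

def add_noise(speech: np.ndarray, noise: np.ndarray, alpha: float) -> np.ndarray:
    """Mix clean speech s with noise n as y = s + alpha * n;
    a larger alpha adds more noise and therefore lowers the SNR."""
    noise = np.resize(noise, speech.shape)   # crop or tile the noise to the speech length
    return speech + alpha * noise
```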
1. Video feature input processing: each video is cropped at 25 frames per second to obtain 3 × 25 = 75 images arranged along the time dimension. The existing MTCNN face detector (or an offline face detector) based on the OpenCV library is applied to each image to extract the face thumbnail of the target speaker. A Facenet pre-trained model is then used to extract the facial features of each face thumbnail: Facenet maps faces to feature vectors in a Euclidean space through depthwise separable convolutions and performs face recognition by comparing the distances between the face features of different pictures, and the pre-trained model is obtained by training on millions of face pictures. A face embedding vector is extracted for each detected face thumbnail using the lowest spatially invariant layer of the Facenet network; in this embodiment each face embedding vector has 1792 dimensions. The rationale for extracting face features with a pre-trained model is that the embedding of each face retains the information needed to recognize millions of face pictures while discarding irrelevant variation between images, such as illumination and background. Related work has shown that recovering facial expressions from these face embeddings is feasible, and related experiments have also verified that using the raw images as input instead of the face embedding vectors does not improve speech enhancement performance. The facial features of each processed speaker have dimensions (75, 1, 1792, n), where n is the number of speakers. Since the speech enhancement model handles a single speaker, the video feature can be reshaped to (75, 1, 1792); this feature is used as the video-stream input of the model.
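For illustration, the face-thumbnail and embedding step could be approximated with the facenet-pytorch package as below; this is only a stand-in, since the embodiment extracts a 1792-dimensional embedding from a lower, spatially invariant Facenet layer, whereas the off-the-shelf model shown here outputs 512-dimensional embeddings.

```python
import torch
from PIL import Image
from facenet_pytorch import MTCNN, InceptionResnetV1

mtcnn = MTCNN(image_size=160)                               # MTCNN face detector / cropper
embedder = InceptionResnetV1(pretrained="vggface2").eval()  # Facenet-style embedding network

def frame_to_embedding(frame_path: str) -> torch.Tensor:
    face = mtcnn(Image.open(frame_path))     # cropped face thumbnail, or None if no face found
    if face is None:
        return torch.zeros(512)              # placeholder for frames without a detected face
    with torch.no_grad():
        return embedder(face.unsqueeze(0)).squeeze(0)   # one embedding vector per frame
```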
2. Audio feature input processing: because the speech frequency range the human ear can distinguish is 0–8000 Hz, according to the sampling theorem the sampling rate of the training speech is chosen as 16 kHz. Each initial audio clip is a one-dimensional time series of dimension (48000, 1). The short-time Fourier transform (STFT) of the 3-second audio is then computed; the time-frequency representation of the speech is complex-valued, and its expression is given by formula (2):
X(m, k) = sum_{n=0}^{N−1} x(n + m·H) · w(n) · e^(−j·2π·k·n/N)    (2)
where x is the time-domain signal, w is the analysis window, H is the frame shift, N is the frame length, m is the frame index and k is the frequency index.
To obtain the Mel features, the power spectrum of each frame is computed on the basis of the short-time Fourier transform (STFT), i.e. the modulus of the complex values is taken, and a Mel filter bank is then used to convert between linear frequency and Mel frequency. The conversion formula between Mel frequency and normal frequency is:
f_mel = 2595 · log10(1 + f / 700)    (3)
where f_mel is the Mel frequency and f is the linear frequency in Hz. The principle of the Mel filter bank is based on the human auditory system: because the human ear only attends to components in a specific frequency range and its sensitivity differs across the spectrum, the Mel filter bank simulates the ear's non-linear perception of the spectrum, with filters spaced more densely at low frequencies and more sparsely at high frequencies.
For the specific parameters, the STFT in the experiment uses a frame length of 512 samples, a frame shift of 160 samples and a Hanning window, and the number of Mel filter banks is chosen as 80. The Mel feature computed in this way has shape (298, 80), where 298 is the size of the time dimension and 80 is the size of the frequency dimension. The processed audio features are used as the audio-stream input of the model. The original ground-truth spectrogram is shown in FIG. 1 and the preprocessed mixed-audio spectrogram in FIG. 2.
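With the parameters stated above (16 kHz audio, 3-second clips, 512-point STFT, 160-sample hop, Hann window, 80 mel bands), the mel feature extraction might be done with Librosa as follows; the log compression and the exact frame count (which depends on Librosa's padding) are assumptions.

```python
import librosa
import numpy as np

y, sr = librosa.load("mixed.wav", sr=16000, duration=3.0)   # 3 s of 16 kHz mixed speech
mel = librosa.feature.melspectrogram(
    y=y, sr=sr, n_fft=512, hop_length=160, window="hann", n_mels=80
)
log_mel = np.log(mel + 1e-6)   # log compression (an assumption; the text only says "mel features")
features = log_mel.T           # (frames, 80), roughly the (298, 80) shape reported above
```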
3. Audio-visual speech enhancement model structure: as shown in FIG. 4, the model comprises a visual feature processing module, an audio feature processing module, a Transformer encoder and a Transformer decoder.
The Transformer encoder module is shown in FIG. 6. The video input features are first fed into the visual feature processing module of the model, whose structure is shown in FIG. 8; it consists of ResNet residual modules and a convolutional block attention module (CBAM), both of which have good facial feature extraction ability in image processing and recognition, so their detailed principles are not repeated here. After the visual feature processing module, the video feature dimension becomes (75, 1, d_model); the feature is upsampled along the time dimension to align with the audio features, giving dimension (298, 1, d_model), and is reshaped to (298, d_model). Related work shows that the mouth region plays the most important role in the video features for speech separation or enhancement, but other regions such as the eyes and cheeks also contribute; thus, after the input video features pass through this module, the network can capture most of the lip features and some features of other regions. The processed video features are denoted F_v. F_v is added to the sinusoidal positional encoding (Sinusoidal Positional Encoding), the sum is denoted F_v_pos, and it is fed into the Transformer encoder. The sinusoidal position encoding formulas are as follows.
PE(pos, 2i) = sin(pos / 10000^(2i/d_model))    (4)
PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model))    (5)
where PE(pos, 2i) and PE(pos, 2i+1) are the values at position pos and dimensions 2i and 2i+1 of the position-encoding matrix, and d_model is the dimension of the model embedding vector. For the video features F_v_pos of dimension (298, d_model), 298 is the number of positions and d_model is the embedding size.
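Formulas (4) and (5) can be realized directly, for example as the following NumPy function (d_model is assumed to be even):

```python
import numpy as np

def sinusoidal_position_encoding(n_positions: int, d_model: int) -> np.ndarray:
    """PE(pos, 2i) = sin(pos / 10000^(2i/d_model)),
    PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model))."""
    pos = np.arange(n_positions)[:, None]            # (n_positions, 1)
    i = np.arange(0, d_model, 2)[None, :]            # even dimension indices 0, 2, 4, ...
    angle = pos / np.power(10000.0, i / d_model)     # (n_positions, d_model // 2)
    pe = np.zeros((n_positions, d_model))
    pe[:, 0::2] = np.sin(angle)
    pe[:, 1::2] = np.cos(angle)
    return pe

# e.g. F_v_pos = F_v + sinusoidal_position_encoding(298, d_model)
```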
Each Transformer encoder module includes a multi-head self-attention mechanism and an MLP (linear layer) module. The Q (Query), K (Key) and V (Value) matrices of the self-attention mechanism are obtained by linearly transforming the video features F_v_pos of dimension (298, d_model). The output of the self-attention mechanism is:
Attention(Q, K, V) = softmax(Q·K^T / sqrt(d_k)) · V    (6)
where d_k is the column dimension of F_v_pos after the linear transformation, whose value was fixed in the experiments. The MLP module includes two fully connected layers, a Gaussian error linear unit (Gaussian Error Linear Unit, abbreviated GELU) and layer normalization; the specific connections are shown in FIG. 5. The final output of each Transformer encoder module has the same shape as the input video features F_v_pos, i.e. (298, d_model). The encoder part of the final network is composed of 6 such encoder modules stacked together, and the final output of the encoder is denoted F_enc.
As shown in FIG. 7, the input to the decoder part is the Mel features of the audio, which are first fed into the audio feature processing module of the network; its structure is shown in FIG. 9 and consists of two 1D convolutional layers and Gaussian error linear units (GELU). The output dimension is still equal to the audio feature input dimension (298, 80). This output is added to a learnable position code and used as the input of the Transformer decoder module, denoted F_a. Each Transformer decoder module includes a self-attention mechanism, a cross-attention mechanism and an MLP module. The self-attention mechanism in the decoder module follows the same principle as in the encoder; it performs self-attention over the input audio features F_a, with the following output:
Attention(Q_a, K_a, V_a) = softmax(Q_a·K_a^T / sqrt(d_k)) · V_a    (7)
The result is denoted F_sa. The self-attention output F_sa and the encoder output F_enc together form the input of the cross-attention (Cross-attention): F_sa is linearly transformed to give the Q matrix, and F_enc is linearly transformed to give the K and V matrices. The meaning of cross-attention can be understood as the audio features attending to the video information, which is similar to the way the human brain handles audio-visual features. The output of the cross-attention is:
CrossAttention(Q_c, K_e, V_e) = softmax(Q_c·K_e^T / sqrt(d_k)) · V_e    (8)
where d_k is the column dimension of F_sa and of F_enc after their linear transformations, which should in theory be equal; its value was fixed in the experiments. The MLP module includes two fully connected layers, a Gaussian error linear unit (Gaussian Error Linear Unit, abbreviated GELU) and layer normalization, and is identical to the MLP module of the encoder. The final output of each Transformer decoder module has the same shape as the input audio features F_a, i.e. (298, 80). In this embodiment 6 such decoder modules are stacked to form the decoder part of the final network, and the final output of the decoder is used as the input of the network output module.
Network model output: the output of the decoder is fed into two 1D convolutional layers and a GELU, and the feature dimension remains (298, 80). The Mel features of the clean speech are used as the training target, and the Mel features of the predicted speech are trained against them with a mean square error (MSE) loss. For speech reconstruction, the librosa.feature.inverse.mel_to_audio function of the Python Librosa library is used, with parameters fully consistent with those used when extracting the Mel features. The predicted audio spectrogram is shown in FIG. 3.
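A hedged sketch of the reconstruction step with librosa.feature.inverse.mel_to_audio, using the same parameters as during feature extraction, is shown below; undoing the log compression is an assumption that must match whatever compression was applied when the mel features were extracted.

```python
import librosa
import numpy as np

def mel_to_waveform(predicted_mel: np.ndarray) -> np.ndarray:
    """predicted_mel: (298, 80) mel feature map produced by the output module."""
    mel_power = np.exp(predicted_mel.T)     # undo the log compression, back to (80, frames)
    return librosa.feature.inverse.mel_to_audio(
        mel_power, sr=16000, n_fft=512, hop_length=160, window="hann"
    )
```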
4. Model training
Training targets: mel characteristics of clean speech.
Loss function: the experiment uses the mean square error (Mean Squared Error, MSE) as a loss function for model training. The specific definition is as follows:
MSE = (1/N) · Σ_{i=1}^{N} (ŷ_i − y_i)^2    (9)
where y_i is the Mel feature of the clean speech, ŷ_i is the corresponding predicted Mel feature, and N is the number of elements.
Unlike the speech separation experiments, speech enhancement only needs to consider one person's speech, so the loss function does not have to handle interference between different speakers and only the difference between the predicted and ground-truth targets needs to be considered.
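A minimal sketch of one training step with the MSE loss of formula (9) follows; the model object, data loader, optimizer choice and learning rate are hypothetical and not specified in the text.

```python
import torch
import torch.nn as nn

criterion = nn.MSELoss()                                      # formula (9)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)     # optimizer and lr are assumptions

for mixed_mel, face_frames, clean_mel in train_loader:
    predicted_mel = model(mixed_mel, face_frames)             # (batch, 298, 80)
    loss = criterion(predicted_mel, clean_mel)                # MSE against the clean-speech mel target
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```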
5. Experimental results and evaluation:
Comparison with other relevant speech enhancement models: this embodiment compares the proposed model with several audio-visual or audio-only speech enhancement models, including Audio-only CRN, an audio-only speech enhancement model based on a CRN; L2L, a single-channel, speaker-independent speech enhancement/separation model based on an audio-visual neural network; VSE, an audio-visual neural network for visual speech enhancement; AV-(SE)2, an audio-visual speech enhancement model with multiple cross-modal fusion blocks; and MHCA-AVCRN, an improved audio-visual speech enhancement model that uses multi-head cross-attention to learn audio-visual affinity. The specific comparison results are shown in Table 1.
Table 1 comparative results
The data in Table 1 show that, compared with the several recently proposed deep-neural-network-based audio-visual speech enhancement methods listed above, the proposed model achieves excellent performance.
Embodiment 2:
As shown in FIG. 10, the present application also provides an audio-visual speech enhancement device comprising at least one processor and at least one memory, as well as a communication interface and an internal bus; the memory stores a computer-executable program of the final audio-visual speech enhancement model constructed by the building method described in Embodiment 1, and when the processor executes the computer-executable program stored in the memory, the processor can perform the cross-attention-based audio-visual speech enhancement method. The internal bus may be an Industry Standard Architecture (ISA) bus, a Peripheral Component Interconnect (PCI) bus or an Extended Industry Standard Architecture (EISA) bus, among others. The bus may be divided into an address bus, a data bus, a control bus, etc.; for ease of illustration, the buses in the drawings of the present application are not limited to one bus or one type of bus. The memory may include high-speed RAM and may further include non-volatile memory (NVM), such as at least one magnetic disk memory, and may also be a USB flash drive, a removable hard disk, a read-only memory, a magnetic disk or an optical disk.
The device may be provided as a terminal, server or other form of device.
Fig. 10 is a block diagram of an apparatus shown for illustration. The device may include one or more of the following components: a processing component, a memory, a power component, a multimedia component, an audio component, an input/output (I/O) interface, a sensor component, and a communication component. The processing component generally controls overall operation of the electronic device, such as operations associated with display, telephone calls, data communications, camera operations, and recording operations. The processing component may include one or more processors to execute instructions to perform all or part of the steps of the methods described above. Further, the processing component may include one or more modules that facilitate interactions between the processing component and other components. For example, the processing component may include a multimedia module to facilitate interaction between the multimedia component and the processing component.
The memory is configured to store various types of data to support operations at the electronic device. Examples of such data include instructions for any application or method operating on the electronic device, contact data, phonebook data, messages, pictures, videos, and the like. The memory may be implemented by any type of volatile or nonvolatile memory device or combination thereof, such as Static Random Access Memory (SRAM), electrically erasable programmable read-only memory (EEPROM), erasable programmable read-only memory (EPROM), programmable read-only memory (PROM), read-only memory (ROM), magnetic memory, flash memory, magnetic or optical disk.
The communication component is configured to facilitate communication between the electronic device and other devices in a wired or wireless manner. The electronic device may access a wireless network based on a communication standard, such as WiFi,2G, or 3G, or a combination thereof. In one exemplary embodiment, the communication component receives a broadcast signal or broadcast-related information from an external broadcast management system via a broadcast channel. In one exemplary embodiment, the communication component further comprises a Near Field Communication (NFC) module to facilitate short range communications. For example, the NFC module may be implemented based on Radio Frequency Identification (RFID) technology, infrared data association (IrDA) technology, ultra Wideband (UWB) technology, bluetooth (BT) technology, and other technologies.
In an exemplary embodiment, the electronic device may be implemented by one or more application-specific integrated circuits (ASICs), digital signal processors (DSPs), digital signal processing devices (DSPDs), programmable logic devices (PLDs), field-programmable gate arrays (FPGAs), controllers, microcontrollers, microprocessors or other electronic elements for executing the methods described above.
Embodiment 3:
The present invention also provides a computer-readable storage medium storing a computer-executable program of the final audio-visual speech enhancement model constructed by the building method described in Embodiment 1; when the program is executed by a processor, it causes the processor to perform the cross-attention-based audio-visual speech enhancement method.
In particular, a system, apparatus or device may be provided with a readable storage medium on which software program code implementing the functions of any of the above embodiments is stored, and its computer or processor caused to read and execute the instructions stored in the readable storage medium. In this case, the program code read from the readable medium can itself implement the functions of any of the above embodiments, so the machine-readable code and the readable storage medium storing it form part of the present invention.
The storage medium may be implemented by any type or combination of volatile or non-volatile memory devices, such as static random access memory (SRAM), electrically erasable programmable read-only memory (EEPROM), erasable programmable read-only memory (EPROM), programmable read-only memory (PROM), read-only memory (ROM), magnetic memory, flash memory, magnetic or optical disks (e.g., CD-ROM, CD-R, CD-RW, DVD-ROM, DVD-RAM, DVD-RW), magnetic tape, and the like. A storage medium may be any available medium that can be accessed by a general-purpose or special-purpose computer.
It should be understood that the storage medium is coupled to the processor so that the processor can read information from, and write information to, the storage medium; in the alternative, the storage medium may be integral to the processor. The processor and the storage medium may reside in an application-specific integrated circuit (ASIC), or may reside as discrete components in a terminal or server.
The computer readable program instructions described herein may be downloaded from a computer readable storage medium to a respective computing/processing device or to an external computer or external storage device over a network, such as the internet, a local area network, a wide area network, and/or a wireless network. The network may include copper transmission cables, fiber optic transmissions, wireless transmissions, routers, firewalls, switches, gateway computers and/or edge servers. The network interface card or network interface in each computing/processing device receives computer readable program instructions from the network and forwards the computer readable program instructions for storage in a computer readable storage medium in the respective computing/processing device.
The computer program instructions for performing the operations of the present disclosure may be assembly instructions, instruction set architecture (ISA) instructions, machine-related instructions, microcode, firmware instructions, state-setting data, or source or object code written in any combination of one or more programming languages, including object-oriented programming languages such as Smalltalk or C++ and conventional procedural programming languages such as the "C" programming language or similar programming languages. The computer readable program instructions may be executed entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on a remote computer or server. In the case of a remote computer, the remote computer may be connected to the user's computer through any kind of network, including a local area network (LAN) or a wide area network (WAN), or may be connected to an external computer (for example, through the Internet using an Internet service provider). In some embodiments, aspects of the present disclosure are implemented by personalizing electronic circuitry, such as programmable logic circuitry, field-programmable gate arrays (FPGAs) or programmable logic arrays (PLAs), with state information of the computer readable program instructions, and this electronic circuitry can execute the computer readable program instructions.
The above description is only of the preferred embodiments of the present application and is not intended to limit the present application, but various modifications and variations can be made to the present application by those skilled in the art. Any modification, equivalent replacement, improvement, etc. made within the spirit and principle of the present application should be included in the protection scope of the present application.
While the foregoing describes the embodiments of the present invention, it should be understood that the present invention is not limited to the embodiments, and that various modifications and changes can be made by those skilled in the art without any inventive effort.

Claims (6)

1. A method for building a cross-attention-based audio-visual speech enhancement model, characterized by comprising the following steps:
Step 1, obtaining raw data consisting of videos of multiple speakers and the corresponding audio;
Step 2, preprocessing the raw data obtained in step 1: processing each video into frame-by-frame images, randomly selecting one speaker's data and noise data from the raw data, mixing their audio in a certain proportion, performing a Mel transform on the mixed speech to obtain its Mel feature map, constructing a data set together with the face frames corresponding to the speaker's data, and dividing the data set into a training set, a validation set and a test set;
Step 3, constructing the cross-attention-based audio-visual speech enhancement model: constructing a visual feature processing module based on a ResNet network structure and the CBAM attention mechanism; constructing an audio feature processing module based on 1D convolution and the Gaussian error linear unit GELU; obtaining the K and V matrices of the visual features with a Transformer encoder; in the Transformer decoder, replacing the second self-attention layer of the original Transformer decoder with a cross-attention layer, which takes the audio features as the Q matrix and the output of the encoder as the K and V matrices; taking the Mel feature map of the mixed speech and the video face frames as inputs, the model outputs a predicted audio Mel feature map, and the final predicted audio is obtained by applying an inverse Mel-spectrum transform to this feature map;
the visual feature processing module consists of a modified ResNet residual network and the convolutional block attention module CBAM;
The modified ResNet residual network comprises 1 conv5 convolution layer and 4 conv_res layers. The conv5 layer consists of a 5×5 convolution kernel with stride 1, a batch normalization layer BN and a ReLU activation function; each conv_res layer consists of two identical convolution blocks, each comprising a 1×7 convolution kernel with stride 1, a BN layer and a ReLU activation function. The input-output relation of the convolution block can be expressed as:
y = ReLU(x + BN(conv_res(ReLU(BN(conv_res(x))))))
where x is the input of the convolution block, y is its output, and conv_res is the 1×7 convolution operation; the output of the modified ResNet residual network is used as the input of the CBAM module;
The CBAM module consists of a channel attention module and a spatial attention module; it is placed after the modified ResNet residual network and is used to efficiently extract the facial key regions most relevant to the audio while ignoring secondary regions outside the face;
The output of the CBAM module is taken as the preliminary visual features extracted by the network, which are used as the Transformer encoder input in the model;
The audio feature processing module consists of two 1D convolution layers and the Gaussian error linear unit GELU; the output dimension of each 1D convolution layer equals its input dimension, and the Gaussian error linear unit GELU is given by:
Gelu(x) = x · Φ(x) = 0.5 · x · (1 + erf(x / sqrt(2)))
where x is the input of the activation function, Gelu(x) is the output of the activation function, and Φ is the standard Gaussian cumulative distribution function; the audio feature processing module outputs the preliminary audio features extracted by the network, which are used as the Transformer decoder input in the model;
The Transformer encoder comprises 6 Transformer encoder modules, each comprising a self-attention layer and an MLP module;
The input of the Transformer encoder is the sum of the output of the visual feature processing module and the sinusoidal position code, which is given by:
PE(pos, 2i) = sin(pos / 10000^(2i/d_model))
PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model))
where PE(pos, 2i) and PE(pos, 2i+1) are the values at position pos and dimensions 2i and 2i+1 of the position-encoding matrix, and d_model is the dimension of the model embedding vector;
In the Transformer encoder module, the Q, K and V matrices of the self-attention mechanism are obtained by linearly transforming the video features input to the encoder, and the output of the self-attention mechanism is:
Attention(Q, K, V) = softmax(Q·K^T / sqrt(d_k)) · V
where d_k is the column dimension of the encoder input video features after the linear transformation, and Q, K and V are the matrices obtained by linearly transforming the encoder input video features;
The MLP module comprises two fully connected layers, a Gaussian error linear unit GELU and layer normalization LayerNorm, where the output dimension of the fully connected layers equals the input dimension of the Transformer encoder module;
The self-attention layer and the MLP module of the Transformer encoder module are connected by residual structures, as follows:
X'_att = X_att + Attention(X_att)
X'_mlp = X_mlp + MLP(X_mlp)
where X_mlp is the MLP input, MLP(X_mlp) is the MLP output and X'_mlp is the output of the residual-connected MLP module; X_att is the self-attention layer input, Attention(X_att) is the self-attention layer output and X'_att is the residual-connected self-attention output;
The 6 Transformer encoder modules are likewise connected by a residual before each module, as follows:
X'_enc = X_enc + Encoder(X_enc)
where X_enc is the input of a Transformer encoder module, Encoder(X_enc) is its output and X'_enc is the residual-connected output of the module;
The Transformer decoder comprises 6 Transformer decoder modules, each comprising a self-attention layer, a cross-attention layer and an MLP module;
The input of the Transformer decoder is the sum of the output of the audio feature processing module and a learnable position code;
In the Transformer decoder module, the Q, K and V matrices of the self-attention mechanism are obtained by linearly transforming the audio features input to the decoder, and the output of the self-attention mechanism is:
Attention(Q_a, K_a, V_a) = softmax(Q_a·K_a^T / sqrt(d_k)) · V_a
where d_k is the column dimension of the decoder input audio features after the linear transformation, Q_a, K_a and V_a are the matrices obtained by linearly transforming the decoder input audio features, and K_a^T is the transpose of K_a;
In the Transformer decoder module, the Q matrix of the cross-attention mechanism is obtained by linearly transforming the output of the decoder self-attention layer, while the K and V matrices are obtained by linearly transforming the output of the Transformer encoder, and the output of the cross-attention mechanism is:
CrossAttention(Q_c, K_e, V_e) = softmax(Q_c·K_e^T / sqrt(d_k)) · V_e
where CrossAttention(Q_c, K_e, V_e) is the output of the cross-attention layer, d_k is the column dimension of the decoder self-attention output after the linear transformation, Q_c is the Q matrix obtained by linearly transforming the output of the decoder self-attention layer, and K_e and V_e are the K and V matrices obtained by linearly transforming the output of the Transformer encoder;
The MLP module comprises two fully connected layers, a Gaussian error linear unit GELU and layer normalization LayerNorm, where the output dimension of the fully connected layers equals the input dimension of the Transformer decoder module;
The self-attention layer, the cross-attention layer and the MLP module of the Transformer decoder module are all connected by residual structures, as follows:
X'_att = X_att + Attention(X_att)
X'_cross = X_cross + CrossAttention(X_cross)
X'_mlp = X_mlp + MLP(X_mlp)
where X_mlp is the decoder MLP input, MLP(X_mlp) is its output and X'_mlp is the output of the residual-connected MLP module; X_att is the decoder self-attention layer input, Attention(X_att) is its output and X'_att is the residual-connected self-attention output; X_cross is the decoder cross-attention layer input, CrossAttention(X_cross) is its output and X'_cross is the residual-connected cross-attention output;
The 6 Transformer decoder modules are likewise connected by a residual before each module, as follows:
X'_dec = X_dec + Decoder(X_dec)
where X_dec is the input of a Transformer decoder module, Decoder(X_dec) is its output and X'_dec is the residual-connected output of the module;
Step 4, training, testing and evaluating the constructed audio-visual speech enhancement model with the preprocessed data set to obtain the final audio-visual speech enhancement model.
2. The method for building a cross-attention-based audio-visual speech enhancement model according to claim 1, characterized in that the specific preprocessing process in step 2 is as follows:
First, each video is cropped at 25 frames per second to obtain images arranged along the time dimension; the existing OpenCV-based MTCNN face detector is applied to each image to extract the face thumbnail of the target speaker, and a Facenet pre-trained model, obtained by training on a large number of face images, is used to extract the facial features of each face thumbnail. Then one speaker's data and noise data are randomly selected from the raw data, their audio is mixed, a short-time Fourier transform is applied to the mixed speech to obtain its spectrogram, and a data set is constructed together with the facial features corresponding to the speaker's data.
3. The method for building a cross-attention-based audio-visual speech enhancement model according to claim 1, characterized in that the output of the Transformer decoder is processed by an audio output module to obtain the predicted Mel feature map; the model is trained with a mean square error (MSE) loss between the predicted Mel feature map and the Mel feature map of the clean audio, and the final predicted speech is then obtained by an inverse Mel feature transform.
4. The method for building a cross-attention-based audio-visual speech enhancement model according to claim 1, characterized in that in step 2 the time-domain mixed audio is converted into a spectrogram by a short-time Fourier transform; the audio sampling rate is 16 kHz, the audio clip length is 3 s, the STFT frame length is 512 samples, the frame shift is 160 samples, a Hanning window is used, and the number of Mel filter banks is 80.
5. A cross-attention-based audio-visual speech enhancement method, characterized by comprising the following steps:
Acquiring a video containing a speaker and the corresponding audio;
Processing the acquired video and corresponding audio, and extracting the Mel feature map of the mixed speech and the video face frames respectively;
Inputting the Mel feature map of the mixed speech and the video face frames into the final audio-visual speech enhancement model constructed by the building method according to any one of claims 1 to 4;
Outputting the final predicted audio.
6. An audiovisual speech enhancement device, characterized by: the apparatus includes at least one processor and at least one memory, the processor and the memory coupled; a computer-implemented program of a final audio-visual speech enhancement model constructed by the construction method according to any one of claims 1 to 4 is stored in the memory; the processor, when executing the computer-implemented program stored in the memory, may cause the processor to perform a cross-attention-based audio-visual speech enhancement method.
CN202410263766.2A 2024-03-08 2024-03-08 Cross-attention-based audio-visual voice enhancement method and model building method thereof Active CN117854535B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202410263766.2A CN117854535B (en) 2024-03-08 2024-03-08 Cross-attention-based audio-visual voice enhancement method and model building method thereof

Publications (2)

Publication Number Publication Date
CN117854535A CN117854535A (en) 2024-04-09
CN117854535B true CN117854535B (en) 2024-05-07

Family

ID=90548451

Country Status (1)

Country Link
CN (1) CN117854535B (en)


Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11244696B2 (en) * 2019-11-06 2022-02-08 Microsoft Technology Licensing, Llc Audio-visual speech enhancement

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114639387A (en) * 2022-03-07 2022-06-17 哈尔滨理工大学 Voiceprint fraud detection method based on reconstructed group delay-constant Q transform spectrogram
WO2023197749A1 (en) * 2022-04-15 2023-10-19 腾讯科技(深圳)有限公司 Background music insertion time point determining method and apparatus, device, and storage medium
WO2024032159A1 (en) * 2022-08-12 2024-02-15 之江实验室 Speaking object detection in multi-human-machine interaction scenario
CN116013343A (en) * 2022-12-16 2023-04-25 思必驰科技股份有限公司 Speech enhancement method, electronic device and storage medium
CN116129931A (en) * 2023-04-14 2023-05-16 中国海洋大学 Audio-visual combined voice separation model building method and voice separation method
CN117275452A (en) * 2023-05-30 2023-12-22 杭州电子科技大学 Speech synthesis system based on face grid

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
Enhancing Synthesized Speech Detection with Dual Attention Using Features Fusion; Wang, Bo, et al.; CCAT; 2024-02-05; full text *
In-vehicle speech recognition based on noise classification and compensation; Xiang Bingwei, Jing Xinxing, Yang Haiyan; Computer Engineering; 2017-03-15 (No. 03); full text *
Speech enhancement algorithm based on parallel multi-attention; Zhang Chi et al.; Computer Engineering; 2023-11-23; full text *

Also Published As

Publication number Publication date
CN117854535A (en) 2024-04-09

Similar Documents

Publication Publication Date Title
US10777215B2 (en) Method and system for enhancing a speech signal of a human speaker in a video using visual information
Gabbay et al. Visual speech enhancement
Czyzewski et al. An audio-visual corpus for multimodal automatic speech recognition
Hou et al. Audio-visual speech enhancement using multimodal deep convolutional neural networks
WO2020007185A1 (en) Image processing method and apparatus, storage medium and computer device
Adeel et al. Lip-reading driven deep learning approach for speech enhancement
Tan et al. Audio-visual speech separation and dereverberation with a two-stage multimodal network
Chuang et al. Improved lite audio-visual speech enhancement
Gabbay et al. Seeing through noise: Speaker separation and enhancement using visually-derived speech
CN116129931B (en) Audio-visual combined voice separation model building method and voice separation method
CN112466306B (en) Conference summary generation method, device, computer equipment and storage medium
CN117854535B (en) Cross-attention-based audio-visual voice enhancement method and model building method thereof
WO2023020500A1 (en) Speech separation method and apparatus, and storage medium
Luo et al. Multi-Stream Gated and Pyramidal Temporal Convolutional Neural Networks for Audio-Visual Speech Separation in Multi-Talker Environments.
CN115691539A (en) Two-stage voice separation method and system based on visual guidance
CN117877504B (en) Combined voice enhancement method and model building method thereof
CN115472174A (en) Sound noise reduction method and device, electronic equipment and storage medium
Zheng et al. Incorporating ultrasound tongue images for audio-visual speech enhancement through knowledge distillation
Lee et al. Seeing Through the Conversation: Audio-Visual Speech Separation based on Diffusion Model
CN117877504A (en) Combined voice enhancement method and model building method thereof
CN116403599B (en) Efficient voice separation method and model building method thereof
Xiang et al. A two-stage deep representation learning-based speech enhancement method using variational autoencoder and adversarial training
Smietanka et al. Augmented Transformer for Speech Detection in Adverse Acoustical Conditions
Morrone Deep Learning Methodologies for Audio-Visual Speech Processing in Noisy Environments (Metodologie di Apprendimento Profondo per l'Elaborazione Audio-Video del Parlato in Ambienti Rumorosi)
CN112786052B (en) Speech recognition method, electronic equipment and storage device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant