CN116129931A - Audio-visual combined voice separation model building method and voice separation method - Google Patents

Audio-visual combined voice separation model building method and voice separation method

Info

Publication number
CN116129931A
CN116129931A (application number CN202310394927.7A)
Authority
CN
China
Prior art keywords
layer
audio
network
convolution
visual
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202310394927.7A
Other languages
Chinese (zh)
Other versions
CN116129931B (en)
Inventor
付民
李贵竹
刘雪峰
孙梦楠
闵健
董亮
刘英哲
闫劢
郑冰
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Ocean University of China
Original Assignee
Ocean University of China
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Ocean University of China filed Critical Ocean University of China
Priority to CN202310394927.7A priority Critical patent/CN116129931B/en
Publication of CN116129931A publication Critical patent/CN116129931A/en
Application granted granted Critical
Publication of CN116129931B publication Critical patent/CN116129931B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00 Processing of the speech or voice signal to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/02 Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L21/0272 Voice signal separating
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10 Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/16 Human faces, e.g. facial parts, sketches or expressions
    • G06V40/168 Feature extraction; Face representation
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/06 Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L15/063 Training
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/08 Speech classification or search
    • G10L15/16 Speech classification or search using artificial neural networks
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/24 Speech recognition using non-acoustical features
    • G10L15/25 Speech recognition using non-acoustical features using position of the lips, movement of the lips or face analysis
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02T CLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T10/00 Road transport of goods or passengers
    • Y02T10/10 Internal combustion engine [ICE] based vehicles
    • Y02T10/40 Engine management systems

Abstract

The invention provides an audio-visual combined voice separation model construction method and a voice separation method, belonging to the technical field of voice separation. The model construction method comprises the following steps: acquiring raw data consisting of the videos of a plurality of speakers and the corresponding audio; preprocessing the acquired raw data to obtain speech spectrograms, face frames and mouth action frames and construct a data set; constructing an audio separation module based on a U-Net network, a face module based on a ResNet-18 network, and a mouth action module based on ShuffleNet-V2 and TCN networks; combining the three into a new network model, then training the model and selecting the model with the highest accuracy. After the model is built, it is used to separate mixed audio. Compared with methods that use a single visual cue, the audio-visual combined voice separation model provided by the invention achieves a clear performance improvement, and comparison experiments on a public data set verify the effectiveness of the method.

Description

Audio-visual combined voice separation model building method and voice separation method
Technical Field
The invention belongs to the technical field of voice separation, and particularly relates to a method for constructing a voice separation model by audio-visual combination and a voice separation method.
Background
In an environment where multiple sound sources are active at the same time, a human listener can process the received sound signals with a highly sensitive auditory system, focusing on the target sound while ignoring other sounds of no interest; Cherry defined this phenomenon as the "cocktail party effect", and it has attracted a great deal of attention. Speech separation, which refers to extracting a single person's voice signal from several overlapping speech signals, is one of the key tasks in solving the cocktail party problem. With the development of intelligent systems, voice separation technology plays a role in numerous voice interaction devices: it can serve as a hearing aid to help hearing-impaired people hear external sounds, facilitate voice control in smart homes, assist mobile phone voice assistants, help analyze voice clues in case investigations, and improve the efficiency and quality of online conferences. However, the performance of current voice separation technology still falls far behind that of the human auditory system, and how to efficiently achieve separation close to the human level remains a technical problem.
Widely used early speech separation methods include spectral subtraction, computational auditory scene analysis and hidden Markov models. These are shallow models that cannot fully extract signal characteristics; their effectiveness often relies on prior knowledge or a specific microphone configuration, and they lack the ability to learn from large amounts of data. In recent years, with the development of deep learning, numerous models that perform well on speech separation have been proposed.
In fact, the ability of a human listener to concentrate on a particular sound depends not only on the sound itself but also on visual information such as the speaker's sex, age, and the opening and closing of the lips. Psychological studies have shown that such non-speech information enhances the ability of humans to focus on target speech in complex environments, and more and more models that incorporate visual information to assist speech separation have been proposed in recent years. In 2018, Google proposed a deep-learning-based joint audio-visual voice separation model whose separation performance is clearly better than that of audio-only methods. A time-domain audio-visual speech separation architecture has also been proposed in which the visual information is processed by a pre-trained lip-embedding extractor, word-level and phoneme-level lip embeddings assist the separation, and the network directly predicts the target speech waveform. However, these methods use only a single kind of visual information; how to effectively extract and use audio and video features so that the method remains robust in more complex scenes still deserves discussion.
Disclosure of Invention
In view of the above problems, a first aspect of the present invention provides a method for constructing an audio-visual combined speech separation model, including the steps of:
step 1, acquiring original data of videos and corresponding audios of a plurality of speakers, wherein the original data are shot or downloaded in different scenes;
step 2, preprocessing the original data obtained in step 1: processing each video into individual image frames; randomly selecting the data of two speakers from the original data and mixing their audio; performing a short-time Fourier transform on the mixed speech to obtain its spectrogram; combining the spectrogram with the face frames and mouth action frames corresponding to the two speakers' data to construct a data set; and dividing the data set into a training set, a verification set and a test set;
step 3, constructing an audio separation module based on the U-Net network structure by establishing residual connections in part of the downsampling convolution blocks of the conventional U-Net to obtain residual convolution blocks, and adding a BN layer between each convolution and the ReLU activation function in the compression path and the expansion path; constructing a face module based on the ResNet-18 network structure by adding a CBAM attention mechanism before the first basic convolution block and after the last basic convolution block of ResNet-18; constructing a mouth action module based on the ShuffleNet-V2 and TCN network structures combined with a 3D convolution layer; and combining the three network modules into an AV-ResUNet network model, wherein the spectrogram of the mixed speech is input into the audio separation module, the face frames are input into the face module, and the mouth action frames are input into the mouth action module;
step 4, training and verifying the AV-ResUNet network model built in step 3 with the training set and verification set from step 2, and selecting the model with the best verification effect during training as the final test model;
and step 5, testing the finally selected AV-ResUNet network model with the data in the test set.
Preferably, the specific preprocessing process in step 2 is as follows: first, each video is processed into individual image frames, and one frame is selected as the face frame; for each image frame, facial key points are obtained with an SFD face detector, position-related differences are removed, the lip region is located, cropped to a fixed size and converted to grayscale to serve as the mouth action frames, 64 frames in total; then the data of two speakers are randomly selected from the original data, their audio is mixed, a short-time Fourier transform is performed on the mixed speech to obtain its spectrogram, and the spectrogram is combined with the face frames and mouth action frames corresponding to the two speakers' data to construct the data set.
Preferably, the audio separation module is improved based on the U-Net network and comprises conv layers, res_conv layers, audio-visual feature fusion, up_conv layers and a Tanh function;
the conv layer consists of a 4×4 convolution kernel with a step size of 2, a BN layer and a ReLU activation function; the two conv layers, unet_conv and unet_upconv, lie in the compression path and the expansion path of the network respectively;
the res_conv layers comprise 6 layers, namely res_conv1, res_conv2, res_conv3, res_conv4, res_conv5 and res_conv6, and each layer consists of two 3×3 convolution kernels with a step size of 1, two BN layers, two ReLU activation functions, one Maxpool layer and one residual connection;
the method comprises the steps of dividing the input data into two types according to the difference of the channel numbers of the input data and the output data, directly adding the input data and the convolved output to be pooled when the channel numbers are the same, and adding the input data after one convolution kernel processing when the channel numbers are different;
the audio-visual feature fusion is a process of fusing audio features obtained after the compression path processing with visual features extracted by a visual network in a time dimension to obtain audio-visual fusion features;
the up_conv layer consists of an Upsample layer, a convolution kernel with the size of 3 multiplied by 3 and the step length of 1, a BN layer and a ReLU activation function, wherein Upsample replaces Maxpool in a compression path;
And the Tanh function compresses data to a section from-1 to 1, outputs a separated mask, multiplies the separated mask by a spectrogram of the mixed voice to obtain an independent speaker voice spectrogram, and then recovers the clean voice of the speaker through inverse short time Fourier transform.
Preferably, the face module is modified based on a ResNet-18 network, and comprises a conv7 layer, a CBAM layer, a res layer, a pooling layer and a linear layer;
the conv7 layer consists of a convolution kernel with the size of 7 multiplied by 7 and the step length of 2, a BN layer and a ReLU activation function, and the output of the conv7 layer is used as the input of the CBAM layer;
the CBAM layer consists of Channel Attention and Spatial Attention; CBAM layers are located before the first res layer and after the last res layer, respectively, and are used to efficiently extract the key face regions most relevant to the audio while ignoring secondary regions outside the face;
the res layer comprises four layers res1, res2, res3 and res4, and each of the four layers comprises 2 convolution blocks, wherein each convolution block in res1 consists of a 3×3 convolution kernel, a BN layer and a ReLU activation function, and the convolution blocks can be represented by the following formula:
y = ReLU(x + BN(conv3(ReLU(BN(conv3(x))))))
wherein x represents the input of the convolution block and y represents the output of the convolution block; conv3 denotes a 3×3 convolution operation, BN denotes the batch normalization layer, and ReLU denotes the ReLU activation function;
The first convolution block in res2, res3 and res4 is the same as res1, the second convolution block is composed of a 3×3 convolution kernel, BN layer, downsampling layer and ReLU activation function, and the second convolution block can be expressed by the following formula:
y = ReLU ( Downsample(x) + BN(conv3(ReLU (BN(conv3(x))))))
wherein Downsample refers to the downsampling layer;
the pooling layer comprises maximum pooling and average pooling; the maximum pooling is located after the first CBAM layer and is used to reduce the number of parameters and the complexity of the network, while the average pooling is located after the second CBAM layer and its output is the input of the final linear layer;
the output of the linear layer is taken as the final facial features extracted by the network, which are replicated along the time dimension and combined with the lip features to form the visual features required by the model.
Preferably, the mouth action module is constructed based on the ShuffleNet-V2 and TCN network structures combined with a 3D convolution layer, wherein the 3D convolution layer consists of a 5×7×7 convolution kernel with a stride of 1×2×2, a BN layer, a ReLU activation function, and a 1×3×3 3D max-pooling layer with a stride of 1×2×2;
the ShuffleNet-V2 network includes a convolutional layer, a pooling layer, a fully-connected layer, a packet convolution, and a depth separable convolution; the TCN network is composed of a plurality of residual blocks, and maps the time-indexed sequence of the feature vector extracted by the ShuffleNet-V2 network into a new sequence by using 1D time convolution, so as to finally obtain the lip motion feature with dimension of 512×64.
Preferably, during the training in step 4, the constructed model uses the complex ideal ratio mask (cIRM) as the training target for the audio, and a triplet loss is used to compute the similarity between the audio and the facial images; the cIRM is calculated as follows:
cIRM = M_r + j·M_i, where M_r = (X_r·S_r + X_i·S_i) / (X_r² + X_i²) and M_i = (X_r·S_i − X_i·S_r) / (X_r² + X_i²)
wherein X_r and X_i represent the real and imaginary parts of the mixed speech signal, and S_r and S_i represent the real and imaginary parts of the clean speech.
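As an illustration, a minimal PyTorch sketch of the cIRM computation implied by the formula above is given below; the function name, the epsilon used for numerical stability and the output layout are assumptions, and the compression of the mask into (-1, 1) is left to the Tanh output layer of the network.

```python
import torch

def compute_cirm(mix_spec: torch.Tensor, clean_spec: torch.Tensor, eps: float = 1e-8) -> torch.Tensor:
    """Complex ideal ratio mask from the complex STFTs of the mixture and the clean source.

    mix_spec, clean_spec: complex tensors of shape (freq, time).
    Returns a real tensor of shape (2, freq, time) holding the real and imaginary mask parts.
    """
    x_r, x_i = mix_spec.real, mix_spec.imag      # X_r, X_i in the formula above
    s_r, s_i = clean_spec.real, clean_spec.imag  # S_r, S_i in the formula above
    denom = x_r ** 2 + x_i ** 2 + eps
    m_r = (x_r * s_r + x_i * s_i) / denom
    m_i = (x_r * s_i - x_i * s_r) / denom
    return torch.stack([m_r, m_i], dim=0)
```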
Preferably, in step 2 the time-domain mixed audio is converted into a spectrogram by the short-time Fourier transform; the audio is sampled at 16 kHz, the audio clip length is 2.55 s, and the STFT uses a window length of 400, a hop size of 160 and an FFT size of 512.
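With these parameters, the mixture spectrogram can be computed as in the following sketch; the Hann window and the (2, freq, time) real/imaginary layout are assumptions consistent with the 2×257×256 input dimension stated in the next paragraph.

```python
import torch

# STFT settings from the description: 16 kHz audio, 2.55 s clips,
# window length 400, hop size 160, FFT size 512.
SR, N_FFT, WIN, HOP = 16000, 512, 400, 160

def mixture_spectrogram(waveform: torch.Tensor) -> torch.Tensor:
    """waveform: (num_samples,) mono mixture sampled at 16 kHz.
    Returns a (2, freq, time) real tensor holding the real and imaginary channels."""
    spec = torch.stft(
        waveform,
        n_fft=N_FFT,
        hop_length=HOP,
        win_length=WIN,
        window=torch.hann_window(WIN),   # window type is an assumption
        return_complex=True,
    )
    # (freq, time, 2) -> (2, freq, time); gives (2, 257, 256) for a 2.55 s clip
    return torch.view_as_real(spec).permute(2, 0, 1)
```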
Preferably, the inputs of the audio separation module are specifically as follows: regarding the visual input, the facial image frames have a size of 224×224 and are extracted by the network into facial features of dimension 128; the mouth action frames are input at 88×88 and extracted by the network into 512×64 mouth features, which are combined with the facial features to finally obtain 640×64 visual features that serve as the visual input of the audio separation module; regarding the audio input, the signal spectrogram of the mixed audio, with dimension 2×257×256, serves as the audio input of the audio separation module, and a prediction mask consistent with the dimension of the input spectrogram is obtained after the network.
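A minimal sketch of how these feature dimensions can be combined is given below; replicating the face embedding over the 64 time steps and concatenating along the channel dimension is an assumption consistent with the stated 640×64 visual feature size.

```python
import torch

def build_visual_features(face_feat: torch.Tensor, mouth_feat: torch.Tensor) -> torch.Tensor:
    """face_feat: (batch, 128) facial embedding from the face module.
    mouth_feat: (batch, 512, 64) lip motion features from the mouth module.
    Returns (batch, 640, 64) visual features for the audio separation module."""
    t = mouth_feat.shape[-1]                                 # 64 time steps
    face_tiled = face_feat.unsqueeze(-1).expand(-1, -1, t)   # replicate along the time dimension
    return torch.cat([face_tiled, mouth_feat], dim=1)        # 128 + 512 = 640 channels
```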
In a second aspect, the invention provides an audio-visual combined speech separation method, comprising the following steps:
acquiring video and corresponding audio containing two speakers;
processing the acquired video and corresponding audio, respectively extracting face frames and mouth action frames of a speaker in the video,
inputting a face frame, a mouth action frame and corresponding audio into a voice separation model constructed by the construction method according to the first aspect;
outputting each separated speaker and the corresponding clean voice.
A third aspect of the present invention further provides an audio-visual combined speech separation device, the device comprising at least one processor and at least one memory coupled to each other; a computer program implementing a speech separation model constructed by the construction method according to the first aspect is stored in the memory; when the processor executes the computer program stored in the memory, the device implements the voice separation method.
Compared with the prior art, the invention has the following beneficial effects:
Compared with cross-modal voice separation models that use only one kind of visual information and with audio-only voice separation models, the assistance of dual visual information allows the network to better exploit the intrinsic connection between visual and audio information and achieves better separation performance. To further improve the extraction of visual features, and considering that the video also contains secondary information besides the face, a two-layer attention mechanism helps the network focus on the most critical face regions during facial feature extraction, so that the visual information is used more efficiently. To address the tendency of the conventional U-Net network model to overlook data details, a residual connection mechanism is introduced into the U-Net network: residual connections added in the convolution blocks of the compression path help the network extract detailed features, and BN layers added after the convolution layers accelerate the training and convergence of the network. Finally, the invention applies the STFT to the mixed speech signal and makes full use of both the amplitude and the phase information of the speech signal.
Drawings
Fig. 1 is a block diagram of an audio-visual combined speech separation model according to the present invention.
Fig. 2 is a network configuration diagram of an audio separation module.
Fig. 3 is a block diagram of a convolution layer in the modified U-Net network.
Fig. 4 is a diagram of a residual convolution block structure in a modified U-Net network.
Fig. 5 is a network configuration diagram of the face module.
FIG. 6 is a schematic diagram of a CBAM layer attention mechanism module.
Fig. 7 is a network configuration diagram of the mouth motion module.
Fig. 8 is a block diagram showing the structure of the speech separation apparatus in embodiment 2.
Detailed Description
The invention will be further described with reference to specific examples.
Example 1:
The invention provides an audio-visual combined voice separation method based on dual visual cues, which, as shown in Figure 1, mainly comprises the following steps:
step 1, acquiring original data of videos and corresponding audios of a plurality of speakers, wherein the original data are shot or downloaded in different scenes;
step 2, preprocessing the original data obtained in step 1: processing each video into individual image frames; randomly selecting the data of two speakers from the original data and mixing their audio; performing a short-time Fourier transform on the mixed speech to obtain its spectrogram; combining the spectrogram with the face frames and mouth action frames corresponding to the two speakers' data to construct a data set; and dividing the data set into a training set, a verification set and a test set;
step 3, constructing an audio separation module based on the U-Net network structure by establishing residual connections in part of the downsampling convolution blocks of the conventional U-Net to obtain residual convolution blocks, and adding a BN layer between each convolution and the ReLU activation function in the compression path and the expansion path; constructing a face module based on the ResNet-18 network structure by adding a CBAM attention mechanism before the first basic convolution block and after the last basic convolution block of ResNet-18; constructing a mouth action module based on the ShuffleNet-V2 and TCN network structures combined with a 3D convolution layer; and combining the three network modules into an AV-ResUNet network model, wherein the spectrogram of the mixed speech is input into the audio separation module, the face frames are input into the face module, and the mouth action frames are input into the mouth action module;
step 4, training and verifying the AV-ResUNet network model built in step 3 with the training set and verification set from step 2, and selecting the model with the best verification effect during training as the final test model;
and step 5, testing the finally selected AV-ResUNet network model with the data in the test set.
This embodiment performs experiments on the VoxCeleb2 dataset, which contains more than 1,000,000 speech segments and their corresponding video segments downloaded from YouTube, with a balanced proportion of male and female speakers coming from many countries.
1. Acquiring raw data
The videos in the VoxCeleb2 dataset are captured in a large number of challenging visual and auditory environments, including red carpets, outdoor stadiums and studio interviews, which results in uneven video quality. Some speakers in the dataset have very blurred video segments, from which it is very difficult for a network to extract useful information. Therefore, in this embodiment the partially blurred videos are deleted from the dataset to ensure better video and audio quality.
2. Data preprocessing
The obtained original data are preprocessed as follows. First, each video is processed into individual image frames, and one frame is selected as the face frame with a resolution of 224×224. For each image frame, facial key points are obtained with an SFD face detector, the face in the video is aligned with a reference plane, and a similarity transformation removes position-related differences and locates the lip region, which is cropped to 96×96 and converted to grayscale to serve as a mouth action frame; the frames are stored as an h5 file so that the subsequent model can read the data conveniently. Then the data of two speakers are randomly selected from the original data, their audio is mixed, a short-time Fourier transform is performed on the mixed speech to obtain its spectrogram, and the spectrogram is combined with the face frames and mouth action frames corresponding to the two speakers' data to construct the data set.
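A possible preprocessing sketch is shown below. The landmark detection itself is omitted: the lip bounding boxes are assumed to come from the SFD-based key-point step described above, and the choice of the middle frame as the face frame and the HDF5 dataset names are assumptions.

```python
import cv2
import h5py
import numpy as np

def preprocess_video(video_path: str, mouth_boxes, out_path: str) -> None:
    """Extract one 224x224 face frame and 64 grayscale 96x96 mouth frames, then store them as HDF5.
    mouth_boxes: list of (x, y, w, h) lip regions, one per frame, obtained beforehand
    from a facial-landmark detector such as SFD (hypothetical input here)."""
    cap = cv2.VideoCapture(video_path)
    frames = []
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        frames.append(frame)
    cap.release()

    # one frame is used as the face frame; taking the middle frame is an assumption
    face_frame = cv2.resize(frames[len(frames) // 2], (224, 224))

    mouth_frames = []
    for frame, (x, y, w, h) in zip(frames[:64], mouth_boxes[:64]):
        crop = cv2.resize(frame[y:y + h, x:x + w], (96, 96))
        mouth_frames.append(cv2.cvtColor(crop, cv2.COLOR_BGR2GRAY))

    with h5py.File(out_path, "w") as f:
        f.create_dataset("face", data=face_frame)
        f.create_dataset("mouth", data=np.stack(mouth_frames))
```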
3. Model construction
In this embodiment, the audio separation module is improved based on the U-Net network, and includes a conv layer, a res_conv layer, an audio-visual feature fusion layer, an up_conv layer, and a Tanh function; the specific structure is shown in figure 2.
The conv layer consists of a 4×4 convolution kernel with a step size of 2, a BN (Batch Normalization) layer and a ReLU activation function; the two conv layers, unet_conv and unet_upconv, lie in the compression path and the expansion path of the network respectively. The detailed structure is shown in Figure 3, where (a) represents the conv layer and (b) represents the up_conv layer of the U-Net network.
The res_conv layers comprise 6 layers, namely res_conv1, res_conv2, res_conv3, res_conv4, res_conv5 and res_conv6, and each layer consists of two 3×3 convolution kernels with a step size of 1, two BN layers, two ReLU activation functions, one Maxpool layer and one residual connection.
The numbers of input and output channels of res_conv1 and res_conv2 differ, while the other four layers have equal numbers of input and output channels, so the residual connections fall into two types: when the channel numbers are the same, the input is added directly to the convolved output and the sum is pooled; when the channel numbers differ, the input is first processed by one convolution kernel and then added. The residual connections effectively avoid vanishing gradients during training and help the network extract small, hard-to-distinguish details. The specific structure is shown in Figure 4, where (a) represents the 1st and 2nd residual convolution blocks and (b) represents the 4th to 6th residual convolution blocks. The numbers of input and output channels of each layer of the improved U-Net network are listed in Table 1, and a sketch of the residual convolution block follows the table.
Table 1. Channel numbers of each layer of the improved U-Net network
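A minimal PyTorch sketch of such a residual convolution block is given below; the 1×1 kernel on the skip path when the channel numbers differ, the use of padding to preserve the spatial size, and the placement of the second ReLU before the addition are assumptions (the description only states that the input passes through one convolution kernel before the addition).

```python
import torch
import torch.nn as nn

class ResConvBlock(nn.Module):
    """Downsampling residual block of the improved U-Net: two 3x3 conv + BN + ReLU,
    a residual connection, and max pooling after the addition."""

    def __init__(self, in_ch: int, out_ch: int):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, kernel_size=3, stride=1, padding=1),
            nn.BatchNorm2d(out_ch),
            nn.ReLU(inplace=True),
            nn.Conv2d(out_ch, out_ch, kernel_size=3, stride=1, padding=1),
            nn.BatchNorm2d(out_ch),
            nn.ReLU(inplace=True),
        )
        # identity skip when channel numbers match, otherwise one convolution on the skip path
        self.skip = nn.Identity() if in_ch == out_ch else nn.Conv2d(in_ch, out_ch, kernel_size=1)
        self.pool = nn.MaxPool2d(kernel_size=2)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.pool(self.body(x) + self.skip(x))
```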
The audio-visual feature fusion is a process of fusing the audio features processed by the compression path with the visual features extracted by the visual network in the time dimension to obtain audio-visual fusion features;
The up_conv layer consists of an Upsample layer, a 3×3 convolution kernel with a step size of 1, a BN layer and a ReLU activation function, where the Upsample layer replaces the Maxpool layer used in the compression path; the specific structure is shown in Figure 3.
The Tanh function compresses the data into the interval from -1 to 1 and outputs the separation mask; the mask is multiplied by the spectrogram of the mixed speech to obtain an individual speaker's speech spectrogram, and the speaker's clean speech is then recovered by the inverse short-time Fourier transform.
The face module is improved based on the ResNet-18 network and includes a conv7 layer, CBAM layers, res layers, pooling layers and a linear layer.
The conv7 layer consists of a 7×7 convolution kernel with a step size of 2, a BN layer and a ReLU activation function; the output of the conv7 layer is used as the input of the CBAM layer. The specific structure of the face module is shown in Figure 5.
The CBAM (Convolutional Block Attention Module) layer consists of Channel Attention and Spatial Attention; CBAM layers are located before the first res layer and after the last res layer, respectively, and are used to efficiently extract the key face regions most relevant to the audio while ignoring secondary regions outside the face. The CBAM layer structure is shown in Figure 6.
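For reference, a minimal sketch of a CBAM layer (channel attention followed by spatial attention) is given below; the reduction ratio of 16 and the 7×7 spatial-attention kernel follow the original CBAM paper and are assumptions here.

```python
import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, _, _ = x.shape
        avg = self.mlp(x.mean(dim=(2, 3)))   # average-pooled descriptor
        mx = self.mlp(x.amax(dim=(2, 3)))    # max-pooled descriptor
        return x * torch.sigmoid(avg + mx).view(b, c, 1, 1)

class SpatialAttention(nn.Module):
    def __init__(self, kernel_size: int = 7):
        super().__init__()
        self.conv = nn.Conv2d(2, 1, kernel_size, padding=kernel_size // 2)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        avg = x.mean(dim=1, keepdim=True)
        mx = x.amax(dim=1, keepdim=True)
        return x * torch.sigmoid(self.conv(torch.cat([avg, mx], dim=1)))

class CBAM(nn.Module):
    """Channel attention followed by spatial attention, as placed before the first
    and after the last res layer of the face module."""
    def __init__(self, channels: int):
        super().__init__()
        self.ca = ChannelAttention(channels)
        self.sa = SpatialAttention()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.sa(self.ca(x))
```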
The res layer comprises four layers res1, res2, res3 and res4, and each comprises 2 convolution blocks, wherein each convolution block in res1 consists of a 3×3 convolution kernel, a BN layer and a ReLU activation function, and the convolution blocks can be represented by the following formula:
y = ReLU(x + BN(conv3(ReLU(BN(conv3(x))))))
wherein x represents the input of the convolution block and y represents the output of the convolution block; conv3 denotes a 3×3 convolution operation, BN denotes the batch normalization layer, and ReLU denotes the ReLU activation function;
the first convolution block in res2, res3 and res4 is the same as res1, the second convolution block is composed of a 3×3 convolution kernel, BN layer, downsampling layer and ReLU activation function, and the second convolution block can be expressed by the following formula:
y = ReLU ( Downsample(x) + BN(conv3(ReLU (BN(conv3(x))))))
wherein Downsample refers to the downsampling layer; the remaining components are identical to those of the first convolution block.
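The two formulas above correspond to a residual block of the following form; whether the block that carries the downsampling layer also strides its first convolution is not specified in the description, so the stride and channel counts in the usage example are assumptions following standard ResNet-18 practice.

```python
from typing import Optional

import torch
import torch.nn as nn

class FaceResBlock(nn.Module):
    """Residual block matching the two formulas above. When `downsample` is given,
    it is applied to the identity branch, as in the second convolution block of res2-res4."""

    def __init__(self, in_ch: int, out_ch: int, stride: int = 1,
                 downsample: Optional[nn.Module] = None):
        super().__init__()
        self.conv1 = nn.Conv2d(in_ch, out_ch, 3, stride=stride, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(out_ch)
        self.conv2 = nn.Conv2d(out_ch, out_ch, 3, stride=1, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(out_ch)
        self.relu = nn.ReLU(inplace=True)
        self.downsample = downsample

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        identity = x if self.downsample is None else self.downsample(x)
        out = self.bn2(self.conv2(self.relu(self.bn1(self.conv1(x)))))  # BN(conv3(ReLU(BN(conv3(x)))))
        return self.relu(identity + out)

# Usage example (assumed channel counts and stride for the downsampling block of res2):
res2_block2 = FaceResBlock(
    64, 128, stride=2,
    downsample=nn.Sequential(nn.Conv2d(64, 128, 1, stride=2, bias=False), nn.BatchNorm2d(128)),
)
```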
The pooling layer comprises a maximum pooling layer and an average pooling layer; the maximum pooling layer is located after the first CBAM layer and is used to reduce the number of parameters and the complexity of the network, while the average pooling is located after the second CBAM layer and its output is the input of the final linear layer. The output of the linear layer is taken as the facial features extracted by the network, which are replicated along the time dimension and combined with the lip features to form the visual features required by the model.
The mouth action module is constructed based on the ShuffleNet-V2 and TCN network structures combined with a 3D convolution layer, where the 3D convolution layer consists of a 5×7×7 convolution kernel with a stride of 1×2×2, a BN layer, a ReLU activation function, and a 1×3×3 3D max-pooling layer with a stride of 1×2×2;
The ShuffleNet-V2 network is composed of convolution layers, pooling layers, a fully-connected layer, grouped convolutions, depthwise separable convolutions and the like; the TCN (Temporal Convolutional Network) consists of a number of residual blocks and maps the time-indexed sequence of the feature vectors extracted by the ShuffleNet-V2 network into a new sequence using 1D temporal convolutions, resulting in lip motion features with dimensions 512×64.
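A sketch of the mouth action module front-end is given below; the 24-channel width of the 3D block, the use of the torchvision ShuffleNet-V2 backbone and the adaptation of its first convolution to the front-end output are assumptions, not part of the original disclosure.

```python
import torch
import torch.nn as nn
import torchvision

class MouthFrontEnd(nn.Module):
    """3D convolution block followed by a ShuffleNet-V2 backbone applied frame by frame;
    the resulting (batch, 512, 64) sequence is what the TCN consumes."""

    def __init__(self):
        super().__init__()
        self.conv3d = nn.Sequential(
            nn.Conv3d(1, 24, kernel_size=(5, 7, 7), stride=(1, 2, 2), padding=(2, 3, 3), bias=False),
            nn.BatchNorm3d(24),
            nn.ReLU(inplace=True),
            nn.MaxPool3d(kernel_size=(1, 3, 3), stride=(1, 2, 2), padding=(0, 1, 1)),
        )
        self.backbone = torchvision.models.shufflenet_v2_x1_0(num_classes=512)
        # the 3D front-end outputs 24 channels, so the backbone stem is adapted to accept them
        self.backbone.conv1[0] = nn.Conv2d(24, 24, kernel_size=3, stride=2, padding=1, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, 1, 64, 96, 96) grayscale mouth frames
        f = self.conv3d(x)                                  # (batch, 24, 64, 24, 24)
        b, c, t, h, w = f.shape
        f = f.transpose(1, 2).reshape(b * t, c, h, w)       # fold time into the batch dimension
        emb = self.backbone(f)                              # (batch*64, 512) per-frame embeddings
        return emb.view(b, t, -1).transpose(1, 2)           # (batch, 512, 64) for the TCN
```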
4. Model training
In this embodiment, the implementation platform of the dual-visual-cue audio-visual combined voice separation method is based on a Linux operating system; the programming language is Python 3.8, the deep learning framework is PyTorch 1.11.0, the CUDA version is 11.1, and an NVIDIA RTX 2080Ti graphics card is used. Adam is used as the optimizer with a learning rate of 0.00001 and a batch size of 8; training runs for 5000 batches in total, and the latest model is stored every 500 iterations. During training, the training effect is checked on the verification set every 100 iterations, and the current best model is stored. Because the full VoxCeleb2 dataset is very large and training on all of it takes too long, the same subset of the data is used to train every compared model in order to save time while ensuring fairness, so the comparison between models is not affected.
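The training schedule described above can be sketched as follows; the model interface, the loss function, the batch layout and the validation helper are assumptions standing in for whatever the actual implementation uses.

```python
import torch

def train(model, criterion, validate, train_loader, val_loader, device: str = "cuda") -> None:
    """Adam with lr 1e-5, 5000 iterations, validation every 100 iterations,
    checkpoint of the latest model every 500 iterations (batch size 8 is set in the loader)."""
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-5)
    best_score, step = float("-inf"), 0
    while step < 5000:
        for batch in train_loader:
            step += 1
            optimizer.zero_grad()
            est_mask = model(batch["spectrogram"].to(device),
                             batch["face"].to(device), batch["mouth"].to(device))
            loss = criterion(est_mask, batch["target_mask"].to(device))
            loss.backward()
            optimizer.step()

            if step % 100 == 0:                  # check the effect on the verification set
                score = validate(model, val_loader)
                if score > best_score:
                    best_score = score
                    torch.save(model.state_dict(), "best_model.pt")
            if step % 500 == 0:                  # keep the latest model
                torch.save(model.state_dict(), "latest_model.pt")
            if step >= 5000:
                break
```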
All data in the dataset are single-speaker recordings; during training, the sound signals of two different speakers are mixed at random, and the time-domain mixed audio is converted into a spectrogram by the STFT. The audio is sampled at 16 kHz, the clip length is 2.55 s, and the STFT uses a window length of 400, a hop size of 160 and an FFT size of 512. The facial image frames have a size of 224×224 and are extracted by the network into facial features of dimension 128; the mouth input consists of 64 grayscale mouth frames of size 96×96, which the network extracts into 512×64 mouth features; combined with the facial features, this finally gives 640×64 visual features that serve as the visual input of the audio separation module. The signal spectrogram of the mixed audio has dimension 2×257×256, and a prediction mask consistent with the dimension of the input spectrogram is obtained after the network. The prediction masks are multiplied with the spectrogram of the mixed speech to obtain the separated individual speaker spectrograms, and the clean speech signals of the speakers are recovered by the inverse short-time Fourier transform (iSTFT).
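A sketch of the mask application and waveform reconstruction step is shown below; treating the mask multiplication as a complex multiplication (as the cIRM target implies) and the Hann window are assumptions.

```python
import torch

def reconstruct_speaker(mix_spec: torch.Tensor, pred_mask: torch.Tensor,
                        n_fft: int = 512, hop: int = 160, win: int = 400) -> torch.Tensor:
    """Apply a predicted complex mask to the mixture spectrogram and invert it with the iSTFT.

    mix_spec, pred_mask: real tensors of shape (2, 257, 256) holding real/imaginary parts.
    Returns the estimated time-domain waveform of one speaker."""
    mix = torch.complex(mix_spec[0], mix_spec[1])
    mask = torch.complex(pred_mask[0], pred_mask[1])
    est = mix * mask                                   # complex multiplication in the T-F domain
    return torch.istft(est, n_fft=n_fft, hop_length=hop, win_length=win,
                       window=torch.hann_window(win))
```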
5. Experimental results
This embodiment compares the separation performance of the proposed method with audio-visual speech separation models that use a single visual cue, and also compares the improved model with the basic model, to verify the effectiveness of the proposed scheme. Commonly used evaluation metrics for the speech separation task include the signal-to-distortion ratio (SDR), the perceptual evaluation of speech quality (PESQ) and the short-time objective intelligibility (STOI). SDR reflects the overall distortion of the signal; STOI measures the correlation between the short-time envelopes of the reference (clean) utterance and the separated utterance; PESQ is a standard metric of speech quality that applies an auditory transform to produce a response spectrum and compares the response spectra of the clean reference and the separated signal. PESQ, SDR and STOI are used as the evaluation metrics for the comparison experiments in this embodiment, and only the results of two-speaker mixtures are compared and analyzed.
Verification of the attention mechanism effect:
In the video feature extraction network, the CBAM attention mechanism added to the face module of the proposed model helps the network extract the key facial information and ignore useless secondary information during separation, thereby improving separation performance. This embodiment compares the effect of adding Squeeze-and-Excitation (SE) and CBAM attention mechanisms. It should be noted that the results shown in the table below are for a separation model implemented on the basis of the original U-Net network. Two samples from the test set are selected to check the effect; the evaluation metrics of the separated speech are shown in Table 2, and the experimental results show that adding the two-layer CBAM attention mechanism helps improve the speech separation effect.
Table 2. Experimental results of different attention mechanisms
It can be seen that adding the CBAM attention mechanism helps the visual feature extraction module extract the facial information more accurately, i.e. the key faces in the video stream contain features that are more beneficial to the separation task, and the experimental results further verify the feasibility of this idea. The separation model with the two-layer CBAM attention mechanism improves the PESQ score by 0.09 and the SDR by 0.8 dB compared with the model without an attention mechanism. Unlike CBAM, adding the SE attention mechanism does not improve the separation effect, which may be because the way CBAM processes visual information is closer to the way the human brain processes visual information in the audio-visual mode.
Verification of the residual connection effect:
Inspired by the fact that residual connections help networks in image processing extract image details and avoid the degradation problem, the proposed model adds residual connections on top of U-Net in the audio signal processing network. When degradation occurs in a deep network (the phenomenon that performance drops as layers are added), the residual connection acts as a bridge that passes information from an upper layer to a lower layer. The proposed model adds residual connections in the convolution layers of the compression path, and the improvement in separation brought by different numbers and types of residual connections is verified. Two samples from the test set are selected for verification, and the evaluation metrics of the separated speech are listed in the table below. As shown in Table 3, the results demonstrate that the network model with 6 residual connections improves the separation performance compared with the U-Net network without residual connections.
Table 3. Experimental results of different residual connections
Verification of the visual information effect:
To verify that using dual visual information improves voice separation performance, this embodiment compares the performance of voice separation models combining audio with mouth information, audio with face information, and audio with both face and lip information. Specifically, in the experiment the mouth-only features, the face-only features and the fused mouth-face features obtained by the visual feature extraction module of the proposed model are each fed to the separation module as the final visual features, while the network structure of the separation module is kept unchanged; the results are shown in Table 4.
Table 4. Experimental effects of different visual cues
The separation effect differs when different visual cues are introduced. As can be seen from the table, the dual visual cues used here exploit the speaker's visual and speech information more fully than mouth motion or facial information used alone. Compared with the method using only facial information, the proposed method improves PESQ by 0.23, SDR by 2.5 dB and STOI by 0.08, achieving better separation performance.
In different application scenarios, the voice separation model built in the invention can be used for voice separation:
firstly, obtaining video and corresponding audio containing two speakers;
processing the acquired video and corresponding audio, respectively extracting face frames and mouth action frames of a speaker in the video,
inputting the face frame, the mouth action frame and the corresponding audio into a voice separation model constructed by the method;
outputting each separated speaker and the corresponding clean voice.
Example 2:
As shown in Fig. 8, the present invention also provides an audio-visual combined speech separation device comprising at least one processor and at least one memory, as well as a communication interface and an internal bus. A computer-executable program of a speech separation model constructed by the construction method described in Embodiment 1 is stored in the memory; when the processor executes the computer-executable program stored in the memory, the device implements the voice separation method. The internal bus may be an Industry Standard Architecture (ISA) bus, a Peripheral Component Interconnect (PCI) bus, or an Extended Industry Standard Architecture (EISA) bus, among others. The bus may be divided into an address bus, a data bus, a control bus, and so on; for ease of illustration, the buses in the drawings of the present application are not limited to only one bus or one type of bus. The memory may include high-speed RAM and may further include non-volatile memory (NVM), such as at least one magnetic disk memory, and may also be a USB flash drive, a removable hard disk, a read-only memory, a magnetic disk, or an optical disk.
The device may be provided as a terminal, server or other form of device.
Fig. 8 is a block diagram of an apparatus shown for illustration. The device may include one or more of the following components: a processing component, a memory, a power component, a multimedia component, an audio component, an input/output (I/O) interface, a sensor component, and a communication component. The processing component generally controls overall operation of the electronic device, such as operations associated with display, telephone calls, data communications, camera operations, and recording operations. The processing component may include one or more processors to execute instructions to perform all or part of the steps of the methods described above. Further, the processing component may include one or more modules that facilitate interactions between the processing component and other components. For example, the processing component may include a multimedia module to facilitate interaction between the multimedia component and the processing component.
The memory is configured to store various types of data to support operations at the electronic device. Examples of such data include instructions for any application or method operating on the electronic device, contact data, phonebook data, messages, pictures, videos, and the like. The memory may be implemented by any type of volatile or nonvolatile memory device or combination thereof, such as Static Random Access Memory (SRAM), electrically erasable programmable read-only memory (EEPROM), erasable programmable read-only memory (EPROM), programmable read-only memory (PROM), read-only memory (ROM), magnetic memory, flash memory, magnetic or optical disk.
The power supply assembly provides power to the various components of the electronic device. Power components may include a power management system, one or more power sources, and other components associated with generating, managing, and distributing power for electronic devices. The multimedia assembly includes a screen between the electronic device and the user that provides an output interface. In some embodiments, the screen may include a Liquid Crystal Display (LCD) and a Touch Panel (TP). If the screen includes a touch panel, the screen may be implemented as a touch screen to receive input signals from a user. The touch panel includes one or more touch sensors to sense touches, swipes, and gestures on the touch panel. The touch sensor may sense not only the boundary of a touch or slide action, but also the duration and pressure associated with the touch or slide operation. In some embodiments, the multimedia assembly includes a front camera and/or a rear camera. When the electronic device is in an operation mode, such as a shooting mode or a video mode, the front camera and/or the rear camera may receive external multimedia data. Each front camera and rear camera may be a fixed optical lens system or have focal length and optical zoom capabilities.
The audio component is configured to output and/or input an audio signal. For example, the audio component includes a Microphone (MIC) configured to receive external audio signals when the electronic device is in an operational mode, such as a call mode, a recording mode, and a voice recognition mode. The received audio signal may be further stored in a memory or transmitted via a communication component. In some embodiments, the audio assembly further comprises a speaker for outputting audio signals. The I/O interface provides an interface between the processing assembly and a peripheral interface module, which may be a keyboard, click wheel, button, etc. These buttons may include, but are not limited to: homepage button, volume button, start button, and lock button.
The sensor assembly includes one or more sensors for providing status assessment of various aspects of the electronic device. For example, the sensor assembly may detect an on/off state of the electronic device, a relative positioning of the assemblies, such as a display and keypad of the electronic device, a change in position of the electronic device or one of the assemblies of the electronic device, the presence or absence of user contact with the electronic device, an orientation or acceleration/deceleration of the electronic device, and a change in temperature of the electronic device. The sensor assembly may include a proximity sensor configured to detect the presence of nearby objects in the absence of any physical contact. The sensor assembly may also include a light sensor, such as a CMOS or CCD image sensor, for use in imaging applications. In some embodiments, the sensor assembly may further include an acceleration sensor, a gyroscopic sensor, a magnetic sensor, a pressure sensor, or a temperature sensor.
The communication component is configured to facilitate communication between the electronic device and other devices in a wired or wireless manner. The electronic device may access a wireless network based on a communication standard, such as WiFi,2G, or 3G, or a combination thereof. In one exemplary embodiment, the communication component receives a broadcast signal or broadcast-related information from an external broadcast management system via a broadcast channel. In one exemplary embodiment, the communication component further comprises a Near Field Communication (NFC) module to facilitate short range communications. For example, the NFC module may be implemented based on Radio Frequency Identification (RFID) technology, infrared data association (IrDA) technology, ultra Wideband (UWB) technology, bluetooth (BT) technology, and other technologies.
In an exemplary embodiment, the electronic device may be implemented by one or more Application Specific Integrated Circuits (ASICs), digital Signal Processors (DSPs), digital Signal Processing Devices (DSPDs), programmable Logic Devices (PLDs), field Programmable Gate Arrays (FPGAs), controllers, microcontrollers, microprocessors, or other electronic elements for executing the methods described above.
The foregoing description is only of the preferred embodiments of the present application and is not intended to limit the same, but rather, various modifications and variations may be made by those skilled in the art. Any modification, equivalent replacement, improvement, etc. made within the spirit and principles of the present application should be included in the protection scope of the present application.
While the foregoing describes the embodiments of the present invention, it should be understood that the present invention is not limited to the embodiments, and that various modifications and changes can be made by those skilled in the art without any inventive effort.

Claims (10)

1. An audio-visual combined voice separation model building method is characterized by comprising the following steps:
step 1, acquiring original data of videos and corresponding audios of a plurality of speakers, wherein the original data are shot or downloaded in different scenes;
step 2, preprocessing the original data obtained in step 1: processing each video into individual image frames; randomly selecting the data of two speakers from the original data and mixing their audio; performing a short-time Fourier transform on the mixed speech to obtain its spectrogram; combining the spectrogram with the face frames and mouth action frames corresponding to the two speakers' data to construct a data set; and dividing the data set into a training set, a verification set and a test set;
step 3, constructing an audio separation module based on the U-Net network structure by establishing residual connections in part of the downsampling convolution blocks of the conventional U-Net to obtain residual convolution blocks, and adding a BN layer between each convolution and the ReLU activation function in the compression path and the expansion path; constructing a face module based on the ResNet-18 network structure by adding a CBAM attention mechanism before the first basic convolution block and after the last basic convolution block of ResNet-18; constructing a mouth action module based on the ShuffleNet-V2 and TCN network structures combined with a 3D convolution layer; and combining the three network modules into an AV-ResUNet network model, wherein the spectrogram of the mixed speech is input into the audio separation module, the face frames are input into the face module, and the mouth action frames are input into the mouth action module;
step 4, training and verifying the AV-ResUNet network model built in step 3 with the training set and verification set from step 2, and selecting the model with the best verification effect during training as the final test model;
and step 5, testing the finally selected AV-ResUNet network model with the data in the test set.
2. The audio-visual combined speech separation model building method according to claim 1, wherein the specific preprocessing process in step 2 is as follows: first, each video is processed into individual image frames, and one frame is selected as the face frame; for each image frame, facial key points are obtained with an SFD face detector, position-related differences are removed, the lip region is located, cropped to a fixed size and converted to grayscale to serve as the mouth action frames, 64 frames in total; then the data of two speakers are randomly selected from the original data, their audio is mixed, a short-time Fourier transform is performed on the mixed speech to obtain its spectrogram, and the spectrogram is combined with the face frames and mouth action frames corresponding to the two speakers' data to construct the data set.
3. The audio-visual combined speech separation model building method according to claim 1, wherein: the audio separation module is improved based on a U-Net network and comprises a conv layer, a res_conv layer, audio-visual feature fusion, an up_conv layer and a Tanh function;
The conv layer consists of a 4×4 convolution kernel with a step size of 2, a BN layer and a ReLU activation function; the two conv layers, unet_conv and unet_upconv, lie in the compression path and the expansion path of the network respectively;
the res_conv layers comprise 6 layers, namely res_conv1, res_conv2, res_conv3, res_conv4, res_conv5 and res_conv6, and each layer consists of two 3×3 convolution kernels with a step size of 1, two BN layers, two ReLU activation functions, one Maxpool layer and one residual connection;
the residual connections fall into two types according to whether the numbers of input and output channels differ: when the channel numbers are the same, the input is added directly to the convolved output and the sum is pooled; when the channel numbers differ, the input is first processed by one convolution kernel and then added;
the audio-visual feature fusion is a process of fusing audio features obtained after the compression path processing with visual features extracted by a visual network in a time dimension to obtain audio-visual fusion features;
the up_conv layer consists of an Upsample layer, a 3×3 convolution kernel with a step size of 1, a BN layer and a ReLU activation function, where the Upsample layer replaces the Maxpool layer used in the compression path;
the Tanh function compresses the data into the interval from -1 to 1 and outputs the separation mask; the mask is multiplied by the spectrogram of the mixed speech to obtain an individual speaker's speech spectrogram, and the speaker's clean speech is then recovered by the inverse short-time Fourier transform.
4. The audio-visual combined speech separation model building method according to claim 1, wherein: the face module is improved based on a ResNet-18 network and comprises a conv7 layer, a CBAM layer, a res layer, a pooling layer and a linear layer;
the conv7 layer consists of a convolution kernel with the size of 7 multiplied by 7 and the step length of 2, a BN layer and a ReLU activation function, and the output of the conv7 layer is used as the input of the CBAM layer;
the CBAM layer consists of Channel Attention and Spatial Attention; CBAM layers are located before the first res layer and after the last res layer, respectively, and are used to efficiently extract the key face regions most relevant to the audio while ignoring secondary regions outside the face;
the res layer comprises four layers res1, res2, res3 and res4, and each of the four layers comprises 2 convolution blocks, wherein each convolution block in res1 consists of a 3×3 convolution kernel, a BN layer and a ReLU activation function, and the convolution blocks can be represented by the following formula:
y = ReLU(x + BN(conv3(ReLU(BN(conv3(x))))))
wherein x represents the input of the convolution block and y represents the output of the convolution block; conv3 denotes a 3×3 convolution operation, BN denotes the batch normalization layer, and ReLU denotes the ReLU activation function;
the first convolution block in res2, res3 and res4 is the same as res1, the second convolution block is composed of a 3×3 convolution kernel, BN layer, downsampling layer and ReLU activation function, and the second convolution block can be expressed by the following formula:
y = ReLU ( Downsample(x) + BN(conv3(ReLU (BN(conv3(x))))))
wherein Downsample refers to the downsampling layer;
the pooling layer comprises maximum pooling and average pooling; the maximum pooling is located after the first CBAM layer and is used to reduce the number of parameters and the complexity of the network, while the average pooling is located after the second CBAM layer and its output is the input of the final linear layer;
the output of the linear layer is taken as the final facial features extracted by the network, which are replicated along the time dimension and combined with the lip features to form the visual features required by the model.
5. The audio-visual combined speech separation model building method according to claim 1, wherein: the mouth action module is constructed based on the ShuffleNet-V2 and TCN network structures combined with a 3D convolution layer, wherein the 3D convolution layer consists of a 5×7×7 convolution kernel with a stride of 1×2×2, a BN layer, a ReLU activation function, and a 1×3×3 3D max-pooling layer with a stride of 1×2×2;
the ShuffleNet-V2 network includes convolution layers, pooling layers, a fully-connected layer, grouped convolutions and depthwise separable convolutions; the TCN network is composed of a number of residual blocks and maps the time-indexed sequence of feature vectors extracted by the ShuffleNet-V2 network into a new sequence using 1D temporal convolutions, finally yielding lip motion features with dimension 512×64.
6. The audio-visual combined speech separation model building method according to claim 1, wherein: in the training process, the model constructed in step 4 uses the complex-domain ideal ratio mask cIRM as the training target for the audio, and uses a triplet loss to calculate the similarity between the audio and the facial image, wherein the cIRM is calculated as follows:
cIRM = (X_r·S_r + X_i·S_i) / (X_r^2 + X_i^2) + i·(X_r·S_i - X_i·S_r) / (X_r^2 + X_i^2)
wherein X_r and X_i represent the real and imaginary parts of the mixed speech signal, and S_r and S_i represent the real and imaginary parts of the clean speech.
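A NumPy sketch of this training target is shown below; the function name and the small epsilon added to avoid division by zero are assumptions, while the mask itself follows the cIRM definition given above.

import numpy as np

def compute_cirm(mix_stft, clean_stft, eps=1e-8):
    # mix_stft, clean_stft: complex STFTs of the mixture and the clean speech
    xr, xi = mix_stft.real, mix_stft.imag           # X_r, X_i
    sr, si = clean_stft.real, clean_stft.imag       # S_r, S_i
    denom = xr ** 2 + xi ** 2 + eps
    m_real = (xr * sr + xi * si) / denom
    m_imag = (xr * si - xi * sr) / denom
    return m_real + 1j * m_imag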
7. The audio-visual combined speech separation model building method according to claim 1, wherein: in step 2, the time-domain mixed audio is converted into a spectrogram through the short-time Fourier transform; the audio is sampled at 16 kHz, each audio segment is 2.55 s long, and the STFT uses a window length of 400, a hop size of 160 and an FFT size of 512.
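With these parameters, a 2.55 s clip at 16 kHz (40800 samples) yields a 257×256 complex spectrogram, which corresponds to the 2×257×256 real/imaginary input mentioned in claim 8. The sketch below is an illustrative use of torch.stft; the Hann window is an assumption.

import torch

def mixture_to_spectrogram(waveform):
    # waveform: 1-D tensor of 40800 samples (2.55 s at 16 kHz)
    spec = torch.stft(waveform,
                      n_fft=512,
                      hop_length=160,
                      win_length=400,
                      window=torch.hann_window(400),
                      return_complex=True)
    return spec                                     # complex tensor of shape (257, 256)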
8. The audio-visual combined speech separation model building method according to claim 1, wherein the input of the audio separation module is specifically: regarding the visual input, the face image frames are of size 224×224 and are extracted by a network into facial features of dimension 128; the mouth action frames are input at size 88×88 and extracted by a network into mouth features of dimension 512×64, which are combined with the facial features to finally obtain 640×64 visual features that serve as the visual input of the audio separation module; regarding the audio input, the signal spectrogram of the mixed audio serves as the audio input of the audio separation module with dimension 2×257×256, and a prediction mask consistent with the dimensions of the input spectrogram is obtained after passing through the network.
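The dimension bookkeeping of this claim can be illustrated by the sketch below, which repeats the 128-dimensional facial feature along the 64 time steps of the lip-motion feature and concatenates the two along the channel axis; the batch-first tensor layout is an assumption.

import torch

def fuse_visual_features(face_feat, mouth_feat):
    # face_feat:  (batch, 128)      facial feature from the face module
    # mouth_feat: (batch, 512, 64)  lip-motion feature from the mouth action module
    t = mouth_feat.shape[-1]                                # 64 time steps
    face_rep = face_feat.unsqueeze(-1).repeat(1, 1, t)      # (batch, 128, 64)
    return torch.cat([face_rep, mouth_feat], dim=1)         # (batch, 640, 64)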
9. An audiovisual combined speech separation method, characterized by comprising the following processes:
acquiring video and corresponding audio containing two speakers;
processing the acquired video and the corresponding audio, and respectively extracting the face frames and mouth action frames of the speakers in the video;
inputting a face frame, a mouth action frame and corresponding audio into a speech separation model constructed by the construction method according to any one of claims 1 to 8;
and outputting each separated speaker and the corresponding clean speech.
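As a rough end-to-end illustration of this method, the sketch below feeds the face frames, mouth action frames and the mixture spectrogram to a trained model and reconstructs one clean waveform per speaker; the model interface (returning one complex mask per speaker) and all tensor shapes are assumptions.

import torch

def separate_speakers(model, face_frames, mouth_frames, mixture_wave):
    window = torch.hann_window(400)
    spec = torch.stft(mixture_wave, n_fft=512, hop_length=160,
                      win_length=400, window=window, return_complex=True)
    with torch.no_grad():
        masks = model(face_frames, mouth_frames, spec)      # assumed: one complex mask per speaker
    return [torch.istft(spec * m, n_fft=512, hop_length=160,
                        win_length=400, window=window) for m in masks]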
10. An audiovisual combined speech separation device, characterized by: the apparatus includes at least one processor and at least one memory coupled to the processor; a computer program implementing the speech separation model constructed by the construction method according to any one of claims 1 to 8 is stored in the memory; when the processor executes the computer program stored in the memory, the processor carries out the speech separation method.
CN202310394927.7A 2023-04-14 2023-04-14 Audio-visual combined voice separation model building method and voice separation method Active CN116129931B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310394927.7A CN116129931B (en) 2023-04-14 2023-04-14 Audio-visual combined voice separation model building method and voice separation method


Publications (2)

Publication Number Publication Date
CN116129931A true CN116129931A (en) 2023-05-16
CN116129931B CN116129931B (en) 2023-06-30

Family

ID=86301228

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310394927.7A Active CN116129931B (en) 2023-04-14 2023-04-14 Audio-visual combined voice separation model building method and voice separation method

Country Status (1)

Country Link
CN (1) CN116129931B (en)

Patent Citations (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20200064444A1 (en) * 2015-07-17 2020-02-27 Origin Wireless, Inc. Method, apparatus, and system for human identification based on human radio biometric information
WO2019104229A1 (en) * 2017-11-22 2019-05-31 Google Llc Audio-visual speech separation
CN112053690A (en) * 2020-09-22 2020-12-08 湖南大学 Cross-modal multi-feature fusion audio and video voice recognition method and system
WO2022062800A1 (en) * 2020-09-25 2022-03-31 华为技术有限公司 Speech separation method, electronic device, chip and computer-readable storage medium
CN113571067A (en) * 2021-06-21 2021-10-29 浙江工业大学 Voiceprint recognition countermeasure sample generation method based on boundary attack
CN113470671A (en) * 2021-06-28 2021-10-01 安徽大学 Audio-visual voice enhancement method and system by fully utilizing visual and voice connection
CN113593601A (en) * 2021-07-27 2021-11-02 哈尔滨理工大学 Audio-visual multi-modal voice separation method based on deep learning
WO2023020500A1 (en) * 2021-08-17 2023-02-23 中移(苏州)软件技术有限公司 Speech separation method and apparatus, and storage medium
CN115376542A (en) * 2022-08-22 2022-11-22 西南科技大学 Low-invasiveness audio-visual voice separation method and system
CN115472168A (en) * 2022-08-24 2022-12-13 武汉理工大学 Short-time voice voiceprint recognition method, system and equipment coupling BGCC and PWPE characteristics
CN115881156A (en) * 2022-12-09 2023-03-31 厦门大学 Multi-scale-based multi-modal time domain voice separation method

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117854535A (en) * 2024-03-08 2024-04-09 中国海洋大学 Cross-attention-based audio-visual voice enhancement method and model building method thereof
CN117854535B (en) * 2024-03-08 2024-05-07 中国海洋大学 Cross-attention-based audio-visual voice enhancement method and model building method thereof
CN117877504A (en) * 2024-03-11 2024-04-12 中国海洋大学 Combined voice enhancement method and model building method thereof

Also Published As

Publication number Publication date
CN116129931B (en) 2023-06-30

Similar Documents

Publication Publication Date Title
Czyzewski et al. An audio-visual corpus for multimodal automatic speech recognition
CN116129931B (en) Audio-visual combined voice separation model building method and voice separation method
CN109801644B (en) Separation method, separation device, electronic equipment and readable medium for mixed sound signal
CN108198569B (en) Audio processing method, device and equipment and readable storage medium
CN110097890B (en) Voice processing method and device for voice processing
CN110210310B (en) Video processing method and device for video processing
CN111583944A (en) Sound changing method and device
CN109887515B (en) Audio processing method and device, electronic equipment and storage medium
CN107945806B (en) User identification method and device based on sound characteristics
CN111326143A (en) Voice processing method, device, equipment and storage medium
CN110991329A (en) Semantic analysis method and device, electronic equipment and storage medium
CN113113044B (en) Audio processing method and device, terminal and storage medium
CN111199160A (en) Instant call voice translation method and device and terminal
CN104851423B (en) Sound information processing method and device
CN112820300B (en) Audio processing method and device, terminal and storage medium
CN113223542B (en) Audio conversion method and device, storage medium and electronic equipment
CN114446318A (en) Audio data separation method and device, electronic equipment and storage medium
CN109102813B (en) Voiceprint recognition method and device, electronic equipment and storage medium
CN116403599B (en) Efficient voice separation method and model building method thereof
CN111933171A (en) Noise reduction method and device, electronic equipment and storage medium
KR20140093459A (en) Method for automatic speech translation
CN113113040B (en) Audio processing method and device, terminal and storage medium
CN108364631B (en) Speech synthesis method and device
KR20080008432A (en) Synchronization method and apparatus of lip-sync to voice signal
CN117854535B (en) Cross-attention-based audio-visual voice enhancement method and model building method thereof

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant