CN115691539A - Two-stage voice separation method and system based on visual guidance - Google Patents

Two-stage voice separation method and system based on visual guidance

Info

Publication number
CN115691539A
CN115691539A
Authority
CN
China
Prior art keywords
voice
stage
features
visual
separation
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202211317835.0A
Other languages
Chinese (zh)
Inventor
魏莹
邓媛洁
张寒冰
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shandong University
Original Assignee
Shandong University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shandong University filed Critical Shandong University
Priority to CN202211317835.0A priority Critical patent/CN115691539A/en
Publication of CN115691539A publication Critical patent/CN115691539A/en
Pending legal-status Critical Current


Abstract

The invention provides a two-stage voice separation method and system based on visual guidance. In the first stage, the speakers' voices are coarsely separated from the acquired mixed voice in the time domain. In the second stage, independent voice features carrying speaker information are extracted from the coarsely separated voice of the first stage; at the same time, potentially correlated and complementary features between the visual and audio modalities are mined, and the multi-modal features are fused and separated to finally obtain pure target voice. By using the first stage to extract each speaker's independent voice features, the invention avoids introducing pure reference voice; visual guidance improves separation performance and robustness and solves the label permutation problem. The invention further improves separation quality by dynamically adjusting the weights of the two-stage model, and the disclosed voice separation system is suitable for most application scenarios.

Description

Two-stage voice separation method and system based on visual guidance
Technical Field
The invention belongs to the technical field of voice separation, and relates to a two-stage voice separation method and system based on visual guidance.
Background
The statements in this section merely provide background information related to the present disclosure and may not necessarily constitute prior art.
Speech separation refers to extracting one or more target speech signals from mixed speech produced by multiple speakers. The speech separation problem stems from the "cocktail party effect," which describes the phenomenon that, in a noisy indoor environment such as a cocktail party, many different kinds of sound sources are present simultaneously, yet a person can focus on one conversation and ignore other conversations or background noise. This is the human ability of auditory selection. We wish machines to acquire this ability to select and screen speech through machine learning. Speech separation has a wide range of applications and is a fundamental and important link in many downstream speech tasks; it can support speech recognition [TORFI A, IRANMANESH S M, NASRABADI N, et al. 3D Convolutional Neural Networks for Cross Audio-Visual Matching Recognition [J]. IEEE Access, 2017] [AFOURAS T, CHUNG J S, SENIOR A, VINYALS O, ZISSERMAN A. Deep Audio-Visual Speech Recognition [J]. IEEE Transactions on Pattern Analysis & Machine Intelligence, 2018], human-machine interaction, and other scenarios.
Conventional speech separation algorithms include computational auditory scene analysis (CASA), non-negative matrix factorization, hidden Markov models, and so on. These methods usually require certain assumptions and prior knowledge [MCDERMOTT J H. The cocktail party problem [J]. Current Biology, 2009], and they also perform poorly when separating the voices of similar speakers [RIVET B, WANG W, NAQVI S M, et al. Audiovisual speech source separation: An overview of key methodologies [J]. IEEE Signal Processing Magazine, 2014, 31(3): 125-134]. With the development of deep learning, neural networks have also achieved good results in the field of speech separation; they can learn the complex mapping between a mixed speech signal and the target speech signals. Common audio-only methods suffer from the label permutation problem when training the separation network. Although the correct matching can be screened out by enumerating permutations, i.e., permutation invariant training (PIT), the computation becomes more complex. In addition, speech is strongly affected by environment and noise, which leads to poor robustness of the separation system.
In real-world scenarios, people assist their own auditory perception by observing the speaker; listening is easier when looking at the speaker's face or lips. In addition, studies in psychology and physiology [GOLUMBIC E Z, COGAN G B, SCHROEDER C E, et al. Visual input enhances selective speech envelope tracking in auditory cortex at a 'cocktail party' [J]. Journal of Neuroscience, 2013, 33(4): 1417-1426] show that visual information helps with speech understanding. Compared with speech, visual information of the speaker, such as lip movement and facial appearance, is more stable; moreover, visual information carries identity characteristics and can be matched to the correct speaker label when separating mixed speech. The separation model proposed by Ephrat in 2017 introduces only a static image; although this reduces the data dimensionality, the temporal visual information is lost and separation performance suffers.
Some studies [AFOURAS T, CHUNG J S, ZISSERMAN A. My lips are concealed: Audio-visual speech enhancement through obstructions [J]. Interspeech, 2019: 4295-9] [OCHIAI T, DELCROIX M, KINOSHITA K, et al. Multimodal SpeakerBeam: Single Channel Target Speech Extraction with Audio-Visual Speaker Clues [M]. Interspeech 2019] [GU R, ZHANG S-X, XU Y, CHEN L, ZOU Y, YU D. Multi-modal multi-channel target speech separation [J]. IEEE Journal of Selected Topics in Signal Processing, 2020] propose introducing additional speaker features into the separation model to improve the effect. Wang et al. used the x-vector for speaker recognition in the separation model, and Luo et al. introduced the i-vector of the speaker's pure speech to aid separation. However, such methods have two drawbacks: first, the separation model can only be trained if the speaker's pure reference speech is provided in advance; second, when the trained model is applied to an actual separation scenario, pure speech of the speakers to be separated is still required, so the method is very limited in practical application scenarios.
Disclosure of Invention
The invention provides a two-stage voice separation method and a two-stage voice separation system based on visual guidance to solve the problems.
According to some embodiments, the invention adopts the following technical scheme:
a two-stage voice separation method based on visual guidance comprises the following steps:
in the first stage, separating the obtained mixed voice on a time domain to obtain roughly separated speaker voice;
in the second stage, independent voice features carrying speaker information are extracted from the audio-only separation result of the first stage; potentially correlated and complementary features between the visual and audio modalities are then mined, the visual features and the time-frequency domain speech features are fused and separated, and the weights of the two stages are dynamically adjusted to finally obtain pure target voice.
As an alternative embodiment, the first stage specifically includes:
coding the obtained mixed voice by using a coder, and extracting the characteristics of the mixed voice;
and separating the mixed voice characteristics to obtain a mask of the target voice, determining the target voice characteristics, decoding the target voice characteristics, and obtaining a coarsely separated target voice time domain signal.
As a further limitation, the specific process of separating the mixed speech features includes processing the mixed speech features by using a first separation network, where the first separation network is a time convolution network structure and includes a normalization layer and a plurality of identical stack modules, each of which is composed of a full convolution, a dilated convolution and a residual module, and the output of the last stack module passes through a convolution layer and a PReLU activation layer to obtain the separated target mask.
By way of further limitation, the target speech features are calculated by multiplying the mixed speech features by the mask of the target speech.
As an alternative embodiment, the second stage specifically includes:
transforming the mixed speech to obtain its complex spectrogram, and obtaining the complex spectral mask of the true pure speech from the complex spectrogram;
transforming the time-domain signals of the target speech obtained in the first stage to obtain the separated complex spectrogram of each speaker, and extracting each speaker's independent speech features from the complex spectrogram through the independent speech feature extraction network ResNet-18;
acquiring visual information of the speaker in time synchronization with the mixed speech and preprocessing it, and respectively extracting static visual features and dynamic visual features from the preprocessed visual images, wherein the static visual features contain distinctive identity information of the speaker and are similar to sound characteristics such as timbre, while the dynamic visual features contain the content information of the speech and are similar to sound characteristics such as phonemes; by combining the two kinds of visual features, more speech-related features can be obtained while less dimensional information is processed;
the mixed speech feature extraction network extracts speech features from the time-frequency domain information of the mixed speech, multi-modal feature fusion is performed, separation network 2 separates the multi-modal features to obtain the mask of the separated target speech, the mask is multiplied by the complex spectrogram of the mixed speech and then inverse-transformed, and the time-domain speech signals of the target speaker are obtained through joint training and dynamically optimized weights of the two separation stages.
As a further limitation, the specific process of transforming the mixed speech to obtain the complex spectrogram of the mixed speech includes performing short-time fourier transform on the mixed speech signal, and then calculating a real part and an imaginary part of the mixed speech signal respectively to obtain the complex spectrogram, where the complex spectrogram includes amplitude and phase information of the speech.
As a further limitation, the coarsely separated speech of the first stage is converted to the time-frequency domain to obtain a complex spectrogram; then the independent voice feature extraction network ResNet-18 is used to extract independent voice features from the complex spectrogram of each speaker; the independent speech features are then transformed in the time dimension to achieve dimensional consistency of the audio and video modality features.
As a further limitation, the specific process of acquiring and preprocessing the visual information of the speaker in time synchronization with the mixed voice comprises reading a video file, intercepting a video with a set length to acquire a multi-frame image sequence, and randomly selecting a frame of facial image as static visual information; and then cutting each frame of image sequence, selecting a lip region with a set size to reduce data dimensionality, and generating a file of the lip sequence as dynamic visual information.
As a further limitation, the specific process of extracting the visual features comprises normalizing the lip images and filling data; the preprocessed lip data passes sequentially through a dynamic visual feature extraction network comprising a 3D convolutional layer, a ShuffleNet v2 and a temporal convolutional network to extract time-series features, so that more content information matching the speech is captured, finally yielding the lip features;
the face image is standardized and processed in size, the features containing speaker identity information are extracted through a static visual feature extraction network ResNet-18, the face features are converted in the time dimension, and the converted features and lip sequence features have the same time dimension.
As a further limitation, the specific process of performing multi-modal feature fusion includes firstly obtaining mixed speech features from the mixed sound spectrogram through a mixed speech feature extraction network, then splicing the visual features, the independent speech features and the mixed speech features of the speaker in a cascading manner, and finally obtaining fused multi-modal features.
By way of further limitation, the multi-modal features are separated using a second separation network, which is an upsampling network layer of U-Net.
By way of further limitation, the weighting of the loss function of the two-stage separation network is dynamically adjusted to take advantage of the independent speech features of the first stage to the maximum extent to assist in the separation of the second stage.
A vision-guidance-based two-stage speech separation system, comprising:
a first separation module configured to, at a first stage, separate the obtained mixed speech in a time domain to obtain a coarsely separated speaker speech;
the second separation module is configured to extract the sound features carrying speaker information by means of the audio-only separation result of the first stage, mine potentially correlated and complementary features between the visual modality and the audio modality, fuse the visual features with the time-frequency domain speech features and then separate them, and dynamically adjust the weights of the two stages to finally obtain the separated target voice;
and the dynamic weight adjusting module dynamically adjusts the weight of the two stages according to the performance of the separation model of the two stages so as to utilize the independent voice characteristics extracted in the first stage to the maximum extent to assist the second stage and realize the voice separation of the pure target speaker.
Compared with the prior art, the invention has the beneficial effects that:
(1) The two-stage voice separation scheme based on visual guidance provided by the invention can extract the voice characteristics of a single speaker to assist voice separation under the condition of only mixed voice, thereby avoiding introducing extra pure reference voice.
(2) The invention simultaneously utilizes the dynamic visual features containing speech content information and the static visual features containing identity information to mine the potential correlation and complementarity between the visual and audio modalities. It also solves the label permutation problem of audio-only speech separation, avoids the computational complexity of permutation-based loss functions, and improves the separation effect and the robustness of the separation system.
(3) The invention provides a method for dynamically adjusting the loss function weights for two-stage voice separation. While the two training targets are optimized simultaneously, the independent voice features extracted in the first stage are utilized to the maximum extent to assist the second-stage separation, and pure voice with higher performance indices is finally obtained.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, are included to provide a further understanding of the invention; they illustrate exemplary embodiments of the invention and together with the description serve to explain the invention, not to limit it.
FIG. 1 is a flow chart of a two-stage speech separation method based on visual guidance according to the present invention.
Fig. 2 is a flowchart of a method for extracting visual features according to the present invention.
FIG. 3 is a flowchart illustrating a method for dynamically adjusting the weighting coefficients of the loss function according to the present invention.
Detailed Description
The invention is further described with reference to the following figures and examples.
It is to be understood that the following detailed description is exemplary and is intended to provide further explanation of the invention as claimed. Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this invention belongs.
It is noted that the terminology used herein is for the purpose of describing particular embodiments only and is not intended to be limiting of exemplary embodiments according to the invention. As used herein, the singular forms "a", "an", and "the" are intended to include the plural forms as well, and it should be understood that when the terms "comprises" and/or "comprising" are used in this specification, they specify the presence of stated features, steps, operations, devices, components, and/or combinations thereof, unless the context clearly indicates otherwise.
The invention provides a two-stage voice separation method and a two-stage voice separation system based on visual guidance.
The specific process comprises the following steps:
1. First stage: audio-only time-domain separation
1. Acquire mixed speech. Randomly select and read the clean speech wav files of two speakers (taking two-speaker mixed speech separation as an example), truncate the read time-domain speech signals to a fixed length (2.55 s as an example) at a sampling rate of 16 kHz, normalize them, and denote them as x_A and x_B respectively. Adding the two clean speech signals gives the mixed speech x_mix = x_A + x_B.
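As an illustrative sketch of this step (the file names, the peak-normalization choice and the soundfile/numpy helpers are assumptions, not part of the original description), the mixing can be written as:

import numpy as np
import soundfile as sf

SR = 16000                     # sampling rate used in the description
SEG_LEN = int(2.55 * SR)       # 2.55 s segments

def load_segment(path):
    wav, sr = sf.read(path)    # mono wav assumed
    assert sr == SR
    wav = wav[:SEG_LEN]        # truncate to the fixed length
    return wav / (np.max(np.abs(wav)) + 1e-8)   # peak normalization (one possible choice)

x_A = load_segment("speaker_A.wav")   # hypothetical paths
x_B = load_segment("speaker_B.wav")
x_mix = x_A + x_B                     # mixed speech x_mix = x_A + x_B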
2. The mixed speech x_mix is passed through an encoder to extract the mixed speech features. The encoder uses a single one-dimensional convolutional layer.
3. The mixed speech features pass through separation network 1 to obtain the masks of the separated target speech. Separation network 1 uses a Time Convolutional Network (TCN) structure. The TCN comprises a normalization layer (group normalization followed by one-dimensional convolution) and n identical stack modules, each of which is composed of a full convolution, a dilated convolution and a residual module. The output of the last stack module passes through a convolutional layer and a PReLU activation layer to obtain the separated target masks.
4. The mixed speech features are multiplied by the mask of each target speech respectively to obtain the target speech features of the corresponding speaker.
5. The obtained target speech features are processed by a decoder to obtain the time-domain signals of the target speech, denoted x'_A, x'_B. The decoder uses a single one-dimensional transposed convolutional layer.
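The following is a minimal PyTorch sketch of the first-stage pipeline under assumed hyperparameters (the filter count, kernel size, stride and the reduced single-block separator are illustrative stand-ins for the full TCN described above, not the exact network of the invention):

import torch
import torch.nn as nn

class FirstStageSeparator(nn.Module):
    """Time-domain separation sketch: encoder -> TCN-style mask estimator -> decoder."""
    def __init__(self, n_spk=2, n_filters=256, kernel=16, stride=8):
        super().__init__()
        self.n_spk = n_spk
        # Encoder: one layer of 1-D convolution over the raw waveform
        self.encoder = nn.Conv1d(1, n_filters, kernel, stride=stride, bias=False)
        # Separation network 1: stand-in for the TCN described in the text
        self.separator = nn.Sequential(
            nn.GroupNorm(1, n_filters), nn.Conv1d(n_filters, n_filters, 1),
            nn.Conv1d(n_filters, n_filters, 3, padding=2, dilation=2), nn.PReLU(),
            nn.Conv1d(n_filters, n_spk * n_filters, 1),
        )
        # Decoder: one layer of 1-D transposed convolution back to the time domain
        self.decoder = nn.ConvTranspose1d(n_filters, 1, kernel, stride=stride, bias=False)

    def forward(self, x_mix):                        # x_mix: (batch, samples)
        w = self.encoder(x_mix.unsqueeze(1))         # mixed-speech features
        masks = torch.sigmoid(self.separator(w))     # one mask per speaker
        masks = masks.view(x_mix.size(0), self.n_spk, -1, w.size(-1))
        # Multiply features by each mask, then decode to coarse estimates x'_A, x'_B
        return [self.decoder(w * masks[:, i]).squeeze(1) for i in range(self.n_spk)]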
2. Second stage: multi-modal time-frequency domain separation
6. Acquire the time-frequency domain information of the mixed speech, i.e., its complex spectrogram, denoted S_mix. The complex spectrogram represents the three-dimensional information of time, frequency and energy on a two-dimensional plane, and keeping its real and imaginary parts avoids the quality degradation caused by the loss of phase information in a pure magnitude spectrum; the complex ideal ratio mask (cIRM) used in step 7 is defined over this representation. First, a short-time Fourier transform (STFT) is performed on the mixed speech signal, and its real and imaginary parts are then computed to obtain the complex spectrogram. The STFT is given in equation (1).
X(n, ω) = Σ_m x(m) w(n − m) e^{−jωm}    (1)
where x(m) is the input speech signal and w(m) is a window function. X(n, ω) is a two-dimensional function of time n and frequency ω. In the present invention, the window function length window_size may be 400, the number of audio samples hop_size between adjacent STFT columns is 160, that is, 160 audio samples overlap between adjacent windows, and the length n_fft of the zero-padded windowed signal is 512. The resulting complex spectrogram has size 2 × F × T, i.e., 2 × 257 × 256, where 2 represents the two dimensions of real and imaginary parts, and F and T represent the frequency and time dimensions, respectively.
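A possible implementation of this transform with the stated parameters (window_size = 400, hop_size = 160, n_fft = 512), using torch.stft and a Hann window as an assumed window function:

import torch

def complex_spectrogram(x, n_fft=512, hop=160, win=400):
    """STFT of a 2.55 s, 16 kHz signal -> (2, 257, 256): real/imag x freq x time."""
    spec = torch.stft(x, n_fft=n_fft, hop_length=hop, win_length=win,
                      window=torch.hann_window(win), center=True,
                      return_complex=True)                 # (257, 256) complex
    return torch.stack([spec.real, spec.imag], dim=0)      # (2, 257, 256)

x_mix = torch.randn(int(2.55 * 16000))      # placeholder waveform
S_mix = complex_spectrogram(x_mix)
print(S_mix.shape)                          # torch.Size([2, 257, 256])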
7. The time-domain clean speech is transformed by STFT to obtain the corresponding complex spectrogram, and the complex spectral masks M_A, M_B of the true clean speech are obtained from it together with the complex spectrogram of the mixed speech:
M = (Y_r S_r + Y_i S_i) / (Y_r² + Y_i²) + j (Y_r S_i − Y_i S_r) / (Y_r² + Y_i²)    (2)
where Y_r and Y_i denote the real and imaginary components of the mixed speech complex spectrogram, and S_r and S_i denote the real and imaginary components of the clean speech complex spectrogram, respectively. From the ideal complex mask and the mixed speech, the clean speech can be obtained as follows:
S=M*Y (3)
where * denotes complex multiplication.
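A sketch of equation (2) for computing the complex spectral mask (the small eps term is an added numerical-stability assumption):

import torch

def cirm(Y, S, eps=1e-8):
    """Complex ideal ratio mask from equation (2).
    Y, S: (2, F, T) real/imag stacks of the mixture and the clean speech."""
    Yr, Yi = Y[0], Y[1]
    Sr, Si = S[0], S[1]
    denom = Yr ** 2 + Yi ** 2 + eps
    Mr = (Yr * Sr + Yi * Si) / denom        # real part of the mask
    Mi = (Yr * Si - Yi * Sr) / denom        # imaginary part of the mask
    return torch.stack([Mr, Mi], dim=0)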
8. Extract independent speech features from the time-domain signals output by the first stage. The separated speech x'_A, x'_B of the first stage is converted to the time-frequency domain by STFT (as in equation (1)) to obtain the separated complex spectrograms S'_A, S'_B of each speaker; the speech feature of each speaker is then obtained through a ResNet-18 network, denoted α_A, α_B, with dimension 128 × 1. Since the extracted speech features come from the separated speech, they characterize the speaker's identity to a certain extent and can provide effective identity information for the second-stage separation.
9. Acquire the speaker's visual information, time-synchronized with the mixed speech, and preprocess it. First read the video file and truncate it to 2.55 s. The video frame rate is 25 frames per second, so a 64-frame image sequence is obtained. To keep the speaker's identity characteristics while reducing the complexity of data processing, one facial image frame is randomly selected as the static visual information; the 64-frame image sequence is then cropped to select a lip region of size 88 × 88, and an h5 file of the lip sequence is generated as the dynamic visual information.
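An illustrative preprocessing sketch (the fixed lip-crop coordinates, the OpenCV/h5py helpers and the output file naming are assumptions; in practice the lip region would typically be located from facial landmarks):

import cv2
import h5py
import numpy as np

def preprocess_video(path, n_frames=64, lip_box=(100, 188, 80, 168)):
    # Read up to n_frames colour frames from the clip.
    cap = cv2.VideoCapture(path)
    frames = []
    while len(frames) < n_frames:
        ok, frame = cap.read()
        if not ok:
            break
        frames.append(frame)                                  # BGR colour frame
    cap.release()

    # Static visual information: one randomly chosen face frame (kept in colour).
    face = frames[np.random.randint(len(frames))]

    # Dynamic visual information: an 88 x 88 lip crop per frame (grayscale here).
    y0, y1, x0, x1 = lip_box                                  # hypothetical fixed crop
    gray = [cv2.cvtColor(f, cv2.COLOR_BGR2GRAY) for f in frames]
    lips = np.stack([g[y0:y1, x0:x1] for g in gray])          # (64, 88, 88)

    with h5py.File(path.replace(".mp4", "_lips.h5"), "w") as f:
        f.create_dataset("lips", data=lips)
    return lips, face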
10. Extract the static visual features containing identity information and the dynamic visual features containing speech content information, respectively. First, the lip images are normalized and the data is padded to improve the precision and stability of the feature extraction model. The preprocessed lip data passes sequentially through a 3D convolutional layer, a ShuffleNet v2 network and a TCN to extract time-series features; the resulting lip feature dimension is 512 × 1 × 64, denoted f_lip_A, f_lip_B. The facial image is then normalized and resized to speed up the convergence of the feature extraction model. Since the face image is a color image with 3 channels, the preprocessed face data has size 3 × 224 × 224, and features with dimension 128 × 1 are extracted through the residual network ResNet-18. In order to fuse the lip features and the facial features, the facial features are replicated along the time dimension so that the converted face features and the lip sequence features have the same time dimension, namely 128 × 1 × 64, denoted f_face_A, f_face_B.
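A sketch of the two visual feature extractors using torchvision backbones (the 3D front-end configuration, the projection head and the 64-step tiling are assumptions that reproduce the stated dimensions; the TCN on top of ShuffleNet v2 is reduced to a single temporal convolution for brevity):

import torch
import torch.nn as nn
import torchvision.models as models

class LipEncoder(nn.Module):
    """Dynamic visual features: 3D conv front-end + ShuffleNet v2 + temporal conv."""
    def __init__(self, out_dim=512):
        super().__init__()
        self.front3d = nn.Conv3d(1, 3, kernel_size=(5, 7, 7), padding=(2, 3, 3))
        backbone = models.shufflenet_v2_x1_0(weights=None)
        backbone.fc = nn.Identity()                      # 1024-d per-frame features
        self.backbone = backbone
        self.temporal = nn.Conv1d(1024, out_dim, kernel_size=3, padding=1)

    def forward(self, lips):                             # (B, 1, 64, 88, 88)
        x = self.front3d(lips)                           # (B, 3, 64, 88, 88)
        B, C, T, H, W = x.shape
        x = x.transpose(1, 2).reshape(B * T, C, H, W)    # per-frame 2-D features
        x = self.backbone(x).reshape(B, T, -1).transpose(1, 2)   # (B, 1024, 64)
        return self.temporal(x)                          # (B, 512, 64), i.e. f_lip

class FaceEncoder(nn.Module):
    """Static visual features: ResNet-18 identity embedding, tiled over time."""
    def __init__(self, out_dim=128, n_frames=64):
        super().__init__()
        backbone = models.resnet18(weights=None)
        backbone.fc = nn.Linear(backbone.fc.in_features, out_dim)
        self.backbone = backbone
        self.n_frames = n_frames

    def forward(self, face):                             # (B, 3, 224, 224)
        emb = self.backbone(face)                        # (B, 128)
        return emb.unsqueeze(-1).expand(-1, -1, self.n_frames)  # (B, 128, 64), f_face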
11. Acquire the speech features of the mixed speech and perform multi-modal feature fusion. First, the mixed speech complex spectrogram S_mix is passed through the U-Net down-sampling network layers to obtain the mixed speech feature α_mix with dimension 512 × 1 × 64. The speech features α_A, α_B of the separated speech are then converted in the time dimension to match the visual feature dimensions, and the speakers' visual features f_lip_A, f_lip_B, f_face_A, f_face_B are concatenated in a cascade with the sound features α_A, α_B, α_mix; the fused multi-modal feature dimension is 2048 × 1 × 64.
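A sketch of the fusion step showing how the stated dimensions combine to 2048 × 1 × 64 (the tiling of the 128 × 1 features over 64 time steps is the time-dimension conversion described above):

import torch

def fuse_features(f_lip_A, f_lip_B, f_face_A, f_face_B, a_A, a_B, a_mix):
    """Concatenate visual and audio features along the channel axis.
    Lip features:   (512, 1, 64) each
    Face features:  (128, 1) each  -> repeated over 64 time steps
    Speaker speech: (128, 1) each  -> repeated over 64 time steps
    Mixture speech: (512, 1, 64)
    Result: (2048, 1, 64) fused multi-modal feature."""
    def tile(v):                                    # (128, 1) -> (128, 1, 64)
        return v.unsqueeze(-1).expand(-1, -1, 64)
    feats = [f_lip_A, f_lip_B, tile(f_face_A), tile(f_face_B),
             tile(a_A), tile(a_B), a_mix]
    return torch.cat(feats, dim=0)                  # 512*2 + 128*4 + 512 = 2048 channels

fused = fuse_features(torch.randn(512, 1, 64), torch.randn(512, 1, 64),
                      torch.randn(128, 1), torch.randn(128, 1),
                      torch.randn(128, 1), torch.randn(128, 1),
                      torch.randn(512, 1, 64))
print(fused.shape)   # torch.Size([2048, 1, 64])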
12. The multi-modal features are passed through separation network 2 to obtain the masks M''_A, M''_B of the separated target speech. Separation network 2 consists of the up-sampling network layers of U-Net.
13. The target masks M''_A, M''_B are multiplied by the mixed speech complex spectrogram S_mix respectively to obtain the second-stage separated complex spectrograms S''_A, S''_B of the speakers. Applying the inverse short-time Fourier transform (iSTFT) to S''_A, S''_B yields the time-domain speech signals x''_A, x''_B of the target speakers. The calculation is as follows:
S''_A = M''_A * S_mix    (4)
S''_B = M''_B * S_mix    (5)
x''_A = iSTFT(S''_A)    (6)
x''_B = iSTFT(S''_B)    (7)
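A sketch of formulas (4)-(7): applying a second-stage complex mask to the mixture spectrogram and returning to the time domain via torch.istft (window parameters as in equation (1); the Hann window is an assumption):

import torch

def reconstruct(M2, S_mix, n_fft=512, hop=160, win=400, length=int(2.55 * 16000)):
    """Apply a second-stage complex mask to the mixture spectrogram and
    return the time-domain signal.
    M2, S_mix: (2, F, T) real/imag stacks."""
    Mr, Mi, Sr, Si = M2[0], M2[1], S_mix[0], S_mix[1]
    S_hat = torch.complex(Mr * Sr - Mi * Si, Mr * Si + Mi * Sr)   # complex multiplication
    return torch.istft(S_hat, n_fft=n_fft, hop_length=hop, win_length=win,
                       window=torch.hann_window(win), length=length)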
3. Dynamically adjusting the two-stage loss function weights
In the present invention, the two stages of the separation process and the individual network modules are trained simultaneously to achieve the optimization goal. The loss function that defines the overall network architecture is as follows:
loss = λ_1 loss_1 + λ_2 loss_2    (8)
loss_1 = −10 log_10 (‖x_target‖² / ‖x_noise‖²)    (9)
loss_2 = ‖M_A − M''_A‖ + ‖M_B − M''_B‖    (10)
where loss_1 and loss_2 are the training loss functions of the two stages, and λ_1 and λ_2 are the training weights of the two loss functions, respectively. For loss_1, x_target is defined as
x_target = (⟨s', s⟩ s) / ‖s‖²
and x_noise is defined as s' − x_target, where s' denotes the separated speech signal and s denotes the clean speech signal.
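A sketch of equations (8)-(10) for a single utterance (the norm choice in loss_2 and the eps terms are assumptions):

import torch

def si_snr_loss(s_hat, s, eps=1e-8):
    """loss_1 in equation (9), written with x_target and x_noise as defined above."""
    x_target = (torch.dot(s_hat, s) / (torch.dot(s, s) + eps)) * s
    x_noise = s_hat - x_target
    return -10 * torch.log10(torch.sum(x_target ** 2) / (torch.sum(x_noise ** 2) + eps))

def mask_loss(M_true_A, M_pred_A, M_true_B, M_pred_B):
    """loss_2 in equation (10): distance between true and estimated complex masks."""
    return torch.norm(M_true_A - M_pred_A) + torch.norm(M_true_B - M_pred_B)

def total_loss(loss1, loss2, lam1=1.0, lam2=1.0):
    """Equation (8): weighted sum of the two stage losses."""
    return lam1 * loss1 + lam2 * loss2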
In order to obtain effective speaker speech features from the speech separated in the first stage and further improve the second-stage separation, the invention proposes a method for dynamically adjusting the weights of the two-stage loss function. In the initial training state, the quality of the speech separated in the first stage is poor and the quality of the correspondingly extracted speaker speech features is also relatively poor, so the weights of the two-stage loss functions are set to λ_1 = λ_2 = 1 and the separation networks of the two stages are trained and optimized simultaneously. As the quality of the separated speech provided by the first stage improves, the corresponding speech features become more discriminative, so the second-stage separation can be significantly improved. When the separation performance of the two stages reaches a threshold relation, the weights are set to λ_1 = 1, λ_2 = 2, and the second-stage separation network is trained with emphasis. The threshold relation can be judged from the separation losses of the two stages. Let the loss of single-stage audio-only speech separation be the negative of the source-to-distortion ratio (SDR), defined as e_0; let the loss of the first-stage speech separation be the negative SDR of the first-stage separated speech, defined as e_1; and let the loss of the second-stage speech separation be the negative SDR of the second-stage separated speech, defined as e_2. When e_1 − e_0 < e_1 − e_2, the speech separated in the first stage improves the second-stage separation and the speech features are effective for the second stage, so the weight assignment is adjusted. SDR is defined as follows:
SDR = 10 log_10 (‖s_target‖² / ‖e_interf + e_noise + e_artif‖²)
where s_target, e_interf, e_noise and e_artif denote the target speaker's speech, interference from other speakers, interference from noise, and interference from other artifacts, respectively.
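The weight schedule can be sketched as follows (a direct transcription of the threshold rule above; the returned values are the stated λ settings):

def adjust_weights(e0, e1, e2):
    """Dynamic loss-weight schedule sketch.
    e0: negative SDR of a single-stage audio-only baseline,
    e1: negative SDR of the first-stage output,
    e2: negative SDR of the second-stage output.
    When e1 - e0 < e1 - e2 the first-stage features are judged helpful,
    so the second-stage loss is emphasized."""
    if e1 - e0 < e1 - e2:
        return 1.0, 2.0     # lambda_1 = 1, lambda_2 = 2
    return 1.0, 1.0         # otherwise train both stages equally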
Example two
A vision-guidance-based two-stage speech separation system, comprising:
a first separation module configured to separate the obtained mixed speech in a time domain to obtain an independent speaker speech in a first stage;
the second separation module is configured to extract independent voice features carrying speaker information by means of the audio-only separation result of the first stage, mine potentially correlated and complementary features between the visual modality and the audio modality, fuse the visual features with the time-frequency domain speech features, and then separate the two modalities to finally obtain the separated target voice;
and the dynamic weight adjusting module dynamically adjusts the weight of the two stages according to the performance of the separation model of the two stages so as to utilize the independent voice characteristics extracted in the first stage to the maximum extent to assist the second stage and realize the voice separation of the pure target speaker.
As will be appreciated by one skilled in the art, embodiments of the present invention may be provided as a method, system, or computer program product. Accordingly, the present invention may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present invention may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
The present invention has been described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the invention. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
The above description is only a preferred embodiment of the present invention and is not intended to limit the present invention, and various modifications and changes may be made by those skilled in the art. Any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention should be included in the protection scope of the present invention.
Although the embodiments of the present invention have been described with reference to the accompanying drawings, it is not intended to limit the scope of the present invention, and it should be understood by those skilled in the art that various modifications and variations can be made without inventive efforts by those skilled in the art based on the technical solution of the present invention.

Claims (10)

1. A two-stage voice separation method based on visual guidance is characterized by comprising the following steps:
in the first stage, separating the obtained mixed voice on a time domain to obtain roughly separated speaker voice;
in the second stage, independent sound features carrying speaker identity information are first extracted from the coarsely separated speech of the first stage, then potentially correlated and complementary features between the visual modality and the audio modality are mined, the two modalities of visual features and time-frequency domain speech features are fused and then separated, and the separated target speech is finally obtained after dynamically adjusting the weights of the two stages.
2. The two-stage voice separation method based on visual guidance as claimed in claim 1, wherein the first stage comprises the following specific processes:
coding the obtained mixed voice by using a coder, and extracting the characteristics of the mixed voice;
and separating the mixed voice characteristics by using a first separation network to obtain a mask of the target voice, multiplying the mask and the mixed voice characteristics, and then decoding to obtain a time domain signal of the target voice.
3. The method as claimed in claim 2, wherein the step of separating the mixed speech features in the first stage comprises processing the mixed speech features by using a first separation network, wherein the first separation network is a time convolution network structure and comprises a normalization layer and a plurality of identical stack modules, each of which comprises a full convolution layer, a dilated convolution layer and a residual module, and the output of the last stack module passes through the convolution layer and the PReLU activation layer to obtain a separated target mask;
or, the target voice feature is obtained by multiplying the mixed voice feature and the mask of the target voice, and the target voice feature is decoded to obtain the time domain signal of the target voice.
4. The two-stage voice separation method based on visual guidance as claimed in claim 1, wherein the second stage comprises the following specific processes:
transforming the mixed voice to obtain a complex spectrum image of the mixed voice, and acquiring a complex spectrum mask of the real pure voice according to the complex spectrum image;
converting the time domain signal of the target voice acquired in the first stage to obtain a separated complex spectrogram of each speaker, and extracting the independent voice feature of each speaker;
and acquiring visual information of the speaker in time synchronization with the mixed voice, preprocessing the visual information, and respectively extracting static visual features and dynamic visual features from the preprocessed visual image.
Extracting mixed voice features from the time-frequency domain information of the mixed voice complex spectrogram, performing multi-modal feature fusion, separating the multi-modal features to obtain a mask of the separated target voice, multiplying the mask and the complex spectrogram of the mixed voice, and performing inverse transformation to obtain a pure voice signal of the target speaker.
5. The method as claimed in claim 4, wherein the step of converting the time domain signal of the target speech obtained in the first stage comprises: firstly, converting the coarsely separated speech to the time-frequency domain to obtain a complex spectrogram; then using the independent voice feature extraction network ResNet-18 to extract independent voice features from the complex spectrogram of each speaker; the independent speech features are then transformed in the time dimension to achieve dimensional consistency of the audio and video modality features.
6. The two-stage voice separation method based on visual guidance as claimed in claim 4, wherein the specific process of obtaining and pre-processing the visual information of the speaker in time synchronization with the mixed voice comprises reading a video file, intercepting a video with a set length to obtain a multi-frame image sequence, and randomly selecting a frame of facial image as static visual information; and then cutting each frame of image sequence, selecting a lip region with a set size, and generating a file of the lip sequence as dynamic visual information.
7. The two-stage voice separation method based on visual guidance as claimed in claim 6, wherein the specific process of extracting visual features includes normalizing lip images and data filling, the preprocessed lip data sequentially passes through a 3D convolutional layer and a ShuffleNet v2 network, and then passes through a time convolutional network structure to extract time series features, and finally dynamic visual features are obtained, wherein the dynamic visual features include content information of voice;
standardizing and sizing the face image, and extracting static visual features through a residual error network ResNet-18, wherein the static visual features comprise identity information of speakers with distinctiveness; the static visual features are transformed to have the same time dimension as the dynamic visual sequence features.
8. The two-stage voice separation method based on visual guidance as claimed in claim 4, wherein the specific process of performing multi-modal feature fusion includes firstly obtaining mixed voice features by a mixed voice complex spectrogram through a U-Net down-sampling network layer, then performing cascade splicing on visual features, independent voice features and mixed voice features of a speaker, and finally obtaining fused multi-modal features;
or separating the multi-modal features by utilizing a second separation network, wherein the second separation network is an up-sampling network layer of the U-Net.
9. The method as claimed in claim 1, wherein the weight of the loss function for the two-stage speech separation is dynamically adjusted to maximally utilize the independent speech features of the first stage to assist the separation of the second stage.
10. A two-stage speech separation system based on visual guidance, comprising:
the first separation module is configured to separate the acquired mixed voice in a time domain to obtain independent speaker voice in a first stage;
the second separation module is configured to, in the second stage, extract independent voice features carrying speaker information by means of the audio-only separation result of the first stage, mine potentially correlated and complementary features between the visual modality and the audio modality, fuse the visual features with the time-frequency domain speech features, and then separate the two modalities to finally obtain the separated target voice;
and the dynamic weight adjusting module dynamically adjusts the weight of the two stages according to the performance of the separation model of the two stages so as to utilize the independent voice characteristics extracted in the first stage to the maximum extent to assist the second stage and realize the voice separation of the pure target speaker.
CN202211317835.0A 2022-10-26 2022-10-26 Two-stage voice separation method and system based on visual guidance Pending CN115691539A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211317835.0A CN115691539A (en) 2022-10-26 2022-10-26 Two-stage voice separation method and system based on visual guidance

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211317835.0A CN115691539A (en) 2022-10-26 2022-10-26 Two-stage voice separation method and system based on visual guidance

Publications (1)

Publication Number Publication Date
CN115691539A true CN115691539A (en) 2023-02-03

Family

ID=85098861

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211317835.0A Pending CN115691539A (en) 2022-10-26 2022-10-26 Two-stage voice separation method and system based on visual guidance

Country Status (1)

Country Link
CN (1) CN115691539A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117238311A (en) * 2023-11-10 2023-12-15 深圳市齐奥通信技术有限公司 Speech separation enhancement method and system in multi-sound source and noise environment
CN117238311B (en) * 2023-11-10 2024-01-30 深圳市齐奥通信技术有限公司 Speech separation enhancement method and system in multi-sound source and noise environment

Similar Documents

Publication Publication Date Title
Afouras et al. The conversation: Deep audio-visual speech enhancement
US10777215B2 (en) Method and system for enhancing a speech signal of a human speaker in a video using visual information
Wang et al. Speech emotion recognition with dual-sequence LSTM architecture
CN113035227B (en) Multi-modal voice separation method and system
Gogate et al. DNN driven speaker independent audio-visual mask estimation for speech separation
Biswas et al. Audio codec enhancement with generative adversarial networks
CN113470671B (en) Audio-visual voice enhancement method and system fully utilizing vision and voice connection
Li et al. Deep audio-visual speech separation with attention mechanism
CN114203163A (en) Audio signal processing method and device
WO2023001128A1 (en) Audio data processing method, apparatus and device
CN112863538A (en) Audio-visual network-based multi-modal voice separation method and device
Montesinos et al. Vovit: Low latency graph-based audio-visual voice separation transformer
Zhang et al. Time-domain speech extraction with spatial information and multi speaker conditioning mechanism
Fan et al. Utterance-level permutation invariant training with discriminative learning for single channel speech separation
CN115602165A (en) Digital staff intelligent system based on financial system
CN115691539A (en) Two-stage voice separation method and system based on visual guidance
CN116469404A (en) Audio-visual cross-mode fusion voice separation method
Hussain et al. A Novel Speech Intelligibility Enhancement Model based on Canonical Correlation and Deep Learning
Lee et al. Seeing Through the Conversation: Audio-Visual Speech Separation based on Diffusion Model
CN115938385A (en) Voice separation method and device and storage medium
Chetty Biometric liveness detection based on cross modal fusion
Logeshwari et al. A survey on single channel speech separation
Khan et al. Speaker separation using visual speech features and single-channel audio.
Deng et al. Vision-Guided Speaker Embedding Based Speech Separation
Zhang et al. Audio-visual speech separation with visual features enhanced by adversarial training

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination