CN115938385A - Voice separation method and device and storage medium - Google Patents

Info

Publication number
CN115938385A
Authority
CN
China
Prior art keywords
spectrogram
feature maps
audio data
feature
data
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202110945149.7A
Other languages
Chinese (zh)
Inventor
卢慧君
蔡敦波
钱岭
黄智国
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
China Mobile Communications Group Co Ltd
China Mobile Suzhou Software Technology Co Ltd
Original Assignee
China Mobile Communications Group Co Ltd
China Mobile Suzhou Software Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by China Mobile Communications Group Co Ltd, China Mobile Suzhou Software Technology Co Ltd filed Critical China Mobile Communications Group Co Ltd
Priority to CN202110945149.7A priority Critical patent/CN115938385A/en
Priority to PCT/CN2022/112831 priority patent/WO2023020500A1/en
Publication of CN115938385A publication Critical patent/CN115938385A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00 Processing of the speech or voice signal to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/02 Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L21/0272 Voice signal separating
    • G10L21/028 Voice signal separating using properties of sound source

Abstract

The embodiments of the present application disclose a voice separation method, apparatus and storage medium. The method inputs time-domain mixed audio data and image data into a first neural network for feature fusion and outputs K first feature maps; inputs the frequency-domain spectrogram into a second neural network for feature separation and outputs K second feature maps; obtains K spectrogram masks based on the K first feature maps and the K second feature maps; and finally obtains K separated independent audio data based on the K spectrogram masks and the spectrogram. In this way, during voice separation, the first neural network is introduced for multi-perception feature extraction to enhance the voice features and obtain the K first feature maps, and the second neural network is introduced to perform K-component separation on the spectrogram of the mixed audio data and obtain the K second feature maps; performing spectrogram-mask prediction with both the first and second feature maps improves prediction accuracy, thereby achieving effective separation of the mixed audio data.

Description

Voice separation method and device and storage medium
Technical Field
The present application relates to audio and video signal processing technologies, and in particular, to a method and an apparatus for separating voices, and a storage medium.
Background
At present, video playing websites can automatically generate subtitles for movies, TV shows, short videos and other content based on automatic subtitle generation technology, and the generated subtitles are generally of good quality. In practice, however, the subtitles produced by automatic subtitle generation contain garbled, erroneous text in scenes where multiple speakers appear. In such cases the mixed voices need to be separated before the subtitles are regenerated, which raises the problem of multi-speaker voice separation. Multi-speaker voice separation is a classical problem in the field of speech signal processing and still requires continued research and improvement.
Disclosure of Invention
In order to solve the foregoing technical problem, embodiments of the present application are intended to provide a voice separation method, apparatus, and storage medium.
The technical scheme of the application is realized as follows:
in a first aspect, a speech separation method is provided, including:
acquiring original video data;
inputting the original video data into a voice separation model for voice separation processing to obtain K independent audio data; wherein K is a positive integer, and the voice separation processing comprises:
extracting mixed audio data and image data from the original video data;
inputting the mixed audio data and the image data into a first neural network for feature fusion, and outputting K first feature maps;
carrying out short-time Fourier transform on the mixed audio data to obtain a spectrogram;
inputting the spectrogram into a second neural network for feature separation, and outputting K second feature maps;
obtaining K spectrogram masks based on the K first feature maps and the K second feature maps;
and obtaining K separated independent audio data based on the K speech spectrogram masks and the speech spectrogram.
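For a concrete picture of how these steps fit together, the following is a minimal end-to-end sketch in PyTorch. The module names first_net and second_net, the STFT settings and the threshold th are placeholder assumptions introduced only for this illustration; they are not fixed by the present application.
```python
import torch

def separate(mixed_audio, video_frames, first_net, second_net,
             n_fft=1022, hop=256, th=0.5):
    """Hypothetical end-to-end sketch of the voice separation processing."""
    # feature fusion: mixed audio + images -> K first feature maps (assumed shape (K, F, T))
    m = first_net(mixed_audio, video_frames)

    # short-time Fourier transform of the mixed audio -> complex spectrogram (F, T)
    window = torch.hamming_window(n_fft)
    spec = torch.stft(mixed_audio, n_fft=n_fft, hop_length=hop,
                      window=window, return_complex=True)

    # feature separation: log-magnitude spectrogram -> K second feature maps (K, F, T)
    log_mag = torch.log(spec.abs() + 1e-8)
    s = second_net(log_mag.unsqueeze(0).unsqueeze(0)).squeeze(0)

    # channel-wise product + sigmoid + threshold -> K spectrogram masks
    masks = (torch.sigmoid(m * s) >= th).float()

    # apply each mask to the mixed spectrogram and invert with the ISTFT
    return [torch.istft(spec * masks[k], n_fft=n_fft, hop_length=hop, window=window)
            for k in range(masks.shape[0])]
```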
In the foregoing solution, the feature fusion includes:
converting and cutting an image in the image data based on a preset image size to obtain a multi-frame first image;
performing three-dimensional convolution operation and pooling operation on the multiple frames of first images, and extracting a third feature map corresponding to the image data;
performing convolution operation and pooling operation on the mixed audio data, and extracting a fourth feature map corresponding to the mixed audio data;
inputting the third characteristic diagram and the fourth characteristic diagram into a splicing subnet for splicing according to channels to obtain a fifth characteristic diagram;
inputting the fifth feature map into a three-dimensional residual sub-network, and outputting the K first feature maps.
In the above solution, the three-dimensional residual subnetwork includes: a first convolution layer, a second convolution layer, a third convolution layer, a fourth convolution layer, a fifth convolution layer, a sixth convolution layer and a maximum temporal-spatial pooling layer; wherein the sixth convolutional layer is a 3 × 3 × 3 convolutional layer outputting K channel feature maps;
the fifth feature map sequentially passes through the first convolution layer, the second convolution layer, the third convolution layer, the fourth convolution layer, the fifth convolution layer, the sixth convolution layer and the maximum temporal-spatial pooling layer to obtain the K first feature maps.
In the foregoing solution, the obtaining K spectrogram masks based on the K first feature maps and the K second feature maps includes: and performing product operation on the K first feature maps and the K second feature maps according to channels to obtain K spectrogram masks.
In the foregoing solution, the obtaining K separated independent audio data based on the K spectrogram masks and the spectrogram includes: performing product operation on the K spectrogram masks and the spectrogram to obtain K spectrogram; and performing short-time Fourier inverse transformation on the K spectrogram to obtain the K independent audio data.
In the foregoing solution, the voice separation processing further includes: and determining image position information of K speakers in the image data based on the K spectrogram masks.
In the above scheme, the method further comprises: acquiring N segments of video data from the audio-visual voice data set; wherein, N is an integer greater than 1, and each video data segment comprises independent audio data and image data; mixing audio and images of at least two sections of video data in the N sections of video data according to a preset multi-person conversation scene to obtain video sample data; forming a training sample set of the voice separation model by using all video sample data; and training the voice separation model by using the training sample set to obtain the trained voice separation model.
In the above scheme, the loss function of the speech separation model is the pixel-wise cross-entropy loss between the spectrogram mask predicted by the speech separation model and the real spectrogram mask.
In a second aspect, a speech separation apparatus is provided, comprising:
the device comprises an acquisition unit, a voice separation module and a processing unit, wherein the acquisition unit is used for acquiring original video data and inputting the original video data into a voice separation model;
the voice separation model is used for carrying out voice separation processing on the original video data to obtain K independent audio data; wherein K is a positive integer, and the voice separation processing comprises:
extracting mixed audio data and image data from the original video data;
inputting the mixed audio data and the image data into a first neural network for feature fusion, and outputting K first feature maps;
carrying out short-time Fourier transform on the mixed audio data to obtain a spectrogram;
inputting the spectrogram into a second neural network for feature separation, and outputting K second feature maps;
obtaining K spectrogram masks based on the K first feature maps and the K second feature maps;
and obtaining K separated independent audio data based on the K speech spectrogram masks and the speech spectrogram.
In a third aspect, there is provided a speech separation apparatus, including: a processor and a memory configured to store a computer program capable of running on the processor,
wherein the processor is configured to perform the steps of the aforementioned method when running the computer program.
For example, the apparatus may be an electronic device with a voice separation function, or may be a chip applied to the electronic device. In this application, the apparatus may implement the functions of the multiple units by means of either software or hardware or a combination of software and hardware, so that the apparatus may perform the voice separation method provided in any one of the above first aspects.
In a fourth aspect, a computer-readable storage medium is provided, on which a computer program is stored, wherein the computer program, when executed by a processor, implements the steps of the aforementioned method.
The embodiments of the present application provide a voice separation method, apparatus and storage medium. The method inputs original video data into a voice separation model for voice separation processing; specifically, it inputs time-domain mixed audio data and image data into a first neural network for feature fusion and outputs K first feature maps; inputs the frequency-domain spectrogram into a second neural network for feature separation and outputs K second feature maps; obtains K spectrogram masks based on the K first feature maps and the K second feature maps; and finally obtains K separated independent audio data based on the K spectrogram masks and the spectrogram. In this way, during voice separation, the first neural network is introduced for multi-perception feature extraction to enhance the voice features and obtain the K first feature maps, and the second neural network is introduced to perform K-component separation on the spectrogram of the mixed audio data and obtain the K second feature maps; performing spectrogram-mask prediction with both the first and second feature maps improves prediction accuracy, thereby achieving effective separation of the mixed audio data.
Drawings
FIG. 1 is a first flowchart of a speech separation method according to an embodiment of the present application;
FIG. 2 is a schematic flow chart of a feature fusion method in an embodiment of the present application;
fig. 3 is a schematic structural diagram of a UNET network in an embodiment of the present application;
FIG. 4 is a schematic diagram illustrating a structure of a speech separation model according to an embodiment of the present application;
FIG. 5 is a second flowchart of a speech separation method according to an embodiment of the present application;
FIG. 6 is a spectrogram of mixed audio data according to an embodiment of the present application;
FIG. 7 is a spectrogram of first independent audio data in an embodiment of the present application;
FIG. 8 is a spectrogram of second independent audio data according to an embodiment of the present application;
FIG. 9 is a schematic diagram of a first speaker tracking localization in an embodiment of the present application;
FIG. 10 is a schematic diagram of a second exemplary embodiment of speaker tracking localization;
FIG. 11 is a flowchart illustrating a method for constructing and training a speech separation model according to an embodiment of the present application;
FIG. 12 is a schematic diagram of a first component structure of a speech separation apparatus according to an embodiment of the present application;
fig. 13 is a schematic diagram of a second component structure of the speech separation apparatus in the embodiment of the present application.
Detailed Description
So that the manner in which the features and elements of the present embodiments can be understood in detail, a more particular description of the embodiments, briefly summarized above, may be had by reference to embodiments, some of which are illustrated in the appended drawings.
For the problem of multi-speaker voice separation, the early mainstream approach was computational auditory scene analysis; with the introduction of deep learning, voice separation methods based on deep learning have achieved better results. The embodiments of the present application provide a voice separation method to address the multi-speaker voice separation problem, which is described below.
Fig. 1 is a first flowchart of a speech separation method in an embodiment of the present application, and as shown in fig. 1, the method may specifically include:
step 101: acquiring original video data;
here, the original video data source may be a video audio capture device, a video database, or a video playing website.
Step 102: inputting original video data into a voice separation model to perform voice separation processing to obtain K independent audio data;
here, the voice separation model has a function of separating mixed voice, and the voice separation model is an end-to-end model, and can directly input original video data and output separated K independent audio data, where K is a positive integer.
Specifically, the voice separation processing of the voice separation model includes:
step 1021: extracting mixed audio data and image data from original video data;
step 1022: inputting the mixed audio data and the image data into a first neural network for feature fusion, and outputting K first feature maps;
here, the first neural network is used for multi-perception feature extraction, and when audio feature extraction is performed, video image features are introduced to enhance voice features, so that the first feature map includes time features. The voice separation model combined with the video image characteristics can improve the performance of the voice separation algorithm.
Illustratively, as shown in fig. 2, the feature fusion method may specifically include:
step 201: converting and cutting an image in the image data based on a preset image size to obtain a plurality of frames of first images;
step 202: performing three-dimensional convolution operation and pooling operation on multiple frames of first images, and extracting a third feature map corresponding to image data;
step 203: performing convolution operation and pooling operation on the mixed audio data, and extracting a fourth feature map corresponding to the mixed audio data;
step 204: inputting the third characteristic diagram and the fourth characteristic diagram into a splicing sub-network for splicing according to channels to obtain a fifth characteristic diagram;
step 205: and inputting the fifth feature map into a three-dimensional residual sub-network, and outputting K first feature maps.
In practice, the three-dimensional residual sub-network may be a Resnet network of a different depth, for example a Resnet with a depth of 18, 34, 50, 101 or 152. The deeper the residual sub-network, the stronger its ability to extract features, so the depth can be selected according to the required voice separation performance.
In some embodiments, a three-dimensional residual subnetwork comprises: a first convolution layer, a second convolution layer, a third convolution layer, a fourth convolution layer, a fifth convolution layer, a sixth convolution layer and a maximum temporal-spatial pooling layer; wherein, the sixth convolution layer is a 3 × 3 × 3 convolution layer with the output of K channel feature maps;
the fifth characteristic diagram sequentially passes through the first convolution layer, the second convolution layer, the third convolution layer, the fourth convolution layer, the fifth convolution layer, the sixth convolution layer and the time-space maximum pooling layer to obtain K first characteristic diagrams.
Table 1. Resnet network of depth 18 (the layer configuration table is reproduced as an image in the original document)
It can be understood that the embodiment of the present application makes the following modifications to the existing Resnet network: the final average pooling layer and fully connected layer are deleted, a 3 × 3 × 3 convolutional layer whose output is K channel feature maps is added, and a spatio-temporal max pooling layer is added.
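A sketch of these modifications is shown below, using torchvision's r3d_18 as a stand-in 3D Resnet backbone; the choice of backbone, the adapted stem and the axes pooled by the spatio-temporal max pooling are all assumptions made for illustration.
```python
import torch
import torch.nn as nn
from torchvision.models.video import r3d_18

class MultiPerceptionResNet3D(nn.Module):
    """Sketch of the modified three-dimensional residual sub-network."""
    def __init__(self, k, in_channels=32):
        super().__init__()
        base = r3d_18()  # stand-in 3D Resnet-18 backbone (an assumption)
        # adapt the stem to the fused multi-channel fifth feature map
        base.stem[0] = nn.Conv3d(in_channels, 64, kernel_size=(3, 7, 7),
                                 stride=(1, 2, 2), padding=(1, 3, 3), bias=False)
        # keep the convolutional trunk; the final average pooling and fully
        # connected layers of the original Resnet are dropped
        self.trunk = nn.Sequential(base.stem, base.layer1, base.layer2,
                                   base.layer3, base.layer4)
        # added 3x3x3 convolutional layer whose output has K channel feature maps
        self.head = nn.Conv3d(512, k, kernel_size=3, padding=1)

    def forward(self, fused):
        x = self.head(self.trunk(fused))      # (B, K, T, H, W)
        # added spatio-temporal max pooling; here the spatial axes are pooled and the
        # temporal axis is kept so that each of the K channels retains time information
        # (which axes are pooled is an interpretation, not stated explicitly)
        return torch.amax(x, dim=(3, 4))      # (B, K, T)
```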
Step 1023: performing short-time Fourier transform on the mixed audio data to obtain a spectrogram;
the speech Spectrogram (Sonogram/Spectrogram) is prepared by performing short-time Fourier transform on a speech signal, and then representing the amplitude by using color with the horizontal axis as time and the vertical axis as frequency. The frequency and amplitude of the signal are shown in a graph along with the change of time, so the graph is also called a time-frequency spectrogram.
In some embodiments, the short-time fourier transforming the mixed audio data to obtain a spectrogram, includes: performing down-sampling operation on the mixed audio data to obtain down-sampled audio data; carrying out short-time Fourier transform on the down-sampled audio data to obtain a spectrogram; and performing log transformation on the spectrogram to obtain a log-transformed spectrogram which is used as the input of the second neural network.
Illustratively, the mixed speech data is first down-sampled, e.g. to a sampling rate of 11 kHz, with a mixed speech length of about 6 seconds. The down-sampled mixed speech is then converted from the time domain to the frequency domain to obtain a spectrogram; for example, the conversion is a Short-Time Fourier Transform (STFT) using a Hamming window of size 1022 and a step size of 256. A log transformation is then applied to the spectrogram, which regularizes it. Finally, the log-transformed spectrogram is input into the second neural network for K-component separation.
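A sketch of this preprocessing chain is shown below. The 11 kHz sampling rate, the 1022-point Hamming window and the step size of 256 are the values quoted above; the use of torchaudio for resampling and the small epsilon in the log are implementation assumptions.
```python
import torch
import torchaudio.functional as AF

def mixed_audio_to_log_spectrogram(waveform, orig_sr, target_sr=11000,
                                   n_fft=1022, hop=256, eps=1e-8):
    """Down-sample, STFT with a 1022-point Hamming window and step 256, then log-scale."""
    # down-sampling operation on the mixed audio data
    wav = AF.resample(waveform, orig_freq=orig_sr, new_freq=target_sr)
    # short-time Fourier transform of the down-sampled audio
    window = torch.hamming_window(n_fft)
    spec = torch.stft(wav, n_fft=n_fft, hop_length=hop,
                      window=window, return_complex=True)
    # log transformation regularizes the magnitude spectrogram before it is fed to
    # the second neural network; the complex spectrogram is kept for the later ISTFT
    log_spectrogram = torch.log(spec.abs() + eps)
    return log_spectrogram, spec
```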
Step 1024: inputting the spectrogram into a second neural network for feature separation, and outputting K second feature maps;
here, the second neural network is configured to perform K component separation on the spectrogram of the mixed speech data, convert the input mixed speech data into a spectrogram of a frequency domain, and output K second feature maps of the mixed speech data.
Illustratively, the second neural network is a UNET network, which performs K-component separation on the spectrogram of the mixed speech.
Fig. 3 is a schematic structural diagram of the UNET network in the embodiment of the present application. As shown in Fig. 3, the UNET network 30 is a deep learning architecture that generally comprises an encoding process and a decoding process. The UNET network 30 includes an input module, a down-sampling module, an up-sampling module and an output module; the input module and the down-sampling module implement the encoding process, and the up-sampling module and the output module implement the decoding process. The spectrogram is input into the UNET network 30 for K-component separation, which outputs the K second feature maps, i.e. S_1, S_2, …, S_K.
In practical applications, the UNET network 30 has, at each layer, a skip connection between the down-sampling module and the up-sampling module, or a Squeeze-and-Excitation (SE) connection.
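The following compact UNET-style network illustrates the encode/decode structure with skip connections and a K-channel output described above; the depth and channel widths are assumptions and much smaller than a practical separator.
```python
import torch
import torch.nn as nn

def conv_block(c_in, c_out):
    return nn.Sequential(
        nn.Conv2d(c_in, c_out, 3, padding=1), nn.BatchNorm2d(c_out), nn.ReLU(inplace=True),
        nn.Conv2d(c_out, c_out, 3, padding=1), nn.BatchNorm2d(c_out), nn.ReLU(inplace=True),
    )

class MiniUNet(nn.Module):
    """Toy UNET-style second neural network: spectrogram in, K second feature maps out."""
    def __init__(self, k, base=32):
        super().__init__()
        self.enc1 = conv_block(1, base)
        self.enc2 = conv_block(base, base * 2)
        self.enc3 = conv_block(base * 2, base * 4)        # bottom of the encoder
        self.pool = nn.MaxPool2d(2)
        self.up2 = nn.ConvTranspose2d(base * 4, base * 2, kernel_size=2, stride=2)
        self.dec2 = conv_block(base * 4, base * 2)
        self.up1 = nn.ConvTranspose2d(base * 2, base, kernel_size=2, stride=2)
        self.dec1 = conv_block(base * 2, base)
        self.out = nn.Conv2d(base, k, kernel_size=1)      # K channels: S_1 ... S_K

    def forward(self, x):
        # x: (B, 1, F, T) log spectrogram; F and T must be divisible by 4 at this toy depth
        e1 = self.enc1(x)
        e2 = self.enc2(self.pool(e1))
        e3 = self.enc3(self.pool(e2))
        d2 = self.dec2(torch.cat([self.up2(e3), e2], dim=1))   # skip connection
        d1 = self.dec1(torch.cat([self.up1(d2), e1], dim=1))   # skip connection
        return self.out(d1)                                    # (B, K, F, T)
```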
Step 1025: obtaining K spectrogram masks based on the K first feature maps and the K second feature maps;
illustratively, the obtaining K spectrogram masks based on the K first feature maps and the K second feature maps includes: performing product operation on the K first feature maps and the K second feature maps according to channels to obtain K spectrogram masks;
specifically, the formula for calculating the Spectrogram mask (Sonogram/spectrum mask) is as follows:
b_k = σ(m_k ⊙ S_k) ≥ th, k = 1, 2, …, K
where m_k is the k-th first feature map, S_k is the k-th second feature map, ⊙ denotes the channel-wise product, σ is the sigmoid function, and th is a threshold; invalid information is filtered out to obtain the valid information, namely the spectrogram mask b_k.
In practical applications, the spectrogram mask may be a binary mask or a ratio mask.
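The mask prediction of step 1025 can be written directly as a channel-wise product followed by a sigmoid and an optional threshold; whether a binary or a ratio mask is kept is the design choice noted above, and the default threshold value in this sketch is an assumption.
```python
import torch

def predict_masks(m, s, th=0.5, binary=True):
    """Channel-wise product of first and second feature maps -> K spectrogram masks.

    m: first feature maps, broadcastable to the shape of s (e.g. (K, F, T))
    s: second feature maps, shape (K, F, T)
    th: threshold used to filter out invalid information (value is an assumption)
    """
    activation = torch.sigmoid(m * s)
    if binary:
        return (activation >= th).float()   # binary mask b_k
    return activation                       # ratio mask
```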
Step 1026: and obtaining the K separated independent audio data based on the K speech spectrogram masks and the speech spectrogram.
In some embodiments, deriving the K separated independent audio data based on the K spectrogram masks and the spectrogram, includes: performing product operation on the K spectrogram masks and the spectrogram respectively to obtain K spectrogram; and performing short-time Fourier inverse transformation on the K spectrogram to obtain K independent audio data.
For example, the frequency-domain to time-domain conversion is an Inverse Short-Time Fourier Transform (ISTFT).
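Given the complex spectrogram of the mixture and the K masks, step 1026 is a per-mask product followed by the ISTFT; a sketch under the same STFT settings assumed earlier:
```python
import torch

def masks_to_waveforms(mixed_spec, masks, n_fft=1022, hop=256):
    """mixed_spec: complex STFT of the mixture, shape (F, T); masks: (K, F, T)."""
    window = torch.hamming_window(n_fft)
    waveforms = []
    for k in range(masks.shape[0]):
        sep_spec = mixed_spec * masks[k]      # k-th separated spectrogram
        # inverse short-time Fourier transform back to a time-domain waveform
        waveforms.append(torch.istft(sep_spec, n_fft=n_fft, hop_length=hop, window=window))
    return torch.stack(waveforms)             # (K, samples)
```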
Any of the voice separation methods provided by the embodiments of the present application can be applied to subtitle generation technology: the corresponding text information is extracted from the K separated independent audio data, and the text information is added to the images according to the temporal order of the speech.
By adopting the voice separation model, when voice separation is carried out, the first neural network is introduced for multi-perception feature extraction to enhance voice features to obtain K first feature maps, the second neural network is introduced to carry out K component separation on a spectrogram of mixed voice data to obtain K second feature maps, and the first feature map and the second feature map are utilized to carry out spectrogram mask prediction, so that the prediction accuracy can be improved, and the effective separation of mixed audio data is realized.
Based on the foregoing embodiment, the speech separation model is illustrated. Fig. 4 is a schematic structural diagram of the speech separation model in the embodiment of the present application. As shown in Fig. 4, the speech separation model 40 may be divided into a first neural network 401, a second neural network 402 and a product operation part 403.
The first neural network 401 has two input data paths, the mixed audio data and the image data. Each image in the image data is uniformly resized to 256 × 256 and randomly cropped to 224 × 224, and a series of 3D convolution and pooling operations reduce the temporal sampling rate; the mixed audio data likewise undergoes convolution and pooling operations so that the sampling rate of the speech features is kept consistent with that of the image features, allowing the speech and image features to be spliced by channel in a splicing layer.
For example, if the third feature map of the image data is C1 × H × W and the fourth feature map of the mixed audio data is C2 × H × W, the fifth feature map obtained after channel splicing has size (C1 + C2) × H × W. In other words, the shallow features of the audio and the image are first extracted by convolution and pooling operations and then fused.
The fifth feature map output by the splicing layer is input into the three-dimensional residual sub-network for multi-feature perception. For example, the three-dimensional residual sub-network produces first feature maps of size 256 × K that contain time information, denoted m_k, k = 1, 2, …, K. This embodiment modifies the existing Resnet network as follows: the final average pooling layer and fully connected layer are deleted, a 3 × 3 × 3 convolutional layer whose output is K channel feature maps is added, and a spatio-temporal max pooling layer is added.
The second neural network 402 has a single input data path. The mixed speech data is first down-sampled, e.g. to a sampling rate of 11 kHz, with a mixed speech length of about 6 seconds. An STFT is then applied to the mixed speech to obtain a spectrogram, and a log transformation regularizes the spectrogram. Finally, the log-transformed spectrogram is input into the second neural network 402 for K-component separation, yielding K second feature maps of size 256 × K, denoted S_k, k = 1, 2, …, K.
Illustratively, the second neural network is a UNET network.
The product operation part 403 takes the first feature maps m_k output by the first neural network 401 and the second feature maps S_k output by the second neural network 402, performs a channel-wise product, and predicts the K spectrogram masks. Illustratively, the spectrogram mask is again given by
b_k = σ(m_k ⊙ S_k) ≥ th, k = 1, 2, …, K.
Finally, a scalar product is taken between each of the K spectrogram masks and the spectrogram input to the second neural network 402, and the K separated independent audio data are obtained through the ISTFT.
In some embodiments, the speech separation model is also capable of locating the pixel position of the speaker in the image. Fig. 5 is a second flow chart of the speech separation method in the embodiment of the present application; as shown in Fig. 5, the speech separation processing performed by the speech separation model may specifically include:
step 501: extracting mixed audio data and image data from original video data;
step 502: inputting the mixed audio data and the image data into a first neural network for feature fusion, and outputting K first feature maps;
the first neural network is used for multi-perception feature extraction to enhance voice features, video image features are introduced when audio feature extraction is carried out, and the performance of a voice separation algorithm can be improved by combining a voice separation model of the video image features.
Step 503: performing short-time Fourier transform on the mixed audio data to obtain a spectrogram;
in some embodiments, the short-time fourier transforming the mixed audio data to obtain a spectrogram, includes: performing down-sampling operation on the mixed audio data to obtain down-sampled audio data; performing short-time Fourier transform on the down-sampled audio data to obtain a spectrogram; and performing log transformation on the spectrogram to obtain a log-transformed spectrogram which is used as the input of the second neural network.
Step 504: inputting the spectrogram into a second neural network for feature separation, and outputting K second feature maps;
here, the second neural network is configured to perform K component separation on the spectrogram of the mixed speech data, convert the input mixed speech data into a spectrogram of a frequency domain, and output K second feature maps of the mixed speech data.
Illustratively, the second neural network is a UNET network. The UNET network has a K component separation function on a spectrogram of mixed voice.
Step 505: obtaining K spectrogram masks based on the K first feature maps and the K second feature maps;
illustratively, the obtaining K spectrogram masks based on the K first feature maps and the K second feature maps includes: and performing product operation on the K first feature maps and the K second feature maps according to channels to obtain K spectrogram masks.
Step 506: obtaining K separated independent audio data based on the K speech spectrogram masks and the speech spectrograms;
in some embodiments, deriving the separated K independent audio data based on the K spectrogram masks and the spectrogram, includes: performing product operation on the K spectrogram masks and the spectrogram respectively to obtain K spectrogram; and performing short-time Fourier inverse transformation on the K spectrogram to obtain K independent audio data.
Fig. 6 is a spectrogram of mixed audio data in which two human voices are mixed, in an embodiment of the present application; Fig. 7 is the spectrogram of the first independent audio data; and Fig. 8 is the spectrogram of the second independent audio data.
Step 507: and determining image position information of K speakers in the image data based on the K spectrogram masks.
Illustratively, the speech spectrogram mask is multiplied by each frame of image in the image data to determine the image position information of the speaker in each frame of image. The image position information is used for tracking and positioning the position of the speaker in the image.
FIG. 9 is a schematic diagram of a first speaker tracking and locating in the present application, and FIG. 10 is a schematic diagram of a second speaker tracking and locating in the present application. The hatched area in fig. 9 and 10 is the image position information of the speaker.
By adopting the above technical solution, voice separation not only achieves effective separation of the mixed audio data but also locates and tracks the pixel position of each speaker in the video image, helping the user match each voice to its speaker.
On the basis of the foregoing embodiment, the method further includes constructing and training a speech separation model, fig. 11 is a schematic flow diagram of the speech separation model construction and training method in the embodiment of the present application, and as shown in fig. 11, the speech separation model construction and training method may specifically include:
step 1101: constructing a voice separation model and a training sample set;
illustratively, taking fig. 4 as an example, an end-to-end speech separation model is constructed by using the first neural network, the second neural network, the product operation part, the STFT and the ISTFT.
Illustratively, in some embodiments, the method further comprises: setting a K value of a voice separation model according to the voice separation demand quantity and the interference audio type; wherein the interfering audio categories include at least noise, silence, and other non-human audio.
Here, the speech separation requirement number refers to the number of speakers that the speech separation model can maximally separate. For example, the number of speech separation requirements is 10, and K may be set to 16, while taking into account the effects of noise, silence, and other non-human audio.
Here, the training sample set may be one or more existing video data sets.
Illustratively, in some embodiments, constructing a training sample set comprises:
acquiring N segments of video data from the audio-visual voice data set; wherein, N is an integer greater than 1, and each video data segment comprises independent audio data and image data;
mixing audio and images of at least two sections of video data in the N sections of video data according to a preset multi-person conversation scene to obtain video sample data;
and forming a training sample set by using all video sample data.
The audio visual speech data set (AVSpeech) contains about 4700 hours of video segments, each segment being 3-10 seconds long. These video segments cover approximately 150000 speakers, with clear speech and no interference. The age, gender, language and video capture perspective of the speaker are different, and only one person appears in each video clip.
When the training sample set is constructed, N video segments are selected from the data set and denoted {I_n, S_n}, n = 1, …, N, where I_n represents the image frames of the n-th video segment and S_n represents the speech of the n-th video segment. The mixed speech is generated by linear mixing, i.e.
S_mix = S_1 + S_2 + … + S_N.
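The linear mixing above amounts to summing the selected independent waveforms; a minimal sketch follows (the signals are assumed to have been trimmed or padded to a common length beforehand).
```python
import torch

def make_mixture(independent_audios):
    """Linear mixing of N independent speech signals S_1 ... S_N.

    independent_audios: list of N waveforms of equal length (trimming/padding to a
    common length is assumed to have been done beforehand).
    """
    stacked = torch.stack(independent_audios)   # (N, samples)
    return stacked.sum(dim=0)                   # S_mix = S_1 + S_2 + ... + S_N
```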
Illustratively, the multi-person conversation scenarios are divided into the following cases:
1) the video image contains only one person while the mixed speech contains two speakers, corresponding to N = 2 in the above formula;
2) the (synthesized) video contains two people, with the corresponding mixed speech, corresponding to N = 2 in the above formula;
3) the (synthesized) video contains three people, with the corresponding mixed speech, corresponding to N = 3 in the above formula;
4) the (synthesized) video contains five people, with the corresponding mixed speech, corresponding to N = 5 in the above formula.
A training sample set containing the above four multi-person conversation scenarios is constructed, and at the same time the real spectrogram mask M_n of the audio spectrogram in the n-th video is calculated, which marks whether the independent audio has the largest value in each time-frequency unit of the mixed audio, namely
M_n(u, v) = [S_n(u, v) ≥ S_m(u, v)], for any m = 1, …, N,
where (u, v) is a coordinate position in the spectrogram and S denotes a spectrogram.
Specifically, S_n(u, v) is the spectrogram of the independent audio in the n-th video and S_m(u, v) is the spectrogram of each independent audio in the mixed audio; binarization is realized by comparing the same positions of the two spectrogram matrices: if S_n(u, v) is the maximum, the position is set to 1, otherwise to 0. The real spectrogram mask M_n of the independent audio in the n-th video segment may also be denoted b_truth.
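The real spectrogram mask M_n can be computed by an element-wise comparison of the independent audios' spectrograms; the sketch below assumes magnitude spectrograms stacked into a single tensor.
```python
import torch

def ground_truth_masks(independent_specs):
    """Compute the real spectrogram masks M_1 ... M_N (b_truth).

    independent_specs: magnitude spectrograms of the N independent audios, shape (N, F, T).
    M_n(u, v) = 1 where the n-th source has the largest value in that time-frequency
    unit of the mixture, otherwise 0.
    """
    dominant = independent_specs.argmax(dim=0)    # (F, T): index of the loudest source
    n = independent_specs.shape[0]
    return torch.stack([(dominant == i).float() for i in range(n)])   # (N, F, T)
```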
Step 1102: and training the voice separation model by using the training sample set to obtain the trained voice separation model.
Illustratively, the loss function of the speech separation model is the pixel-wise cross-entropy loss between the spectrogram mask predicted by the speech separation model and the real spectrogram mask, i.e. the loss function is
L = BCE(b_k, b_truth),
where b_k is the spectrogram mask predicted by the speech separation model and b_truth is the real spectrogram mask.
In this way a new voice separation model is obtained, which can effectively separate the mixed audio data of video containing multi-person conversation and obtain the independent audio data of each person. Applied to subtitle generation technology, it can improve subtitle accuracy and reduce garbled output.
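The pixel-wise cross-entropy loss described above corresponds to standard binary cross entropy between the predicted mask probabilities (before thresholding) and the real masks; a sketch follows, with the mean reduction being an assumption.
```python
import torch
import torch.nn.functional as F

def mask_loss(predicted_masks, true_masks):
    """Pixel-wise binary cross entropy between predicted and real spectrogram masks.

    predicted_masks: sigmoid outputs in [0, 1] (the pre-threshold ratio form of b_k)
    true_masks: binary real masks b_truth, same shape
    """
    return F.binary_cross_entropy(predicted_masks, true_masks, reduction='mean')
```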
In order to implement the method of the embodiment of the present application, based on the same inventive concept, an embodiment of the present application further provides a speech separation apparatus, as shown in fig. 12, where the apparatus 120 includes:
an obtaining unit 1201, configured to obtain original video data and input the original video data to the voice separation model;
a voice separation model 1202, configured to perform voice separation processing on original video data to obtain K independent audio data; wherein the voice separation process includes:
extracting mixed audio data and image data from original video data;
inputting the mixed audio data and the image data into a first neural network for feature fusion, and outputting K first feature maps;
performing short-time Fourier transform on the mixed audio data to obtain a spectrogram;
inputting the spectrogram into a second neural network for feature separation, and outputting K second feature maps;
obtaining K spectrogram masks based on the K first feature maps and the K second feature maps;
and obtaining the K separated independent audio data based on the K speech spectrogram masks and the speech spectrogram.
In some embodiments, feature fusion comprises: converting and cutting an image in the image data based on a preset image size to obtain a multi-frame first image; performing three-dimensional convolution operation and pooling operation on the multiple frames of first images, and extracting a third feature map corresponding to image data; performing convolution operation and pooling operation on the mixed audio data, and extracting a fourth feature map corresponding to the mixed audio data; inputting the third characteristic diagram and the fourth characteristic diagram into a splicing subnet for splicing according to channels to obtain a fifth characteristic diagram; and inputting the fifth feature map into a three-dimensional residual sub-network, and outputting K first feature maps.
In some embodiments, the three-dimensional residual subnetwork comprises: a first convolution layer, a second convolution layer, a third convolution layer, a fourth convolution layer, a fifth convolution layer, a sixth convolution layer and a maximum temporal-spatial pooling layer; wherein, the sixth convolution layer is a 3 × 3 × 3 convolution layer with the output of K channel feature maps;
the fifth feature map sequentially passes through the first convolution layer, the second convolution layer, the third convolution layer, the fourth convolution layer, the fifth convolution layer, the sixth convolution layer and the maximum temporal-spatial pooling layer to obtain K first feature maps.
In some embodiments, the deriving K spectrogram masks based on the K first feature maps and the K second feature maps includes: and performing product operation on the K first feature maps and the K second feature maps according to channels to obtain K spectrogram masks.
In some embodiments, the short-time fourier transforming the mixed audio data to obtain a spectrogram, includes:
performing down-sampling operation on the mixed audio data to obtain down-sampled audio data;
performing short-time Fourier transform on the down-sampled audio data to obtain a spectrogram;
and performing log transformation on the spectrogram to obtain a log-transformed spectrogram which is used as the input of the second neural network.
In some embodiments, the second neural network is a UNET network.
In some embodiments, deriving the separated K independent audio data based on the K spectrogram masks and the spectrogram, includes:
performing product operation on the K spectrogram masks and the spectrogram respectively to obtain K spectrogram;
and performing short-time Fourier inverse transformation on the K spectrogram to obtain K independent audio data.
In some embodiments, the speech separation process further comprises: and determining image position information of K speakers in the image data based on the K speech spectrogram masks.
In some embodiments, the apparatus 120 further comprises a construction unit (not shown in fig. 12) for constructing the speech separation model, and constructing the training sample set; and training the voice separation model by using the training sample set until the loss function meets the convergence condition to obtain the trained voice separation model.
In some embodiments, a construction unit is configured to obtain N pieces of video data from a set of audio-visual speech data; wherein, N is an integer greater than 1, and each video data segment comprises independent audio data and image data; mixing audio and images of at least two sections of video data in the N sections of video data according to a preset multi-person conversation scene to obtain video sample data; and forming a training sample set by using all video sample data.
In some embodiments, the constructing unit is further configured to calculate a true spectrogram mask corresponding to the audio data in each piece of video sample data; the loss function is the pixel cross entropy loss of the predicted spectrogram mask and the real spectrogram mask obtained by the voice separation model.
In some embodiments, the constructing unit is further configured to set a K value of the speech separation model according to the number of speech separation requirements and the type of the interfering audio; wherein the interfering audio classes include at least noise, silence, and other non-human audio.
The device 120 may be an electronic device having a voice separation function, or may be a chip applied to an electronic device. In this application, the apparatus may implement the functions of the multiple units through either software or hardware or a combination of software and hardware, so that the apparatus can perform the voice separation method provided in any of the above embodiments.
By adopting the voice separation device, when voice separation is carried out, the first neural network is introduced for multi-perception feature extraction to enhance voice features to obtain K first feature maps, the second neural network is introduced for carrying out K component separation on a spectrogram of mixed voice data to obtain K second feature maps, and prediction accuracy can be improved by utilizing the first feature maps and the second feature maps to carry out spectrogram mask prediction, so that effective separation of mixed audio data is realized.
Based on the hardware implementation of each unit in the foregoing voice separating apparatus, an embodiment of the present application further provides another voice separating apparatus, as shown in fig. 13, where the apparatus 130 includes: a processor 1301 and a memory 1302 configured to store a computer program operable on the processor;
wherein the processor 1301 is configured to execute the method steps in the preceding embodiments when running the computer program.
Of course, in actual practice, as shown in fig. 13, the various components of the device 130 are coupled together by a bus system 1303. It is understood that the bus system 1303 is used to enable connection communication between these components. The bus system 1303 includes a power bus, a control bus, and a status signal bus, in addition to the data bus. But for clarity of illustration the various buses are labeled in figure 13 as the bus system 1303.
In practical applications, the processor may be at least one of an Application Specific Integrated Circuit (ASIC), a Digital Signal Processing Device (DSPD), a Programmable Logic Device (PLD), a Field Programmable Gate Array (FPGA), a controller, a microcontroller, and a microprocessor. It is understood that the electronic devices for implementing the above processor functions may be other devices, and the embodiments of the present application are not limited in particular.
The Memory may be a volatile Memory (volatile Memory), such as a Random-Access Memory (RAM); or a non-volatile Memory (non-volatile Memory), such as a Read-Only Memory (ROM), a flash Memory (flash Memory), a Hard Disk (HDD), or a Solid-State Drive (SSD); or a combination of the above types of memories and provides instructions and data to the processor.
In practical applications, the device may be an electronic device with a voice separation function, or may be a chip applied to an electronic device. In this application, the apparatus may implement the functions of the multiple units through either software or hardware or a combination of software and hardware, so that the apparatus can perform the voice separation method provided in any of the above embodiments. And the technical effects of the technical solutions of the apparatus can refer to the technical effects of the corresponding technical solutions in the voice separation method, which is not described in detail herein.
In an exemplary embodiment, the present application further provides a computer readable storage medium, such as a memory including a computer program, which is executable by a processor of a speech separation apparatus to perform the steps of the foregoing method.
Embodiments of the present application also provide a computer program product comprising computer program instructions.
Optionally, the computer program product may be applied to the speech separation apparatus in the embodiment of the present application, and the computer program instructions enable the computer to execute a corresponding process implemented by the speech separation apparatus in each method in the embodiment of the present application, which is not described herein again for brevity.
The embodiment of the application also provides a computer program.
Optionally, the computer program may be applied to the speech separation apparatus in the embodiment of the present application, and when the computer program runs on a computer, the computer is enabled to execute corresponding processes implemented by the speech separation apparatus in the methods in the embodiment of the present application, and for brevity, details are not described here again.
It is to be understood that the terminology used herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the application. As used in this application and the appended claims, the singular forms "a", "an", and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It should also be understood that the term "and/or" as used herein refers to and encompasses any and all possible combinations of one or more of the associated listed items. The expressions "having", "may have", "including" and "containing", or "may include" and "may contain" in this application may be used to indicate the presence of corresponding features (e.g. elements such as values, functions, operations or components) but do not exclude the presence of additional features.
It is to be understood that although the terms first, second, third, etc. may be used herein to describe various information, such information should not be limited to these terms. These terms are only used to distinguish one type of information from another, and are not necessarily used to describe a particular order or sequence. For example, first information may also be referred to as second information, and similarly, second information may also be referred to as first information, without departing from the scope of the present invention.
The technical solutions described in the embodiments of the present application can be arbitrarily combined without conflict.
In the several embodiments provided in the present application, it should be understood that the disclosed method, apparatus, and device may be implemented in other ways. The above-described embodiments are merely illustrative, and for example, the division of a unit is only one logical function division, and there may be other division ways in actual implementation, such as: multiple units or components may be combined, or may be integrated into another system, or some features may be omitted, or not implemented. In addition, the coupling, direct coupling or communication connection between the components shown or discussed may be through some interfaces, and the indirect coupling or communication connection between the devices or units may be electrical, mechanical or in other forms.
The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, that is, may be located in one place, or may be distributed on a plurality of network units; some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.
In addition, all functional units in the embodiments of the present application may be integrated into one processing unit, or each unit may be separately regarded as one unit, or two or more units may be integrated into one unit; the integrated unit can be realized in a form of hardware, or in a form of hardware plus a software functional unit.
The above description is only for the specific embodiments of the present application, but the scope of the present application is not limited thereto, and any person skilled in the art can easily conceive of the changes or substitutions within the technical scope of the present application, and shall be covered by the scope of the present application.

Claims (10)

1. A method of speech separation, the method comprising:
acquiring original video data;
inputting the original video data into a voice separation model for voice separation processing to obtain K independent audio data; wherein K is a positive integer, and the voice separation processing comprises:
extracting mixed audio data and image data from the original video data;
inputting the mixed audio data and the image data into a first neural network for feature fusion, and outputting K first feature maps;
carrying out short-time Fourier transform on the mixed audio data to obtain a spectrogram;
inputting the spectrogram into a second neural network for feature separation, and outputting K second feature maps;
obtaining K spectrogram masks based on the K first feature maps and the K second feature maps;
and obtaining K separated independent audio data based on the K speech spectrogram masks and the speech spectrogram.
2. The method of claim 1, wherein the feature fusion comprises:
converting and cutting the image in the image data based on a preset image size to obtain a plurality of frames of first images;
performing three-dimensional convolution operation and pooling operation on the multiple frames of first images, and extracting a third feature map corresponding to the image data;
performing convolution operation and pooling operation on the mixed audio data, and extracting a fourth feature map corresponding to the mixed audio data;
inputting the third characteristic diagram and the fourth characteristic diagram into a splicing sub-network for splicing according to channels to obtain a fifth characteristic diagram;
inputting the fifth feature map into a three-dimensional residual sub-network, and outputting the K first feature maps.
3. The method of claim 2, wherein the three-dimensional residual sub-network comprises: a first convolution layer, a second convolution layer, a third convolution layer, a fourth convolution layer, a fifth convolution layer, a sixth convolution layer and a maximum temporal-spatial pooling layer; wherein the sixth convolutional layer is a 3 × 3 × 3 convolutional layer outputting K channel feature maps;
the fifth feature map sequentially passes through the first convolution layer, the second convolution layer, the third convolution layer, the fourth convolution layer, the fifth convolution layer, the sixth convolution layer and the maximum temporal-spatial pooling layer to obtain the K first feature maps.
4. The method according to claim 1, wherein the deriving K spectrogram masks based on the K first feature maps and the K second feature maps comprises:
and performing product operation on the K first feature maps and the K second feature maps according to channels to obtain K spectrogram masks.
5. The method of claim 1, wherein the deriving the separated K independent audio data based on the K spectrogram masks and the spectrogram comprises:
performing product operation on the K spectrogram masks and the spectrogram respectively to obtain K spectrogram;
and performing short-time Fourier inverse transformation on the K spectrogram to obtain the K independent audio data.
6. The method of claim 1, wherein the speech separation process further comprises:
and determining image position information of K speakers in the image data based on the K spectrogram masks.
7. The method of claim 1, further comprising:
acquiring N segments of video data from the audio-visual voice data set; wherein, N is an integer greater than 1, and each video data segment comprises independent audio data and image data;
mixing audio and images of at least two sections of video data in the N sections of video data according to a preset multi-person conversation scene to obtain video sample data;
forming a training sample set of the voice separation model by using all video sample data;
and training the voice separation model by using the training sample set to obtain the trained voice separation model.
8. A speech separation apparatus, the apparatus comprising:
the device comprises an acquisition unit, a voice separation module and a processing unit, wherein the acquisition unit is used for acquiring original video data and inputting the original video data into a voice separation model;
the voice separation model is used for carrying out voice separation processing on the original video data to obtain K independent audio data; wherein K is a positive integer, and the voice separation processing comprises:
extracting mixed audio data and image data from the original video data;
inputting the mixed audio data and the image data into a first neural network for feature fusion, and outputting K first feature maps;
carrying out short-time Fourier transform on the mixed audio data to obtain a spectrogram;
inputting the spectrogram into a second neural network for feature separation, and outputting K second feature maps;
obtaining K spectrogram masks based on the K first feature maps and the K second feature maps;
and obtaining K separated independent audio data based on the K speech spectrogram masks and the speech spectrogram.
9. A speech separation apparatus, the apparatus comprising: a processor and a memory configured to store a computer program capable of running on the processor,
wherein the processor is configured to perform the steps of the method of any one of claims 1 to 7 when running the computer program.
10. A computer-readable storage medium, on which a computer program is stored which, when being executed by a processor, carries out the steps of the method according to any one of claims 1 to 7.
CN202110945149.7A 2021-08-17 2021-08-17 Voice separation method and device and storage medium Pending CN115938385A (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN202110945149.7A CN115938385A (en) 2021-08-17 2021-08-17 Voice separation method and device and storage medium
PCT/CN2022/112831 WO2023020500A1 (en) 2021-08-17 2022-08-16 Speech separation method and apparatus, and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110945149.7A CN115938385A (en) 2021-08-17 2021-08-17 Voice separation method and device and storage medium

Publications (1)

Publication Number Publication Date
CN115938385A true CN115938385A (en) 2023-04-07

Family

ID=85240084

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110945149.7A Pending CN115938385A (en) 2021-08-17 2021-08-17 Voice separation method and device and storage medium

Country Status (2)

Country Link
CN (1) CN115938385A (en)
WO (1) WO2023020500A1 (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116129931B (en) * 2023-04-14 2023-06-30 中国海洋大学 Audio-visual combined voice separation model building method and voice separation method

Family Cites Families (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10014002B2 (en) * 2016-02-16 2018-07-03 Red Pill VR, Inc. Real-time audio source separation using deep neural networks
CN110709924B (en) * 2017-11-22 2024-01-09 谷歌有限责任公司 Audio-visual speech separation
CN109326302B (en) * 2018-11-14 2022-11-08 桂林电子科技大学 Voice enhancement method based on voiceprint comparison and generation of confrontation network
CN109525787B (en) * 2018-12-13 2021-03-16 南京邮电大学 Live scene oriented real-time subtitle translation and system implementation method
US11217254B2 (en) * 2018-12-24 2022-01-04 Google Llc Targeted voice separation by speaker conditioned on spectrogram masking
JP7267034B2 (en) * 2019-02-27 2023-05-01 本田技研工業株式会社 CAPTION GENERATION DEVICE, CAPTION GENERATION METHOD, AND PROGRAM
CN112634875B (en) * 2021-03-04 2021-06-08 北京远鉴信息技术有限公司 Voice separation method, voice separation device, electronic device and storage medium
CN113035227B (en) * 2021-03-12 2022-02-11 山东大学 Multi-modal voice separation method and system

Also Published As

Publication number Publication date
WO2023020500A1 (en) 2023-02-23

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination