CN116778946A - Separation method of vocal accompaniment, network training method, device and storage medium - Google Patents


Info

Publication number
CN116778946A
CN116778946A (application CN202310575515.3A)
Authority
CN
China
Prior art keywords
audio
frequency
network
accompaniment
generator
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310575515.3A
Other languages
Chinese (zh)
Inventor
刘百云
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Alibaba China Co Ltd
Original Assignee
Alibaba China Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Alibaba China Co Ltd filed Critical Alibaba China Co Ltd
Priority to CN202310575515.3A priority Critical patent/CN116778946A/en
Publication of CN116778946A publication Critical patent/CN116778946A/en
Pending legal-status Critical Current

Classifications

    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 21/00: Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L 21/02: Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L 21/0272: Voice signal separating
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00: Computing arrangements based on biological models
    • G06N 3/02: Neural networks
    • G06N 3/04: Architecture, e.g. interconnection topology
    • G06N 3/0475: Generative networks
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00: Computing arrangements based on biological models
    • G06N 3/02: Neural networks
    • G06N 3/08: Learning methods
    • G06N 3/094: Adversarial learning

Abstract

The embodiment of the application provides a vocal accompaniment separation method, a network training method, a device and a storage medium. The method comprises the following steps: determining mixed audio to be separated, the mixed audio being formed by mixing accompaniment and human voice; generating target audio corresponding to the accompaniment from the mixed audio by using a generator in a first generation countermeasure network, the first generation countermeasure network being obtained by countermeasure training with training samples corresponding to the first generation countermeasure network; and generating target audio corresponding to the human voice from the mixed audio by using a generator in a second generation countermeasure network, the second generation countermeasure network being obtained by countermeasure training with training samples corresponding to the second generation countermeasure network. The method provided by the embodiment of the application can improve the audio quality of the finally separated human voice and accompaniment.

Description

Separation method of vocal accompaniment, network training method, device and storage medium
Technical Field
The present application relates to the field of computer technologies, and in particular, to a method for separating vocal accompaniment, a method for network training, a device, and a storage medium.
Background
Vocal accompaniment separation is an important problem in music signal processing: it separates the human voice and the accompaniment contained in the same material into new, independent audio files, so that a creator can conveniently apply subsequent artistic processing to the accompaniment or the vocals in other styles.
In recent years, neural network-based methods have developed rapidly in this field; their high precision, high performance and strong model expression capability have provided new ideas for it.
However, when a neural network is currently used to separate the voice and the accompaniment contained in the same material, the separated voice and accompaniment exhibit timbre distortion.
Disclosure of Invention
In view of the above problems, the present application has been made to provide a human voice accompaniment separation method, a network training method, an apparatus, and a storage medium that solve or at least partially solve the above problems.
Accordingly, in one embodiment of the present application, a vocal accompaniment separating method is provided. The method comprises the following steps:
determining mixed audio to be separated; the mixed audio is formed by mixing accompaniment and human voice;
generating target audio corresponding to the accompaniment by using a generator in a first generation countermeasure network for the mixed audio; the first generation countermeasure network is obtained by performing countermeasure training according to training samples corresponding to the first generation countermeasure network;
Generating target audio corresponding to the human voice by using a generator in a second generation countermeasure network aiming at the mixed audio; the second generated countermeasure network is obtained by performing countermeasure training according to training samples corresponding to the second generated countermeasure network.
In yet another embodiment of the present application, a network training method is provided. The method comprises the following steps:
constructing a first generation countermeasure network and a second generation countermeasure network;
performing countermeasure training on a generator and a discriminator in the first generation countermeasure network by using training samples corresponding to the first generation countermeasure network until loss functions corresponding to the generator and the discriminator in the first generation countermeasure network converge;
performing countermeasure training on a generator and a discriminator in the second generated countermeasure network by using training samples corresponding to the second generated countermeasure network until loss functions corresponding to the generator and the discriminator in the second generated countermeasure network respectively converge;
the generator in the first generation countermeasure network is used for separating target audio corresponding to the accompaniment from mixed audio formed by mixing the accompaniment and human voice; the generator in the second generation countermeasure network is used for separating target audio corresponding to the human voice from the mixed audio.
In yet another embodiment of the present application, an electronic device is provided. The electronic device includes: a memory and a processor, wherein,
the memory is used for storing programs;
the processor is coupled to the memory for executing the program stored in the memory to implement the method of any one of the above.
In a further embodiment of the application, a computer readable storage medium is provided, storing a computer program, which when executed by a computer is capable of carrying out the method of any one of the above.
In the technical scheme provided by the embodiment of the application, two different trained generation countermeasure networks are used to separate the voice and the accompaniment from the mixed audio respectively; that is, each generation countermeasure network is only responsible for extracting one of the voice and the accompaniment, and therefore only needs to learn the signal distribution characteristics of that one source. Moreover, through countermeasure training, the generator in each generation countermeasure network can learn a more highly abstract signal distribution, which helps to improve the generation quality of the audio corresponding to the voice and the audio corresponding to the accompaniment. Therefore, the technical scheme provided by the embodiment of the application can improve the audio quality of the finally separated human voice and accompaniment.
Drawings
In order to more clearly illustrate the embodiments of the present application or the technical solutions of the prior art, the following description will briefly explain the drawings used in the embodiments or the description of the prior art, and it is obvious that the drawings in the following description are some embodiments of the present application, and other drawings can be obtained according to these drawings without inventive effort for a person skilled in the art.
Fig. 1 is a flowchart of a method for separating vocal accompaniment according to an embodiment of the present application;
FIG. 2 is a schematic diagram illustrating an internal structure of a generator in a first generation countermeasure network according to an embodiment of the present application;
FIG. 3a is a network configuration diagram of a preprocessing sub-network according to an embodiment of the present application;
FIG. 3b is a first internal block diagram of a U-Net network;
FIG. 4a is a schematic diagram illustrating an internal structure of a generator in a second generation countermeasure network according to an embodiment of the present application;
FIG. 4b is a second internal block diagram of the U-Net network;
FIG. 4c is an internal block diagram of a post-processing module;
FIG. 5 is a flowchart of a discrimination process according to an embodiment of the application;
fig. 6 is a flowchart of a method for separating vocal accompaniment according to another embodiment of the present application;
FIG. 7 is a flowchart of a network training method according to an embodiment of the present application;
fig. 8 is a block diagram of an electronic device according to an embodiment of the present application.
Detailed Description
Vocal accompaniment separation is a source separation problem: separating the human voice and the accompaniment from a mixed signal has developed into an important research hotspot in the field of signal processing. Solutions to this problem are applicable in a variety of fields, including song recognition, singer recognition, music style recognition, music restoration, music content detection, and the like. Vocal accompaniment separation can serve as a preprocessing step for these applications, providing them with source signals that are cleaner and closer to the requirements of the application scenario.
Traditional methods based on voiceprint recognition, beamforming or non-negative matrix factorization are simple to implement and computationally cheap, but they have certain limitations. First, they are easily constrained by the input signal, and the separation effect is poor when the energy of the human voice and the accompaniment is unevenly distributed. Second, they are easily disturbed by external factors and cannot effectively deal with noise and distortion. Therefore, how to improve the effect of vocal accompaniment separation to meet the requirements of real-world applications has become an important focus for researchers. With the rapid development of deep learning, the separation problem can naturally be expressed as a supervised learning problem.
Currently, the input of a neural network for vocal accompaniment separation is usually the time-frequency diagram of an audio signal, or features extracted from it, and there are two choices for the output: a time-frequency diagram or a time-frequency mask (Time-Frequency Mask). Because the time-frequency diagram has a wide dynamic range, on the one hand the model needs complex nonlinear operations to separate the frequency components of the voice and the accompaniment, and on the other hand increasing the number of convolution and pooling layers to strengthen the nonlinear processing capability of the neural network also introduces more distortion. Therefore, previous methods tend to predict time-frequency masks. However, it is difficult for the same model both to keep the separated voice free of timbre distortion and to preserve the smoothness and continuity of the accompaniment; these goals constrain each other during training, so it is hard for a single network to reach a balance in which both the separated voice and the separated accompaniment keep high sound quality. Furthermore, most current vocal accompaniment separation algorithms only target mono audio, which makes processing multi-channel audio cumbersome.
To address these problems, the embodiment of the application provides an end-to-end vocal accompaniment separation method based on generation countermeasure networks: two different generation countermeasure networks are used to learn, respectively, the accompaniment signal distribution and the vocal signal distribution in the mixed audio, so that learning is targeted; moreover, through countermeasure training of each generation countermeasure network, its generator can learn a more highly abstract signal distribution. Therefore, the technical scheme provided by the embodiment of the application can improve the audio quality of the finally separated human voice and accompaniment.
In order to enable those skilled in the art to better understand the present application, the following description will clearly and completely describe the technical solution according to the embodiments of the present application according to the accompanying drawings. It will be apparent that the described embodiments are only some, but not all, embodiments of the application. All other embodiments, which can be made by those skilled in the art based on the embodiments of the application without making any inventive effort, are intended to be within the scope of the application.
Furthermore, in some of the flows described in the specification, claims, and drawings above, a plurality of operations occurring in a particular order may be included, and the operations may be performed out of order or concurrently with respect to the order in which they occur. The sequence numbers of operations such as 101, 102, etc. are merely used to distinguish between the various operations, and the sequence numbers themselves do not represent any order of execution. In addition, the flows may include more or fewer operations, and the operations may be performed sequentially or in parallel. It should be noted that, the descriptions of "first" and "second" herein are used to distinguish different messages, devices, modules, etc., and do not represent a sequence, and are not limited to the "first" and the "second" being different types.
It should be noted that, the user information (including but not limited to user equipment information, user personal information, etc.) and the data (including but not limited to data for analysis, stored data, presented data, etc.) related to the present application are information and data authorized by the user or fully authorized by each party, and the collection, use and processing of the related data need to comply with the related laws and regulations and standards of the related country and region, and provide corresponding operation entries for the user to select authorization or rejection.
Fig. 1 is a schematic flow chart of a vocal accompaniment separation method according to an embodiment of the present application. The execution subject of the method can be a client or a server. The client may be hardware integrated on the terminal and provided with an embedded program, or may be application software installed in the terminal, or may be tool software embedded in an operating system of the terminal, which is not limited in the embodiment of the present application. In this embodiment, the client mainly faces the user, and an interface may be provided for the user. The terminal has various implementation forms, such as a smart phone, a smart speaker, a personal computer, a wearable device, a tablet computer and the like. The terminal typically comprises at least one processing unit and at least one memory. The number of processing units and memories depends on the configuration and type of the terminal. The memory may include volatile memory such as RAM, nonvolatile memory such as Read-Only Memory (ROM) or flash memory, or both. The memory typically stores an Operating System (OS), one or more application programs, program data, and the like. In addition to the processing unit and the memory, the terminal includes some basic components, such as a network card chip, an IO bus, audio and video components (e.g. a microphone), and the like. Optionally, the terminal may also include some peripheral devices, such as a keyboard, mouse, stylus, printer, etc. Such peripheral devices are generally known in the art and are not described in detail herein. The server may be any device that can provide computing services, respond to service requests, and perform processing, for example a conventional server, a cloud host, a virtual center, and the like. The server mainly comprises a processor, a hard disk, a memory, a system bus and the like, similar to a general computer architecture. As shown in fig. 1, the method includes:
101. Mixed audio to be separated is determined.
Wherein the mixed audio is formed by mixing accompaniment and human voice.
102. For the mixed audio, generating target audio corresponding to the accompaniment by using a generator in a first generation countermeasure network.
The first generation countermeasure network is obtained by performing countermeasure training according to training samples corresponding to the first generation countermeasure network.
103. For the mixed audio, generating target audio corresponding to the human voice by using a generator in a second generation countermeasure network.
The second generated countermeasure network is obtained by performing countermeasure training according to training samples corresponding to the second generated countermeasure network.
In the above 101, the mixed audio is formed by mixing audio corresponding to accompaniment and audio corresponding to human voice.
The above mixed audio may be a song, and the song is obtained by mixing a human voice and an accompaniment. The voice may be a singing voice in the song. The above mixed audio may also be an audio signal extracted from video.
The above mixed audio may be mono audio or multi-channel audio, which is not limited by the embodiment of the present application. The multi-channel mixed audio includes a plurality of channel audio. In this embodiment, a plurality refers to two or more.
In 102 and 103 above, a generation countermeasure network (Generative Adversarial Network, GAN) is a model composed of two mutually competing neural networks, one called the generator and the other called the discriminator. The goal of the generator is to learn the distribution of the input signal and generate a signal similar to the input signal, and the goal of the discriminator is to learn the distribution of the real signal in order to distinguish the generated signal from the real signal. A GAN can learn the distribution of high-dimensional abstract signals to achieve efficient separation.
The training samples corresponding to the first generation reactance network comprise first sample mixed audio and expected generation audio corresponding to accompaniment in the first sample mixed audio, namely real audio. The training samples corresponding to the second generation countermeasure network comprise second sample mixed audio and expected generated audio corresponding to human voice in the second sample mixed audio, namely real audio.
In practical application, the first sample mixed audio is obtained by mixing the real audio corresponding to the first sample accompaniment and the real audio corresponding to the first sample human voice. Wherein, the real audio corresponding to accompaniment in the first sample mixed audio refers to the real audio corresponding to the accompaniment of the first sample.
The second sample mixed audio is obtained by linearly adding and mixing the real audio corresponding to the second sample accompaniment and the real audio corresponding to the second sample human voice. Wherein, the real audio corresponding to the human voice in the second sample mixed audio refers to the real audio corresponding to the second sample human voice.
It should be noted that the above-mentioned mixing operation may specifically be a linear addition operation.
In practical application, there are multiple training samples corresponding to the first generation countermeasure network, and likewise multiple training samples corresponding to the second generation countermeasure network. The training samples for each network may be constructed from a training data set. The training data mainly use three open-source music data sets: the musdb18 music track data set, the mir1k data set and the synthesized Lakh (Slakh) data set. The musdb18 data set includes 100 songs in total for the training and validation set, and 50 songs for the test set. Each song has 4 tracks: drums, bass, human voice, and other. The acc.wav file in the data set is the accompaniment audio obtained by linearly summing drums, bass and other, and the mix.wav file in the data set is the complete mixed audio. The mir1k data set contains 1000 audio files, obtained by splitting popular Chinese songs sung by non-professional singers into clips several seconds long; it includes accompaniment files and singing (vocal) files in wav format. The synthesized Lakh (Slakh) data set is a new data set for audio separation, synthesized from the Lakh MIDI data set v0.1 using professional sample-based virtual instruments. The tracks in the Slakh data set are divided into training (1500 tracks), validation (375 tracks) and test (225 tracks) subsets.
When the network needs to be trained, real audio corresponding to one human voice and real audio corresponding to one accompaniment can be read randomly from these data sets and then linearly added to form a sample mixed audio, as sketched below.
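As an illustration only, the following sketch shows one way such training pairs could be assembled; the use of torchaudio, the file lists, the sample rate and the segment length are assumptions made for the example and are not specified by the application.

```python
import random

import torch
import torchaudio


def load_random_pair(vocal_files, accompaniment_files, segment_len=4 * 44100):
    """Load one random vocal stem and one random accompaniment stem, cropped to equal length."""
    voc, _ = torchaudio.load(random.choice(vocal_files))          # real audio of a human voice
    acc, _ = torchaudio.load(random.choice(accompaniment_files))  # real audio of an accompaniment
    n = min(voc.shape[-1], acc.shape[-1], segment_len)
    return voc[..., :n], acc[..., :n]


def make_training_sample(vocal_files, accompaniment_files):
    """Linear addition of the two stems forms the sample mixed audio."""
    voc, acc = load_random_pair(vocal_files, accompaniment_files)
    mix = voc + acc
    # mix is the input; acc is the desired output of the first GAN, voc of the second.
    return mix, acc, voc
```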
The goal of the countermeasure training of a generation countermeasure network is to reach a Nash equilibrium, that is: the loss functions corresponding to the generator and the discriminator in the generation countermeasure network converge. When training of the generation countermeasure network is completed, the generator can be put into use to extract the corresponding audio from mixed audio.
Optionally, in the foregoing 102, "generating, for the mixed audio, the target audio corresponding to the accompaniment by using a generator in the first generation countermeasure network" may include: inputting the time domain waveform of the mixed audio into the generator in the first generation countermeasure network, so that the generator generates a target time domain waveform corresponding to the accompaniment according to the time domain waveform; and determining the target audio corresponding to the accompaniment according to the target time domain waveform.
Optionally, the generating, in 103, the target audio corresponding to the voice by using a generator in the second generating countermeasure network, may include: inputting the time domain waveform diagram of the mixed audio to a generator in a second generation countermeasure network, so that the generator generates a target time domain waveform diagram corresponding to the human voice according to the time domain waveform diagram; and determining the target audio corresponding to the voice according to the target time domain waveform diagram.
The scheme is that mixed audio is separated to obtain audio corresponding to accompaniment (or called audio source) and audio corresponding to human voice (or called audio source); wherein, the corresponding audio of accompaniment is also called accompaniment audio; the audio corresponding to the human voice is also called human voice audio.
In the technical scheme provided by the embodiment of the application, two different trained generation countermeasure networks are used to separate the voice and the accompaniment from the mixed audio respectively; that is, each generation countermeasure network is only responsible for extracting one of the voice and the accompaniment, and therefore only needs to learn the signal distribution characteristics of that one source. Moreover, through countermeasure training, the generator in each generation countermeasure network can learn a more highly abstract signal distribution, which helps to improve the generation quality of the audio corresponding to the voice and the audio corresponding to the accompaniment. Therefore, the technical scheme provided by the embodiment of the application can improve the audio quality of the finally separated human voice and accompaniment.
In general, the time domain and the frequency domain are fundamental properties of a signal; the different perspectives used to analyze a signal are referred to as domains. The time domain is usually more vivid and intuitive, while frequency-domain analysis is more concise and makes the analysis of a problem more profound and convenient. Currently, the trend in signal analysis is to move from the time domain to the frequency domain. Accordingly, the step in 102 above, "generating the target audio corresponding to the accompaniment using the generator in the first generation countermeasure network for the mixed audio", may be implemented by the following steps:
1021. Generating a second frequency spectrum corresponding to the accompaniment by using the generator in the first generation countermeasure network according to a first frequency spectrum of the mixed audio.
1022. And generating target audio corresponding to the accompaniment according to the second frequency spectrum corresponding to the accompaniment.
In 1021, a short-time fourier transform (STFT, short-time Fourier transform) may be performed on the mixed audio to obtain a first spectrum of the mixed audio. The first spectrum may include a first magnitude spectrum and a first phase spectrum.
The first spectrum of the mixed audio is input to the generator in the first generation countermeasure network, so that a second spectrum corresponding to the accompaniment is generated by the generator. Specifically, the first amplitude spectrum may be input into the generator in the first generation countermeasure network to generate a second amplitude spectrum corresponding to the accompaniment; the first phase spectrum is input into the generator in the first generation countermeasure network to generate a second phase spectrum corresponding to the accompaniment. Wherein the second spectrum comprises the second amplitude spectrum and the second phase spectrum.
In the above 1022, the target audio corresponding to the accompaniment is generated by Inverse Short Time Fourier Transform (ISTFT) according to the second amplitude spectrum and the second phase spectrum corresponding to the accompaniment.
Optionally, in the step 103, "generating the target audio corresponding to the voice by using a generator in the second generation countermeasure network for the mixed audio", the following steps may be adopted to implement:
1031. and generating a third frequency spectrum corresponding to the human voice by using a generator in a second generation countermeasure network according to the first frequency spectrum of the mixed audio.
1032. And generating target audio corresponding to the voice according to the third frequency spectrum corresponding to the voice.
In 1031, the first spectrum of the mixed audio is input to the generator in the second generation countermeasure network, so that a third spectrum corresponding to the human voice is generated by the generator. Specifically, the first amplitude spectrum may be input into the generator in the second generation countermeasure network to generate a third amplitude spectrum corresponding to the human voice; the first phase spectrum is input into the generator in the second generation countermeasure network to generate a third phase spectrum corresponding to the human voice. Wherein the third spectrum comprises the third amplitude spectrum and the third phase spectrum.
In 1032, according to the third amplitude spectrum and the third phase spectrum corresponding to the human voice, the inverse short-time Fourier transform is performed to generate the target audio corresponding to the human voice, as sketched below.
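The following sketch illustrates this STFT / generator / ISTFT flow for either network; the generator interface (taking and returning an amplitude spectrum and a phase spectrum) and the STFT parameters are assumptions made for the example, not details fixed by the application.

```python
import torch


def separate_source(mixed_wave, generator, n_fft=2048, hop=512):
    """Steps 1021/1022 or 1031/1032 in outline: STFT the mixed audio, run the trained
    generator on the amplitude and phase spectra, then ISTFT back to a waveform."""
    window = torch.hann_window(n_fft)
    spec = torch.stft(mixed_wave, n_fft, hop_length=hop, win_length=n_fft,
                      window=window, return_complex=True)   # first spectrum
    amplitude, phase = spec.abs(), spec.angle()
    out_amp, out_phase = generator(amplitude, phase)         # assumed generator interface
    out_spec = torch.polar(out_amp, out_phase)               # rebuild the complex spectrum
    return torch.istft(out_spec, n_fft, hop_length=hop,
                       win_length=n_fft, window=window)      # target audio
```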
The frequency component distribution of the human voice in the mixed audio is relatively concentrated, stable and regular; whereas the frequency components of accompaniment are distributed over the entire frequency band, with a complex and non-uniform nature. Therefore, it is necessary to design the internal structure of the above-described generator for the human voice section and the accompaniment section, respectively, to improve the sound quality of the separated human voice audio and accompaniment audio.
Because the energy of the voice signal is mainly concentrated in the low-frequency part, the lower-energy high-frequency part is often difficult for a general network to handle well. In order to make the network pay equal attention to the high-frequency part, the high-frequency part and the low-frequency part are divided, and the divided high-frequency part and low-frequency part are respectively fed into different modules for feature extraction. Specifically, as shown in fig. 2, the generator in the first generation countermeasure network includes: a high-low frequency separation module 201, a high-frequency feature extraction module 202, a low-frequency feature extraction module 203, and a high-low frequency fusion module 204.
In 1021, "generating a second spectrum corresponding to the accompaniment by using the generator in the first generation countermeasure network according to the first spectrum of the mixed audio" includes:
S11, according to the first frequency spectrum, the high-frequency characteristic and the low-frequency characteristic corresponding to the mixed audio are separated by the high-frequency and low-frequency separation module.
S12, carrying out accompaniment feature extraction on the high-frequency features by utilizing the high-frequency feature extraction module to obtain accompaniment high-frequency features.
S13, carrying out accompaniment feature extraction on the low-frequency features by utilizing the low-frequency feature extraction module to obtain accompaniment low-frequency features.
S14, carrying out feature fusion on the accompaniment high-frequency features and the accompaniment low-frequency features by utilizing the high-low frequency fusion module so as to obtain a second frequency spectrum corresponding to the accompaniment.
In the above S11, in an example, the high-low frequency separation module may directly separate the high-low frequency signal of the first spectrum, so as to obtain the high-frequency characteristic and the low-frequency characteristic corresponding to the mixed audio.
In another example, the high-low frequency separation module may perform feature extraction on the first spectrum to obtain a first spectrum feature; and then, carrying out high-low frequency signal separation on the first frequency spectrum characteristic to obtain a high-frequency characteristic and a low-frequency characteristic corresponding to the mixed audio.
In an example, the high-low frequency separation module may include: preprocessing the sub-network and separating the network; the preprocessing sub-network can be composed of a 1D convolution and channel normalization layer and is responsible for extracting features of a first frequency spectrum to obtain first frequency spectrum features. The specific structure of the pre-processing sub-network is shown in fig. 3 a.
The separation network is responsible for separating the high-frequency and low-frequency signals of the first frequency spectrum characteristic so as to separate the high-frequency characteristic and the low-frequency characteristic corresponding to the mixed audio.
In S12, the high-frequency feature extraction module is used to perform accompaniment feature extraction on the high-frequency features to obtain the accompaniment high-frequency features.
The high-frequency feature extraction module may include a U-Net network. The U-Net network adopts a "downsampling-upsampling" structure such as the Encoder-Decoder structure shown in fig. 3b; the encoder and decoder are completely symmetrical, the encoder mainly extracts features with good characterization capability, and the decoder is mainly responsible for the separation and reconstruction of the audio. The encoder reduces the dimension of the feature map by downsampling, and the decoder restores the features to the original dimension using nearest-neighbor interpolation. The U-Net network can also comprise an ASPP (Atrous Spatial Pyramid Pooling) layer, which models timing dependencies as a bridge between the encoder and the decoder. In order not to lose the underlying details of the input audio, skip connections are made between corresponding layers of the encoder and the decoder, so that information such as phase or alignment can pass through. In terms of network structure selection, to balance real-time performance and power consumption, a residual (ResNet) structure is selected rather than an RNN-type structure.
The number of U-Net networks in the high-frequency feature extraction module may be one or two. When there are two U-Net networks in the high-frequency feature extraction module, the module further comprises a splicing sub-module. The first U-Net network performs accompaniment feature extraction on the high-frequency features to obtain initial accompaniment high-frequency features; the splicing sub-module splices the initial accompaniment high-frequency features with the high-frequency features to obtain spliced high-frequency features; and the second U-Net network performs accompaniment feature extraction on the spliced high-frequency features to obtain the accompaniment high-frequency features.
In S13, the low-frequency feature extraction module is used to perform accompaniment feature extraction on the low-frequency features to obtain the accompaniment low-frequency features.
The network structure of the low-frequency feature extraction module may be identical to that of the high-frequency feature extraction module, which will not be described in detail in the embodiments of the present application.
In S14, the high-frequency accompaniment feature and the low-frequency accompaniment feature are spliced by using the high-frequency and low-frequency fusion module to obtain a spliced accompaniment feature, and the spliced accompaniment feature is extracted to obtain a second frequency spectrum corresponding to the accompaniment.
Optionally, the high-low frequency fusion module may include: splicing a sub-module and a U-Net network. The splicing sub-module is used for splicing the accompaniment high-frequency characteristic and the accompaniment low-frequency characteristic to obtain a spliced accompaniment characteristic; and the U-Net network is used for extracting the characteristics of the spliced accompaniment characteristics so as to obtain a second frequency spectrum corresponding to the accompaniment.
In this embodiment, the generator in the first generation countermeasure network is implemented with cascaded U-Net networks: its structure is composed of multiple U-Net networks, where different U-Net networks focus on topological structures of different granularity and are independent of each other. By adopting a multi-stage rather than single-stage structure, features can be extracted more accurately, thereby achieving more accurate separation results. A simplified sketch of this split-band generator is given below.
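A minimal PyTorch sketch of this split-band idea follows, assuming the generator operates on magnitude spectrograms; the split point, channel counts, and the plain convolution stacks standing in for the cascaded U-Net blocks and the fusion module are illustrative assumptions rather than the exact architecture of figs. 2 and 3.

```python
import torch
import torch.nn as nn


class HighLowFreqGenerator(nn.Module):
    """Split the spectrum into a low band and a high band, extract accompaniment
    features per band, then splice and fuse (modules 201-204 in outline)."""

    def __init__(self, freq_bins=1025, split_bin=256, channels=32):
        super().__init__()
        self.split_bin = split_bin
        self.low_branch = nn.Sequential(                        # stand-in for the low-frequency U-Net
            nn.Conv1d(split_bin, channels, 3, padding=1), nn.ReLU(),
            nn.Conv1d(channels, split_bin, 3, padding=1))
        self.high_branch = nn.Sequential(                       # stand-in for the high-frequency U-Net
            nn.Conv1d(freq_bins - split_bin, channels, 3, padding=1), nn.ReLU(),
            nn.Conv1d(channels, freq_bins - split_bin, 3, padding=1))
        self.fuse = nn.Conv1d(freq_bins, freq_bins, 1)          # stand-in for the high-low fusion module

    def forward(self, magnitude):                               # magnitude: (batch, freq_bins, frames)
        low = self.low_branch(magnitude[:, :self.split_bin])    # accompaniment low-frequency features
        high = self.high_branch(magnitude[:, self.split_bin:])  # accompaniment high-frequency features
        fused = torch.cat([low, high], dim=1)                   # splice the two bands
        return self.fuse(fused)                                 # second (accompaniment) spectrum
```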
In one implementation, as shown in fig. 4a, the generator in the second generation countermeasure network may include: a preprocessing module 401, a feature extraction module 402, and a post-processing module 403. The preprocessing module 401 may include a preprocessing sub-network, composed of a 1D convolution and a channel normalization layer, which is responsible for extracting features from the first spectrum to obtain the first spectral features; its internal structure is shown in fig. 3a. The feature extraction module 402 may include a U-Net network for performing human voice feature extraction on the first spectral features to obtain human voice features. The internal structure of the U-Net network can be as shown in fig. 3b or fig. 4b; the encoder and decoder in fig. 4b are connected by a bottleneck. The post-processing module is used to determine the third spectrum corresponding to the human voice from the human voice features; it is composed of a layer normalization (LayerNorm) layer, a 2D convolution layer and a LeakyReLU activation layer, with the specific structure shown in fig. 4c. In the process of extracting the human voice from the mixed signal, contextual information has a very important influence on the effectiveness of the algorithm: if the receptive field of the network is too small, the ability to separate is greatly impaired. Therefore, the U-Net network can alternately use dilated convolutions and ordinary convolutions, which greatly enlarges the receptive field of the generator; the dilation rates of the dilated convolutions are gradually increased, for example 1, 3 and 9, and each dilated convolution is followed by an ordinary convolution with a convolution kernel larger than 1.
Dilated convolutions can systematically aggregate key intermediate features in the model and explicitly enlarge the receptive field, so that the network can capture long-term and short-term dependencies to the greatest extent and its learning ability is better stimulated. A sketch of this alternating convolution pattern is given below.
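The sketch below shows the alternating pattern just described, with dilation rates of 1, 3 and 9 and each dilated convolution followed by an ordinary convolution whose kernel is larger than 1; the channel count, kernel sizes and activation are assumptions made for illustration.

```python
import torch.nn as nn


def dilated_block(channels=64, dilations=(1, 3, 9)):
    """Alternate dilated and ordinary 1D convolutions to enlarge the receptive field."""
    layers = []
    for d in dilations:
        layers += [
            nn.Conv1d(channels, channels, kernel_size=3, dilation=d, padding=d),  # dilated convolution
            nn.LeakyReLU(0.2),
            nn.Conv1d(channels, channels, kernel_size=3, padding=1),              # ordinary convolution, kernel > 1
            nn.LeakyReLU(0.2),
        ]
    return nn.Sequential(*layers)
```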
The training process for generating the countermeasure network will be described below, and the training process includes the following steps:
105. and performing countermeasure training on the generator and the discriminator in the target generation countermeasure network according to the training samples corresponding to the target generation countermeasure network until the loss functions corresponding to the generator and the discriminator in the target generation countermeasure network are converged.
Wherein the target generation countermeasure network is any one of the first generation countermeasure network and the second generation countermeasure network.
In 105, for the sample mixed audio in a training sample, the generator is used to generate the corresponding generated audio; and the discriminator is used to judge whether the desired generated audio corresponding to the sample mixed audio and the generated audio are real or fake.
Optionally, constructing a loss function corresponding to the generator according to the discrimination result of the discriminator on the generated audio; and constructing a loss function corresponding to the discriminator according to the discrimination result of the discriminator on the audio to be discriminated. The audio to be discriminated is any one of the generated audio and the desired generated audio.
The convergence of the loss functions corresponding to the generator and the discriminator in the target generation countermeasure network indicates that the target generation countermeasure network has reached a Nash equilibrium, and training can be finished.
In the process of countermeasure training, according to the loss function values corresponding to the generator and the discriminator in the target generation countermeasure network, the network parameters of the generator and the discriminator are optimized alternately using a gradient descent algorithm; a simplified sketch of one such alternating update is given below.
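A simplified sketch of one alternating update is shown below; the generator and discriminator interfaces and the two loss callables are placeholders whose exact forms the application does not prescribe.

```python
def train_step(generator, discriminator, g_opt, d_opt, mix, target,
               adversarial_loss, reconstruction_loss):
    """One alternating update: first the discriminator, then the generator."""
    # Discriminator step: judge desired (real) audio against generated audio.
    fake = generator(mix).detach()
    d_loss = (adversarial_loss(discriminator(target), real=True)
              + adversarial_loss(discriminator(fake), real=False))
    d_opt.zero_grad()
    d_loss.backward()
    d_opt.step()
    # Generator step: fool the discriminator and match the desired audio.
    fake = generator(mix)
    g_loss = (adversarial_loss(discriminator(fake), real=True)
              + reconstruction_loss(fake, target))
    g_opt.zero_grad()
    g_loss.backward()
    g_opt.step()
    return d_loss.item(), g_loss.item()
```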
In an example, the training samples corresponding to the target generation countermeasure network include: sample mixed audio and desired generated audio; the method further comprises the following steps:
106. and constructing a first loss function corresponding to a generator in the target generation countermeasure network according to the discrimination result of the discriminator in the target generation countermeasure network aiming at the generated audio.
Wherein the generated audio is generated for the sample mixed audio by a generator in the target generation countermeasure network.
107. And constructing a second loss function corresponding to a generator in the target generation countermeasure network according to the difference between the generated audio and the expected generated audio.
108. Determining the loss function corresponding to the generator in the target generation countermeasure network according to the first loss function and the second loss function.
When the target generation countermeasure network is the first generation countermeasure network, the expected generation audio refers to expected generation audio corresponding to accompaniment in the sample mixed audio.
When the target generation countermeasure network is the second generation countermeasure network, the desired generation audio refers to desired generation audio corresponding to the human voice in the sample mixed audio.
In the above 106, the discrimination result of the discriminator in the target generation countermeasure network for the generated audio may include: the probability that the generated audio is real audio. The first loss function corresponding to the generator in the target generation countermeasure network is constructed according to this discrimination result; the higher the probability in the discrimination result that the generated audio is real audio, the smaller the value of the first loss function corresponding to the generator.
In 107, a second loss function corresponding to the generator in the target generation countermeasure network may be constructed according to a frequency domain difference and/or a time domain difference between the generated audio and the desired generated audio.
In 108, in one example, a sum of the first and second loss functions may be determined to be the corresponding loss function of the generator in the target generation countermeasure network.
In this embodiment, not only the discrimination result of the discriminator but also the generation loss of the generator is considered, which helps to improve the model training effect and further improve the separation capability of the generator; a sketch of this loss composition is given below.
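The sketch below shows one way the first (adversarial) term and the second (reconstruction) term could be summed into the generator loss; the least-squares form of the adversarial term and the weighting are assumptions, since the application only states the qualitative relationship between the discrimination result and the loss value.

```python
def generator_loss(disc_score_on_fake, generated, desired, second_loss, adv_weight=1.0):
    """First loss: smaller when the discriminator believes the generated audio is real.
    Second loss: a measure of the difference between generated and desired audio."""
    first_loss = ((disc_score_on_fake - 1.0) ** 2).mean()   # assumed least-squares adversarial term
    return adv_weight * first_loss + second_loss(generated, desired)
```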
In one implementation manner, the "constructing the second loss function corresponding to the generator in the target generation countermeasure network according to the difference between the generated audio and the expected generated audio" in the above 107 may be implemented by:
1071. and respectively carrying out short-time Fourier transform on the generated audio based on the transformation parameters corresponding to the various frequency resolutions, so as to obtain fourth frequency spectrums corresponding to the various frequency resolutions.
1072. And respectively carrying out short-time Fourier transform on the expected generated audio based on the transformation parameters corresponding to the plurality of frequency resolutions, so as to obtain a fifth frequency spectrum corresponding to the plurality of frequency resolutions.
1073. Constructing a frequency domain loss function corresponding to a generator in the target generation countermeasure network according to the difference between a fourth frequency spectrum and a fifth frequency spectrum corresponding to each frequency resolution in the plurality of frequency resolutions; wherein the second loss function includes the frequency domain loss function.
In 1071, the transformation parameters may include: frame length, window length (window length is equal to frame length), frame shift (frame shift can be set to half the window size). The parameter values of the transformation parameters corresponding to different frequency resolutions are different. In general, the shorter the frame length, the smaller the frequency resolution corresponding to the frequency spectrum and the larger the time resolution; the longer the frame length, the greater the frequency resolution corresponding to the spectrum and the smaller the time resolution.
The plurality of frequency resolutions includes a first frequency resolution, which refers to any one of the plurality of frequency resolutions. Specifically, short-time fourier transform is performed on the generated audio based on a transformation parameter corresponding to the first frequency resolution, so as to obtain a fourth frequency spectrum corresponding to the first frequency resolution.
Specifically, in 1072, short-time fourier transform is performed on the desired generated audio based on the transform parameter corresponding to the first frequency resolution, so as to obtain a fifth frequency spectrum corresponding to the first frequency resolution.
In the above 1073, specifically, a frequency domain loss function corresponding to the first frequency resolution is constructed according to the difference between the fourth spectrum and the fifth spectrum corresponding to the first frequency resolution; the frequency domain loss function corresponding to the generator in the target generation countermeasure network is then constructed according to the frequency domain loss functions corresponding to each of the plurality of frequency resolutions; wherein the second loss function includes the frequency domain loss function. In one embodiment, the sum of the frequency domain loss functions corresponding to the plurality of frequency resolutions may be determined as the frequency domain loss function corresponding to the generator in the target generation countermeasure network.
In this embodiment, the frequency domain loss function may be referred to as a multi-resolution short-time Fourier transform loss, and can effectively capture the time-frequency distribution of the real speech waveform. Therefore, a small number of network parameters can be used, the model training difficulty is reduced, the inference time is effectively shortened, and the perceptual quality of the voice is improved. A sketch of this loss is given below.
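A sketch of such a multi-resolution STFT loss follows; the particular (n_fft, hop, window) triples and the use of an L1 distance between magnitude spectra are assumptions made for illustration.

```python
import torch


def stft_magnitude(x, n_fft, hop, win):
    window = torch.hann_window(win, device=x.device)
    return torch.stft(x, n_fft, hop_length=hop, win_length=win,
                      window=window, return_complex=True).abs()


def multi_resolution_stft_loss(generated, desired,
                               resolutions=((512, 128, 512),
                                            (1024, 256, 1024),
                                            (2048, 512, 2048))):
    """Sum the spectral differences between generated and desired audio over several
    STFT parameter sets (steps 1071-1073 in outline)."""
    loss = 0.0
    for n_fft, hop, win in resolutions:
        g = stft_magnitude(generated, n_fft, hop, win)   # fourth spectrum at this resolution
        d = stft_magnitude(desired, n_fft, hop, win)     # fifth spectrum at this resolution
        loss = loss + (g - d).abs().mean()
    return loss
```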
In another implementation manner, the "constructing the second loss function corresponding to the generator in the target generation countermeasure network according to the difference between the generated audio and the expected generated audio" in the above 107 may be implemented by:
1074. and determining a time domain loss function according to the difference of the generated audio and the expected generated audio in the time domain.
Wherein the second loss function includes the time domain loss function.
In the above 1074, specifically, a time domain loss function is determined according to a difference between the time domain waveform map of the generated audio and the time domain waveform map of the desired generated audio.
The time domain loss function may capture the manhattan distance (also referred to as the L1 distance) between the real waveform point and the network output waveform point.
In yet another implementation, the second loss function is formed by the time domain loss function and the frequency domain loss function together. For example: the sum of the time domain loss function and the frequency domain loss function may be taken as a second loss function.
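The sketch below combines the time-domain L1 term with a frequency-domain term (for example the multi-resolution STFT loss sketched earlier, passed in as a callable); equal weighting of the two terms is an assumption.

```python
import torch


def time_domain_loss(generated: torch.Tensor, desired: torch.Tensor) -> torch.Tensor:
    """L1 (Manhattan) distance between the generated and the desired waveform points."""
    return (generated - desired).abs().mean()


def combined_second_loss(generated, desired, frequency_domain_loss):
    """Second loss formed by the time-domain loss and a frequency-domain loss together."""
    return time_domain_loss(generated, desired) + frequency_domain_loss(generated, desired)
```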
In the course of countermeasure training, the discrimination capability of the discriminator directly influences the training effect of the generator: the stronger the discrimination capability of the discriminator, the better the generator will be trained. Thus, in order to improve the discrimination capability of the discriminator, as shown in fig. 5, the above-mentioned discriminator may include a plurality of sub-discriminators, where the sixth spectra of the audio to be discriminated that are input to different sub-discriminators have different frequency resolutions; the audio to be discriminated is either the generated audio or the desired generated audio in a training sample corresponding to the target generation countermeasure network; and the loss function corresponding to the discriminator in the target generation countermeasure network is constructed according to the discrimination results of the plurality of sub-discriminators for the audio to be discriminated.
Specifically, the discriminator in the target generation countermeasure network includes: a plurality of sub-discriminators; the plurality of sub-discriminators are in one-to-one correspondence with a plurality of frequency resolutions; the method further comprises the following steps:
109. and carrying out short-time Fourier transform on the audio to be discriminated according to the transformation parameters corresponding to the frequency resolutions, so as to obtain a sixth frequency spectrum corresponding to the frequency resolutions.
That is, the above-described step 109 corresponds to step S21 in fig. 5.
110. And inputting the sixth frequency spectrums corresponding to the various frequency spectrum resolutions into the discriminators in a one-to-one correspondence manner so as to obtain discrimination results of the discriminators.
That is, the above-described step 110 corresponds to steps S22 and S23 in fig. 5.
111. And constructing a loss function corresponding to the discriminator in the target generation countermeasure network according to the discrimination results of the discriminators.
In the above 109 and 110, the plurality of sub-discriminators includes a first sub-discriminator; the first sub-discriminator is any one of the plurality of sub-discriminators and corresponds to a first frequency resolution. Specifically, according to the transformation parameters corresponding to the first frequency resolution, a short-time Fourier transform is performed on the audio to be discriminated to obtain a sixth spectrum corresponding to the first frequency resolution; the sixth spectrum corresponding to the first frequency resolution is input into the first sub-discriminator to discriminate the audio to be discriminated, thereby obtaining a first discrimination result output by the first sub-discriminator.
In the above 111, a final discrimination result may be determined according to the discrimination results of the plurality of sub-discriminators, and the loss function corresponding to the discriminator in the target generation countermeasure network may be constructed according to the difference between the final discrimination result and the expected discrimination result of the audio to be discriminated. The expected discrimination result of the audio to be discriminated indicates whether the audio to be discriminated is real or fake: when the audio to be discriminated is the desired generated audio (i.e. the real audio) in the training sample, it is real; when the audio to be discriminated is the generated audio, it is fake.
All the sub-discriminators are built entirely from non-causal convolutions. As shown in fig. 5, the sub-discriminators have different numbers of convolution layers: the larger the frequency resolution corresponding to a sub-discriminator, the more information its input contains and the more convolution layers it has, which enhances the discriminator's ability to distinguish synthesized audio from real audio. A sketch of such a bank of sub-discriminators is given below.
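The following sketch shows one possible bank of spectrogram sub-discriminators in which higher-resolution inputs get more convolution layers; the specific resolutions, layer counts and channel widths are assumptions made for illustration.

```python
import torch
import torch.nn as nn


class SpectrogramSubDiscriminator(nn.Module):
    """One sub-discriminator: non-causal 2D convolutions over a magnitude spectrogram."""

    def __init__(self, n_layers=3, channels=32):
        super().__init__()
        layers, in_ch = [], 1
        for _ in range(n_layers):
            layers += [nn.Conv2d(in_ch, channels, kernel_size=3, stride=2, padding=1),
                       nn.LeakyReLU(0.2)]
            in_ch = channels
        layers.append(nn.Conv2d(in_ch, 1, kernel_size=3, padding=1))   # real/fake score map
        self.net = nn.Sequential(*layers)

    def forward(self, wave, n_fft, hop):                 # wave: (batch, time)
        window = torch.hann_window(n_fft, device=wave.device)
        spec = torch.stft(wave, n_fft, hop_length=hop, win_length=n_fft,
                          window=window, return_complex=True).abs()    # sixth spectrum
        return self.net(spec.unsqueeze(1))               # add a channel dimension


class MultiResolutionDiscriminator(nn.Module):
    """Bank of sub-discriminators, one per STFT resolution (steps 109-111 in outline)."""

    def __init__(self, resolutions=((512, 128, 2), (1024, 256, 3), (2048, 512, 4))):
        super().__init__()
        self.stft_params = [(n_fft, hop) for n_fft, hop, _ in resolutions]
        self.subs = nn.ModuleList(
            [SpectrogramSubDiscriminator(n_layers=n_layers) for _, _, n_layers in resolutions])

    def forward(self, wave):
        return [sub(wave, n_fft, hop)
                for sub, (n_fft, hop) in zip(self.subs, self.stft_params)]
```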
The following describes a separation scheme of vocal accompaniment according to an embodiment of the present application with reference to fig. 6:
601. mixed audio is acquired.
602. STFT transformation is performed to obtain an amplitude spectrum and a phase spectrum of the mixed audio.
603. The amplitude spectrum and the phase spectrum of the mixed audio are input to the generator in the first generation countermeasure network.
604. And obtaining the amplitude spectrum and the phase spectrum of the accompaniment audio.
605. ISTFT conversion is carried out on the amplitude spectrum and the phase spectrum of the accompaniment audio so as to obtain the accompaniment audio.
606. The amplitude spectrum and the phase spectrum of the mixed audio are input to a generator in a second generation countermeasure network.
607. And obtaining the amplitude spectrum and the phase spectrum of the human voice audio.
608. ISTFT conversion is carried out on the amplitude spectrum and the phase spectrum of the human voice audio to obtain the human voice audio.
It should be noted that the above mixed audio may be mono audio or multi-channel audio. When the mixed audio is multi-channel audio, the spectra corresponding to the individual channels can be fed into the corresponding generator together as multiple input channels, so that the network can better separate the multi-channel audio by utilizing the information of adjacent channels. This also removes the inconvenience of handling multi-channel audio separately and, compared with a mono separation algorithm, improves the efficiency of vocal accompaniment separation. A sketch of this channel stacking is given below.
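A small sketch of this per-channel stacking is given below; the STFT parameters and the use of magnitude spectrograms are assumptions made for the example.

```python
import torch


def multichannel_spectrogram(mixed_wave, n_fft=2048, hop=512):
    """Compute an STFT per channel and stack the magnitude spectrograms along the
    channel dimension, so one generator sees all channels at once.
    mixed_wave: (channels, time)."""
    window = torch.hann_window(n_fft)
    specs = [torch.stft(ch, n_fft, hop_length=hop, win_length=n_fft,
                        window=window, return_complex=True).abs()
             for ch in mixed_wave]
    return torch.stack(specs, dim=0)    # (channels, freq_bins, frames) -> generator input
```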
The mixed audio is decomposed into two parts, namely voice and accompaniment, similar to the conversion from image to image, so that the spectrogram of the mixed signal is converted into a time-frequency spectrogram of pure voice and a time-frequency spectrogram of pure accompaniment. The use of image-to-image conversion can create fine and accurate detail information required for high quality audio reproduction.
In addition, the output of the generator of the application can be a time-frequency mask, and the product of the time-frequency mask and the time-frequency spectrum input to the generator is used as the time-frequency spectrogram of the corresponding audio.
The mixed audio may be a song or mixed audio extracted from a video. Mixed audio may be extracted from video using the multimedia processing tool FFmpeg.
Fig. 7 is a schematic flow chart of a network training method according to another embodiment of the present application. The execution subject of the method can be the client or the server described above. As shown in fig. 7, the method includes:
701. A first generative adversarial network and a second generative adversarial network are constructed.
702. The generator and discriminator in the first generative adversarial network are adversarially trained with the training samples corresponding to the first generative adversarial network until the loss functions corresponding to the generator and the discriminator in the first generative adversarial network converge.
703. The generator and discriminator in the second generative adversarial network are adversarially trained with the training samples corresponding to the second generative adversarial network until the loss functions corresponding to the generator and the discriminator in the second generative adversarial network converge.
The generator in the first generative adversarial network is used to separate, from mixed audio formed by mixing accompaniment and human voice, the target audio corresponding to the accompaniment; the generator in the second generative adversarial network is used to separate the target audio corresponding to the human voice from the mixed audio.
In 701, after the first generative adversarial network and the second generative adversarial network are constructed, the network parameters of both networks may be initialized.
For specific implementations of 702 and 703, reference may be made to the corresponding content in the foregoing embodiments, which is not repeated here.
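For orientation only, a compressed sketch of 702/703 for one of the two networks is given below. It assumes PyTorch, reuses the loss helpers sketched earlier, and assumes a data loader yielding (sample mixed audio spectrum, expected generated audio spectrum) pairs; for brevity the same generated spectrum is fed to every sub-discriminator, whereas the scheme described above would recompute the STFT at each discriminator's frequency resolution.

```python
import torch
import torch.nn.functional as F

def train_gan(generator, sub_discriminators, dataloader,
              epochs: int = 100, lambda_recon: float = 10.0):
    g_opt = torch.optim.Adam(generator.parameters(), lr=1e-4)
    d_opt = torch.optim.Adam([p for d in sub_discriminators for p in d.parameters()],
                             lr=1e-4)

    # in practice training continues until both loss functions converge (702/703);
    # a fixed epoch count is used here only to keep the sketch short
    for _ in range(epochs):
        for mix_spec, target_spec in dataloader:
            fake_spec = generator(mix_spec)
            n = len(sub_discriminators)

            # discriminator step: tell expected generated (real) audio from generated audio
            d_loss = discriminator_loss(sub_discriminators,
                                        [target_spec] * n,
                                        [fake_spec.detach()] * n)
            d_opt.zero_grad(); d_loss.backward(); d_opt.step()

            # generator step: adversarial term (first loss) plus reconstruction term (second loss)
            g_loss = generator_adversarial_loss(sub_discriminators, [fake_spec] * n)
            g_loss = g_loss + lambda_recon * F.l1_loss(fake_spec, target_spec)
            g_opt.zero_grad(); g_loss.backward(); g_opt.step()

# 702: train_gan(accompaniment_generator, accompaniment_discriminators, accompaniment_loader)
# 703: train_gan(vocal_generator, vocal_discriminators, vocal_loader)
```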
In the technical scheme provided by this embodiment of the application, two different generative adversarial networks are used to learn, respectively, the accompaniment signal distribution and the human voice signal distribution in the mixed audio, so that each network learns its target in a focused way; moreover, adversarial training allows the generators to learn a more abstract representation of the signal distribution. The technical scheme can therefore improve the audio quality of the finally separated human voice and accompaniment.
It should be noted that details of each step of the method provided in this embodiment may be found in the corresponding parts of the above embodiments and are not repeated here. In addition, the method provided in this embodiment may further include some or all of the other steps of the above embodiments; for details, reference may likewise be made to the corresponding content of the above embodiments.
Table 1 lists listeners' scores for the separation results of the vocal accompaniment separation method provided by this embodiment of the present application and of two existing vocal accompaniment separation methods. A score of 1 indicates very poor, 2 poor, 3 fair, 4 good, and 5 very good; the scores are averaged, so each result lies between 1 and 5, and a higher score indicates better sound quality.
Table 1:
The table shows that the vocal accompaniment separation method based on generative adversarial networks can effectively separate the human voice and the accompaniment music in songs or in video and audio material, providing a feasible solution for making effective use of music audio data.
In this embodiment of the application, separate models are designed for extracting the human voice and the accompaniment, so that each network model can capture the complex structure of the input signal and automatically learn an effective feature representation; the details of accompaniment and voice are therefore better preserved and the sound quality is higher. By performing data-driven learning on a large number of training samples, noise and similar interference are effectively suppressed, so the algorithm is comparatively robust while the separation accuracy is improved.
Fig. 8 is a schematic structural diagram of an electronic device according to an embodiment of the present application. As shown in fig. 8, the electronic device includes a memory 1101 and a processor 1102. The memory 1101 may be configured to store various other data to support operation of the electronic device; examples of such data include instructions for any application or method operating on the electronic device. The memory 1101 may be implemented by any type of volatile or non-volatile memory device, or a combination thereof, such as static random access memory (SRAM), electrically erasable programmable read-only memory (EEPROM), erasable programmable read-only memory (EPROM), programmable read-only memory (PROM), read-only memory (ROM), magnetic memory, flash memory, a magnetic disk, or an optical disk.
The memory 1101 is configured to store a program;
the processor 1102 is coupled to the memory 1101 and is configured to execute the program stored in the memory 1101 so as to implement the methods provided in the above method embodiments.
As shown in fig. 8, the electronic device further includes a communication component 1103, a display 1104, a power component 1105, an audio component 1106, and other components. Only some of the components are schematically shown in fig. 8, which does not mean that the electronic device includes only these components.
Accordingly, the present application further provides a computer-readable storage medium storing a computer program which, when executed, can implement the steps or functions of the methods provided by the above method embodiments.
The apparatus embodiments described above are merely illustrative. The units described as separate components may or may not be physically separate, and the components shown as units may or may not be physical units; they may be located in one place or distributed over a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this embodiment. Those of ordinary skill in the art can understand and implement the present application without undue effort.
From the above description of the embodiments, it will be apparent to those skilled in the art that the embodiments may be implemented by software plus a necessary general hardware platform, or, of course, by hardware. Based on such understanding, the foregoing technical solution, in essence or in the part contributing to the prior art, may be embodied in the form of a software product stored in a computer-readable storage medium, such as a ROM (read-only memory), a RAM (random access memory), a magnetic disk, or an optical disk, including several instructions for causing a computer device (which may be a personal computer, a server, a network device, or the like) to execute the methods described in the respective embodiments or in some parts of the embodiments.
Finally, it should be noted that the above embodiments are only intended to illustrate the technical solution of the present application, not to limit it. Although the application has been described in detail with reference to the foregoing embodiments, those of ordinary skill in the art will understand that the technical solutions described in the foregoing embodiments can still be modified, or some of their technical features can be replaced by equivalents, and such modifications and substitutions do not depart from the spirit and scope of the technical solutions of the embodiments of the present application.

Claims (13)

1. A method of separating vocal accompaniment, comprising:
determining mixed audio to be separated, the mixed audio being formed by mixing accompaniment and human voice;
generating, for the mixed audio, target audio corresponding to the accompaniment by using a generator in a first generative adversarial network, the first generative adversarial network being obtained by adversarial training on training samples corresponding to the first generative adversarial network; and
generating, for the mixed audio, target audio corresponding to the human voice by using a generator in a second generative adversarial network, the second generative adversarial network being obtained by adversarial training on training samples corresponding to the second generative adversarial network.
2. The method of claim 1, wherein generating, for the mixed audio, target audio corresponding to the accompaniment by using a generator in a first generative adversarial network comprises:
generating, according to a first frequency spectrum of the mixed audio, a second frequency spectrum corresponding to the accompaniment by using the generator in the first generative adversarial network; and
generating the target audio corresponding to the accompaniment according to the second frequency spectrum corresponding to the accompaniment.
3. The method of claim 2, wherein the generator in the first generative adversarial network comprises: a high- and low-frequency separation module, a high-frequency feature extraction module, a low-frequency feature extraction module, and a high- and low-frequency fusion module;
and wherein generating, according to the first frequency spectrum of the mixed audio, the second frequency spectrum corresponding to the accompaniment by using the generator in the first generative adversarial network comprises:
separating, according to the first frequency spectrum, the high-frequency features and low-frequency features corresponding to the mixed audio by using the high- and low-frequency separation module;
performing accompaniment feature extraction on the high-frequency features by using the high-frequency feature extraction module to obtain accompaniment high-frequency features;
performing accompaniment feature extraction on the low-frequency features by using the low-frequency feature extraction module to obtain accompaniment low-frequency features; and
performing feature fusion on the accompaniment high-frequency features and the accompaniment low-frequency features by using the high- and low-frequency fusion module to obtain the second frequency spectrum corresponding to the accompaniment.
4. The method according to any one of claims 1 to 3, wherein generating, for the mixed audio, target audio corresponding to the human voice by using a generator in a second generative adversarial network comprises:
generating, according to the first frequency spectrum of the mixed audio, a third frequency spectrum corresponding to the human voice by using the generator in the second generative adversarial network; and
generating the target audio corresponding to the human voice according to the third frequency spectrum corresponding to the human voice.
5. The method according to any one of claims 1 to 3, further comprising:
performing adversarial training on the generator and the discriminator in a target generative adversarial network according to training samples corresponding to the target generative adversarial network until the loss functions corresponding to the generator and the discriminator in the target generative adversarial network converge;
wherein the target generative adversarial network is either of the first generative adversarial network and the second generative adversarial network.
6. The method of claim 5, wherein the training samples corresponding to the target generative adversarial network comprise sample mixed audio and expected generated audio, and the method further comprises:
constructing a first loss function corresponding to the generator in the target generative adversarial network according to a discrimination result of the discriminator in the target generative adversarial network for generated audio, the generated audio being generated for the sample mixed audio by the generator in the target generative adversarial network;
constructing a second loss function corresponding to the generator in the target generative adversarial network according to a difference between the generated audio and the expected generated audio; and
determining the loss function corresponding to the generator in the target generative adversarial network according to the first loss function and the second loss function.
7. The method of claim 6, wherein constructing the second loss function corresponding to the generator in the target generative adversarial network according to the difference between the generated audio and the expected generated audio comprises:
performing a short-time Fourier transform on the generated audio based on transform parameters corresponding to a plurality of frequency resolutions, respectively, to obtain fourth frequency spectra corresponding to the plurality of frequency resolutions;
performing a short-time Fourier transform on the expected generated audio based on the transform parameters corresponding to the plurality of frequency resolutions, respectively, to obtain fifth frequency spectra corresponding to the plurality of frequency resolutions; and
constructing a frequency-domain loss function corresponding to the generator in the target generative adversarial network according to the difference between the fourth frequency spectrum and the fifth frequency spectrum corresponding to each of the plurality of frequency resolutions, wherein the second loss function comprises the frequency-domain loss function.
8. The method of claim 6, wherein constructing the second loss function corresponding to the generator in the target generative adversarial network according to the difference between the generated audio and the expected generated audio comprises:
determining a time-domain loss function according to the difference between the generated audio and the expected generated audio in the time domain, wherein the second loss function comprises the time-domain loss function.
9. The method of claim 5, wherein the discriminator in the target generative adversarial network comprises a plurality of discriminators in one-to-one correspondence with a plurality of frequency resolutions, and the method further comprises:
performing a short-time Fourier transform on audio to be discriminated according to transform parameters corresponding to the plurality of frequency resolutions to obtain sixth frequency spectra corresponding to the plurality of frequency resolutions;
inputting the sixth frequency spectra corresponding to the plurality of frequency resolutions into the plurality of discriminators in one-to-one correspondence to obtain discrimination results of the plurality of discriminators; and
constructing the loss function corresponding to the discriminator in the target generative adversarial network according to the discrimination results of the plurality of discriminators.
10. The method according to any one of claims 1 to 3, wherein the training samples corresponding to the first generative adversarial network comprise first sample mixed audio and expected generated audio corresponding to the accompaniment in the first sample mixed audio;
and the training samples corresponding to the second generative adversarial network comprise second sample mixed audio and expected generated audio corresponding to the human voice in the second sample mixed audio.
11. A method of network training, comprising:
constructing a first generative adversarial network and a second generative adversarial network;
performing adversarial training on the generator and the discriminator in the first generative adversarial network by using training samples corresponding to the first generative adversarial network until the loss functions corresponding to the generator and the discriminator in the first generative adversarial network converge; and
performing adversarial training on the generator and the discriminator in the second generative adversarial network by using training samples corresponding to the second generative adversarial network until the loss functions corresponding to the generator and the discriminator in the second generative adversarial network converge;
wherein the generator in the first generative adversarial network is used to separate, from mixed audio formed by mixing accompaniment and human voice, target audio corresponding to the accompaniment, and the generator in the second generative adversarial network is used to separate target audio corresponding to the human voice from the mixed audio.
12. An electronic device, comprising a memory and a processor, wherein
the memory is configured to store a program; and
the processor, coupled to the memory, is configured to execute the program stored in the memory to implement the method of any one of claims 1 to 11.
13. A computer-readable storage medium storing a computer program, wherein the computer program, when executed by a computer, is capable of implementing the method of any one of claims 1 to 11.
CN202310575515.3A 2023-05-19 2023-05-19 Separation method of vocal accompaniment, network training method, device and storage medium Pending CN116778946A (en)

Priority Applications (1)

CN202310575515.3A (CN116778946A) — Separation method of vocal accompaniment, network training method, device and storage medium

Applications Claiming Priority (1)

CN202310575515.3A (CN116778946A) — Separation method of vocal accompaniment, network training method, device and storage medium

Publications (1)

CN116778946A — published 2023-09-19

Family

ID=87995431

Family Applications (1)

CN202310575515.3A (CN116778946A) — priority/filing date 2023-05-19 — Separation method of vocal accompaniment, network training method, device and storage medium

Country Status (1)

CN: CN116778946A (en)


Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117592384A (en) * 2024-01-19 2024-02-23 广州市车厘子电子科技有限公司 Active sound wave generation method based on generation countermeasure network
CN117592384B (en) * 2024-01-19 2024-05-03 广州市车厘子电子科技有限公司 Active sound wave generation method based on generation countermeasure network


Legal Events

PB01 — Publication
SE01 — Entry into force of request for substantive examination