CN111370019A - Sound source separation method and device, and model training method and device of neural network - Google Patents
Sound source separation method and device, and model training method and device of neural network
- Publication number
- CN111370019A (application CN202010136342.1A)
- Authority
- CN
- China
- Prior art keywords
- current training
- training
- audio
- sound source
- neural network
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L21/00—Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
- G10L21/02—Speech enhancement, e.g. noise reduction or echo cancellation
- G10L21/0272—Voice signal separating
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F17/00—Digital computing or data processing equipment or methods, specially adapted for specific functions
- G06F17/10—Complex mathematical operations
- G06F17/14—Fourier, Walsh or analogous domain transformations, e.g. Laplace, Hilbert, Karhunen-Loeve, transforms
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/21—Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
- G06F18/214—Generating training patterns; Bootstrap methods, e.g. bagging or boosting
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/24—Classification techniques
- G06F18/241—Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
- G06F18/2415—Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on parametric or probabilistic models, e.g. based on likelihood ratio or false acceptance rate versus a false rejection rate
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/27—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique
- G10L25/30—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique using neural networks
Abstract
A sound source separation method, a neural network model training method, a sound source separation device, a neural network model training device, and a storage medium. The sound source separation method includes: acquiring mixed audio; determining a sound source tag group corresponding to the mixed audio; determining a condition vector group according to the sound source tag group; and inputting the condition vector group and the mixed audio into a first neural network for sound source separation processing to obtain a target sound source group, wherein the target sound sources in the target sound source group are in one-to-one correspondence with the condition vectors of the condition vector group.
Description
Technical Field
Embodiments of the present disclosure relate to a sound source separation method, a neural network model training method, a sound source separation device, a neural network model training device, and a storage medium.
Background
Sound source separation is a technique for separating the sound sources in a sound recording. Sound source separation is the basis of a Computational Auditory Scene Analysis (CASA) system. In essence, a CASA system aims to separate the sound sources in mixed audio in the same way as a human listener does. The CASA system may detect and separate the mixed audio to obtain different sound sources. Because of the large number of sound events in the world, multiple different sound events may occur simultaneously, leading to the well-known cocktail party problem. Sound source separation can be performed using unsupervised methods such as harmonic structure modeling, neural-network-based methods, and the like. Neural-network-based methods include fully connected neural networks, convolutional neural networks (CNN), recurrent neural networks (RNN), and the like.
Disclosure of Invention
This summary is provided to introduce a selection of concepts in a simplified form that are further described below in the detailed description. This summary is not intended to identify key features or essential features of the claimed subject matter, nor is it intended to be used to limit the scope of the claimed subject matter.
At least one embodiment of the present disclosure provides a sound source separation method, including: acquiring mixed audio; determining a sound source tag group corresponding to the mixed audio; determining a condition vector group according to the sound source tag group; and inputting the condition vector group and the mixed audio into a first neural network for sound source separation processing to obtain a target sound source group, wherein the target sound sources in the target sound source group are in one-to-one correspondence with the condition vectors of the condition vector group.
At least one embodiment of the present disclosure further provides a method for training a neural network model, including: obtaining a training sample set, wherein the training sample set comprises a plurality of training data sets, each training data set comprises training mixed audio, a plurality of training audio segments and a plurality of first training condition vectors, the training mixed audio comprises the plurality of training audio segments, and the plurality of first training condition vectors are in one-to-one correspondence with the plurality of training audio segments; training a first neural network to be trained by using the training sample set to obtain a first neural network, wherein the first neural network to be trained comprises a loss function, and training the first neural network to be trained by using the training sample set to obtain the first neural network comprises: obtaining a current training data set from the training sample set, wherein the current training data set comprises a current training mixed audio and a plurality of current training audio segments, and the current training mixed audio comprises the plurality of current training audio segments; determining a plurality of first current training condition vectors which are in one-to-one correspondence with the plurality of current training audio segments, wherein the current training data set further comprises the plurality of first current training condition vectors, and inputting the current training mixed audio and the plurality of first current training condition vectors into the first neural network to be trained for sound source separation processing to obtain a plurality of first current training target sound sources; calculating a first loss value of a loss function of the first neural network to be trained according to the first current training target sound sources and the current training audio segments; and correcting parameters of the first neural network to be trained according to the first loss value, obtaining the trained first neural network when the loss function meets a preset condition, and continuously inputting the current training data set to repeatedly execute the training process when the loss function does not meet the preset condition.
At least one embodiment of the present disclosure further provides a sound source separation apparatus, including: a memory for non-transitory storage of computer readable instructions; and a processor for executing the computer readable instructions, the computer readable instructions being executed by the processor to perform the sound source separation method according to any of the above embodiments.
At least one embodiment of the present disclosure further provides a model training apparatus, including: a memory for non-transitory storage of computer readable instructions; and a processor for executing the computer-readable instructions, wherein the computer-readable instructions, when executed by the processor, perform the model training method according to any of the above embodiments.
At least one embodiment of the present disclosure also provides a storage medium that non-transitory stores computer readable instructions that when executed by a computer can perform the sound source separation method according to any one of the above embodiments.
Drawings
The above and other features, advantages and aspects of various embodiments of the present disclosure will become more apparent by referring to the following detailed description when taken in conjunction with the accompanying drawings. Throughout the drawings, the same or similar reference numbers refer to the same or similar elements. It should be understood that the drawings are schematic and that elements and features are not necessarily drawn to scale.
Fig. 1 is a schematic flow chart of a sound source separation method according to at least one embodiment of the present disclosure;
fig. 2 is a schematic diagram of a first neural network according to at least one embodiment of the present disclosure;
fig. 3 is a schematic flow chart of a method for training a neural network according to at least one embodiment of the present disclosure;
FIG. 4 is a schematic diagram of a first current training audio clip, a first current training audio anchor vector corresponding to the first current training audio clip, a second current training audio clip, and a second current training audio anchor vector corresponding to the second current training audio clip, according to at least one embodiment of the present disclosure;
FIG. 5 is a schematic diagram of separating a sound source from a mixture of a first current training audio segment and a second current training audio segment according to at least one embodiment of the present disclosure;
fig. 6 is a schematic block diagram of a sound source separation apparatus according to at least one embodiment of the present disclosure;
fig. 7 is a schematic block diagram of a model training apparatus according to at least one embodiment of the present disclosure;
FIG. 8 is a schematic block diagram of a storage medium provided in at least one embodiment of the present disclosure;
fig. 9 is a schematic structural diagram of an electronic device according to at least one embodiment of the present disclosure.
Detailed Description
Embodiments of the present disclosure will be described in more detail below with reference to the accompanying drawings. While certain embodiments of the present disclosure are shown in the drawings, it is to be understood that the present disclosure may be embodied in various forms and should not be construed as limited to the embodiments set forth herein, but rather are provided for a more thorough and complete understanding of the present disclosure. It should be understood that the drawings and embodiments of the disclosure are for illustration purposes only and are not intended to limit the scope of the disclosure.
It should be understood that the various steps recited in the method embodiments of the present disclosure may be performed in a different order, and/or performed in parallel. Moreover, method embodiments may include additional steps and/or omit performing the illustrated steps. The scope of the present disclosure is not limited in this respect.
The term "include" and variations thereof as used herein are open-ended, i.e., "including but not limited to". The term "based on" is "based, at least in part, on". The term "one embodiment" means "at least one embodiment"; the term "another embodiment" means "at least one additional embodiment"; the term "some embodiments" means "at least some embodiments". Relevant definitions for other terms will be given in the following description.
It should be noted that the terms "first", "second", and the like in the present disclosure are only used for distinguishing different devices, modules or units, and are not used for limiting the order or interdependence relationship of the functions performed by the devices, modules or units.
It is noted that references to "a", "an", and "the" modifications in this disclosure are intended to be illustrative rather than limiting, and that those skilled in the art will recognize that "one or more" may be used unless the context clearly dictates otherwise.
The names of messages or information exchanged between devices in the embodiments of the present disclosure are for illustrative purposes only, and are not intended to limit the scope of the messages or information.
Currently, most sound source separation systems are designed to separate a specific sound source type, such as speech or music. A CASA system, by contrast, needs to separate a large number of sound sources, i.e., it faces an open sound source separation problem. There may be hundreds of sound source types in the real world, which greatly increases the difficulty of separating all of these sound sources. Existing sound source separation systems require training on pairs consisting of a clean sound and a mixed sound that includes that clean sound. For example, to separate the human voice from music, it is necessary to train with pairs of mixed audio (containing both human voice and music) and the corresponding pure human voice. However, no data set currently provides clean sound for a large number of sound source types. For example, it is impractical to collect pure natural sounds (e.g., thunder) because natural sounds are usually mixed with other sounds.
At least one embodiment of the present disclosure provides a sound source separation method, a model training method of a neural network, a sound source separation apparatus, a model training apparatus of a neural network, and a storage medium. The sound source separation method includes: acquiring mixed audio; determining a sound source tag group corresponding to the mixed audio; determining a condition vector group according to the sound source tag group; and inputting the condition vector group and the mixed audio into a first neural network for sound source separation processing to obtain a target sound source group, wherein the target sound sources in the target sound source group are in one-to-one correspondence with the condition vectors of the condition vector group.
In the sound source separation method, condition vectors corresponding to different sound sources are introduced, so that multiple sound sources corresponding to the condition vectors can be separated from the same mixed audio. This addresses the problem of separating a large number of sound source types with a model trained on weakly labeled data.
Embodiments of the present disclosure will be described in detail below with reference to the accompanying drawings, but the present disclosure is not limited to these specific embodiments. To keep the following description of the embodiments of the present disclosure clear and concise, detailed descriptions of some known functions and components have been omitted from the present disclosure.
The regression-based sound source separation technique is briefly described below.
The regression method based on neural networks can be used to solve problems such as sound source separation or speech enhancement. The sound source separation method provided by the embodiments of the present disclosure also belongs to the regression-based methods. Regression-based methods learn a mapping from the mixed sound source to the target sound source to be separated. For example, the individual sound sources can be represented as s1, s2, …, sk, where k is the number of sound sources and is a positive integer, and each sound source (s1, s2, …, or sk) is represented as a time-domain signal (a time-domain signal describes a mathematical function or a physical signal versus time; e.g., the time-domain waveform of a signal expresses the change of the signal over time). The mixed sound source is represented as the sum of the individual sound sources:

x = s1 + s2 + … + sk.

A current sound source separation system needs to establish a regression mapping F(x) for each individual sound source sk:

F(x) ≈ sk,

where x is the mixed sound source. For the speech enhancement task, sk is the target clean speech. For the sound source separation task, sk may be, for example, a music part or an accompaniment part. The regression mapping F(x) may be modeled in the waveform domain or in the time-frequency (T-F) domain. For example, F(x) can be constructed in the T-F domain.

For example, the mixed sound source x and each individual sound source sk may be converted into X and Sk, respectively, using a short-time Fourier transform (STFT). The magnitude and phase of X are denoted |X| and e^(j∠X), and |X| is called the spectrogram of the mixed sound source x. The spectrogram |X| of the mixed sound source can then be mapped with a neural network to a predicted spectrogram |Ŝk| of the individual sound source, |Ŝk| = F(|X|), and the estimated individual sound source is determined using the phase of the mixed sound source: Ŝk = |Ŝk| · e^(j∠X). Finally, an inverse STFT is applied to Ŝk to obtain the separated individual sound source ŝk. However, current sound source separation systems require clean sound sources for training. Moreover, each sound source separation system can only separate one sound source type, so the number of sound source separation systems would increase linearly with the number of sound sources, which is impractical for separating all sound source types in AudioSet (AudioSet is a large-scale weakly labeled audio data set). The audio clips in AudioSet come from real recorded videos (e.g., YouTube videos). Each audio clip may include a number of different sound events, and there is no clean sound source in the AudioSet data set. An audio segment selected by a sound event detection (SED) system only indicates the presence of a sound event, and several different sound events may occur within the same audio segment, which makes training a sound source separation system difficult.
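The T-F-domain pipeline just described can be summarized in a short sketch. The following Python code is illustrative only and is not part of the patent disclosure: librosa is assumed for the STFT, and `spectrogram_model` is a hypothetical stand-in for the trained regression mapping F.

```python
import numpy as np
import librosa

def separate_tf_domain(x, spectrogram_model, n_fft=1024, hop_length=256):
    """Sketch of spectrogram-regression separation: estimate |S_k| from |X|,
    reuse the mixture phase, and invert with an inverse STFT."""
    X = librosa.stft(x, n_fft=n_fft, hop_length=hop_length)   # complex STFT of the mixture
    mag, phase = np.abs(X), np.exp(1j * np.angle(X))          # |X| and e^(j∠X)
    est_mag = spectrogram_model(mag)                          # |Ŝk| = F(|X|); model is hypothetical
    S_est = est_mag * phase                                   # Ŝk = |Ŝk| · e^(j∠X)
    return librosa.istft(S_est, hop_length=hop_length)        # ŝk in the time domain
```

Reusing the mixture phase e^(j∠X) is what allows the inverse STFT to produce a time-domain waveform even though only the magnitude spectrogram is predicted.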
Fig. 1 is a schematic flow chart of a sound source separation method according to at least one embodiment of the present disclosure, and as shown in fig. 1, the sound source separation method includes steps S10-S13.
For example, as shown in fig. 1, an embodiment of the present disclosure provides a sound source separation method including:
step S10: acquiring mixed audio;
step S11: determining a sound source tag group corresponding to the mixed audio;
step S12: determining a condition vector group according to the sound source label group;
step S13: and inputting the condition vector group and the mixed audio into a first neural network for sound source separation processing to obtain a target sound source group, wherein the target sound sources in the target sound source group are in one-to-one correspondence with the condition vectors of the condition vector group.
In the sound source separation method provided by the embodiment of the present disclosure, a regression method based on a neural network separates a large number of different sound sources by setting condition vectors corresponding to the different sound sources.
For example, in step S10, the mixed audio may include various sounds mixed, and the various sounds may include human speech, singing, natural thunder and rain, musical performance sounds of musical instruments, and the like. For example, the various sounds may be collected by a sound collection device and may be stored using a storage device or the like. For example, mixed audio may also be derived from an AudioSet data set.
For example, in some examples, the mixed audio may include at least two different types of sound.
For example, in some examples, step S10 may include: acquiring original mixed audio; and carrying out spectrum transformation processing on the original mixed audio to obtain the mixed audio.
For example, mixed audio may be represented as a spectrogram of sound.
For example, the original mixed audio may be audio directly captured with a sound collection device. The sound collection device may include various types of microphones, microphone arrays, or other devices capable of collecting sound. The microphone may be, for example, an electret condenser microphone or a micro-electro-mechanical system (MEMS) microphone.
For example, in some examples, spectrally transforming the original mixed audio to obtain the mixed audio includes: and carrying out short-time Fourier transform processing on the original mixed audio to obtain the mixed audio.
For example, a short-time Fourier transform (STFT) may convert the original mixed audio x into X, the magnitude of X being represented as |X|, which is the short-time Fourier spectrogram of the sound, i.e., the mixed audio.
For example, the pitch perceived by the human ear is not linearly related to the actual frequency, and the Mel frequency scale better matches the hearing characteristics of the human ear: the Mel frequency is approximately linear with frequency below 1000 Hz and grows logarithmically above 1000 Hz. Thus, in other examples, performing the spectral transformation processing on the original mixed audio to obtain the mixed audio includes: performing short-time Fourier transform processing on the original mixed audio to obtain intermediate mixed audio; and performing logarithmic Mel-spectrum processing on the intermediate mixed audio to obtain the mixed audio.
In this case, the intermediate mixed audio represents a short-time fourier spectrogram of the sound, and the mixed audio represents a logarithmic mel-frequency spectrogram of the sound.
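As an illustration of the two spectral transformation variants above, the following sketch (an assumption, not part of the patent; librosa is used, and the sampling rate, frame parameters, and Mel-bin count are illustrative) computes a short-time Fourier spectrogram and a logarithmic Mel spectrogram from the original mixed audio.

```python
import numpy as np
import librosa

def stft_spectrogram(x, n_fft=1024, hop_length=256):
    # Short-time Fourier spectrogram |X| of the original mixed audio x.
    return np.abs(librosa.stft(x, n_fft=n_fft, hop_length=hop_length))

def log_mel_spectrogram(x, sr=32000, n_fft=1024, hop_length=256, n_mels=64):
    # Intermediate mixed audio (STFT magnitude) followed by a logarithmic Mel transform.
    mag = stft_spectrogram(x, n_fft, hop_length)
    mel = librosa.feature.melspectrogram(S=mag ** 2, sr=sr, n_mels=n_mels)
    return librosa.power_to_db(mel)  # logarithmic Mel spectrogram of the sound
```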
It should be noted that the original mixed audio may include various noises, so that besides the spectrum transformation processing of the original mixed audio, the speech enhancement, noise reduction, etc. processing may be performed on the original mixed audio to eliminate irrelevant information or noise information in the original mixed audio, thereby obtaining a mixed audio with better quality.
For example, step S11 may include: performing sound event detection on the mixed audio by using a second neural network to determine a sound event group included in the mixed audio; a set of sound source tags is determined from the set of sound events.
For example, the number of sound events in the sound event group is equal to or greater than the number of sound source tags in the sound source tag group.
For example, a sound event group may include a plurality of sound events, and a sound source label group may include a plurality of sound source labels, the plurality of sound source labels being different from one another. In some examples, the plurality of sound events are different and respectively correspond to different sound source types; in this case, the number of the plurality of sound source labels is equal to the number of the plurality of sound events, and the plurality of sound source labels correspond to the plurality of sound events one to one. In other examples, the plurality of sound events are different from each other, but the sound source types corresponding to some of the plurality of sound events may be the same, so that those sound events correspond to the same sound source label; in this case, the number of the plurality of sound source labels is smaller than the number of the plurality of sound events.
It should be noted that, in the present disclosure, if the sound source types corresponding to two sound events are the same, and the occurrence times corresponding to the two sound events are different, the two sound events are considered as two different sound events; if the sound source types corresponding to the two sound events are different and the occurrence times corresponding to the two sound events are the same, the two sound events are considered as two different sound events.
For example, each sound source label represents a sound source type of the corresponding one or more sound events. The sound source types may include: human speech, singing, natural thunder and rain, playing of various instruments, singing of various animals, sounds of machines, etc. It should be noted that the voice of all people speaking can be classified into a sound source type; the voices of different animals can be different sound source types, for example, the voice of a tiger and the voice of a monkey can be different sound source types respectively; the performance sounds of different instruments may be different sound source types, for example, piano sounds and violin sounds are different sound source types, respectively.
It should be noted that, in some embodiments, the mixed audio provided by the present disclosure may also be a clean sound source, that is, only one sound source is included, in this case, the sound event group includes only one sound event, and accordingly, the sound source tag group includes only one sound source tag corresponding to the sound event.
For example, the second neural network may detect the mixed audio using Sound Event Detection (SED). The second neural network may be any suitable network, such as a convolutional neural network, for example, the second neural network may be a convolutional neural network comprising 13 convolutional layers, for example, each convolutional layer may comprise 3 x 3 convolutional kernels. For example, in some examples, the second neural network may be AlexNet, VGGnet, ResNet, or the like.
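A second neural network of this kind could be sketched in PyTorch as below. This is only an illustration: the channel widths and the number of blocks are assumptions and do not reproduce the 13-convolutional-layer configuration mentioned above; only the 3 × 3 kernels and the clip-level tag probabilities follow the description.

```python
import torch
import torch.nn as nn

class SEDNet(nn.Module):
    """Illustrative sound-event-detection CNN: 3x3 convolutions over a
    log-mel spectrogram, global pooling, and one sigmoid output per tag."""
    def __init__(self, n_classes=527):
        super().__init__()
        def block(c_in, c_out):
            return nn.Sequential(
                nn.Conv2d(c_in, c_out, 3, padding=1), nn.BatchNorm2d(c_out), nn.ReLU(),
                nn.MaxPool2d(2))
        self.features = nn.Sequential(block(1, 64), block(64, 128), block(128, 256))
        self.classifier = nn.Linear(256, n_classes)

    def forward(self, x):                          # x: (batch, 1, time, mel_bins)
        h = self.features(x)
        h = h.mean(dim=(2, 3))                     # global average pooling over time and frequency
        return torch.sigmoid(self.classifier(h))  # tag probabilities in [0, 1]
```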
For example, in step S12, in some examples, the condition vector group includes at least one condition vector, the sound source label group includes at least one sound source label, the number of the at least one sound source label is the same as the number of the at least one condition vector, and the at least one condition vector corresponds to the at least one sound source label one to one. Each of the at least one condition vector includes N type probability values, that is, the sound source separation method provided by the embodiments of the present disclosure may identify and separate N sound source types. The type probability value corresponding to the sound source type corresponding to the sound source label corresponding to each condition vector in the N type probability values is a target type probability value, the target type probability value is 1, and the other type probability values except the target type probability value in the N type probability values are all 0, wherein N is a positive integer. Setting the condition vector in this manner makes it possible to separate sound sources each of which is clean.
For example, for an AudioSet data set, which has a large-scale audio data set of 527 sound source types, N is 527. That is, each condition vector may be a one-dimensional vector having 527 elements.
For example, in some embodiments, N sound source types correspond one-to-one to N sound source tags, and the N sound source tags may include a first sound source tag, a second sound source tag, …, and an Nth sound source tag; the N sound source types include a first sound source type corresponding to the first sound source tag, a second sound source type corresponding to the second sound source tag, …, and an Nth sound source type corresponding to the Nth sound source tag. The N type probability values in each condition vector may include a first type probability value, a second type probability value, …, an ith type probability value, …, and an Nth type probability value, and each condition vector may be represented as P = {p1, p2, …, pi, …, pN}, where p1 represents the first type probability value, p2 represents the second type probability value, pi represents the ith type probability value, and pN represents the Nth type probability value, i being a positive integer with 1 ≤ i ≤ N. The first type probability value represents the probability of the first sound source type corresponding to the first sound source tag, the second type probability value represents the probability of the second sound source type corresponding to the second sound source tag, and so on, up to the Nth type probability value, which represents the probability of the Nth sound source type corresponding to the Nth sound source tag. In some examples, the condition vector group includes a first condition vector and a second condition vector. If the first condition vector corresponds to the first sound source tag, then in the first condition vector the first type probability value is the target type probability value, p1 = 1, and the remaining type probability values are all 0, i.e., p2 = 0, …, pi = 0, …, pN = 0. If the second condition vector corresponds to the Nth sound source tag, then in the second condition vector the Nth type probability value is the target type probability value, pN = 1, and the remaining type probability values are all 0, i.e., p1 = 0, p2 = 0, …, pi = 0, ….
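A condition vector of this form is simply a one-hot vector over the N sound source types. The minimal sketch below is an illustration: N = 527 follows the AudioSet example above, and the tag-to-index mapping is an assumption.

```python
import numpy as np

def condition_vector(tag_index, n_types=527):
    """Build a condition vector P = {p1, ..., pN}: the target type
    probability value is 1 and every other type probability value is 0."""
    p = np.zeros(n_types, dtype=np.float32)
    p[tag_index] = 1.0
    return p

# e.g., a condition vector group for two detected sound source tags
condition_vectors = [condition_vector(i) for i in (0, 526)]
```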
For example, in some examples, the condition vector group includes a plurality of condition vectors, and at this time, step S13 may include: determining a plurality of input data sets based on the mixed audio and the plurality of condition vectors; and respectively carrying out sound source separation processing on the plurality of input data sets by utilizing a first neural network to obtain a target sound source set.
For example, the plurality of input data sets correspond one-to-one to the plurality of condition vectors, each of the plurality of input data sets including mixed audio and one of the plurality of condition vectors. That is, the first neural network includes two inputs, respectively, mixed audio and a condition vector. For example, in some examples, the plurality of input data sets includes a first input data set and a second input data set, the plurality of condition vectors includes a first condition vector and a second condition vector, the first input data set includes the first condition vector and mixed audio, and the second input data set includes the second condition vector and mixed audio.
For example, the target sound source group includes a plurality of target sound sources corresponding to a plurality of condition vectors one to one, the plurality of input data groups correspond to a plurality of target sound sources one to one, and each of the plurality of target sound sources corresponds to a condition vector in the input data group corresponding to each of the target sound sources. For example, the plurality of target sound sources includes a first target sound source and a second target sound source, the first target sound source corresponds to the first input data set, the second target sound source corresponds to the second input data set, that is, the first target sound source corresponds to the first condition vector, the second target sound source corresponds to the second condition vector, if the sound source type corresponding to the first condition vector is human sound, and the sound source type corresponding to the second condition vector is piano sound, the first target sound source is human sound, and the second target sound source is piano sound.
For example, the plurality of condition vectors are different from each other, and thus the plurality of target sound sources are different from each other.
For example, each target sound source may comprise at least one audio segment. In some examples, the mixed audio includes a sound event group including M sound events, where M is a positive integer. For example, if the M sound events respectively correspond to different sound source types, the mixed audio is processed to obtain a target sound source group that includes M target sound sources, and each of the M target sound sources includes one audio segment. For another example, if Q sound events of the M sound events correspond to the same sound source type, where Q is a positive integer and Q ≤ M, the mixed audio is processed to obtain a target sound source group that includes (M - Q + 1) target sound sources; the Q sound events correspond to one target sound source, and that target sound source may include Q audio segments.
For example, the time length of each audio piece in the target sound source may be set by the user. For example, the time length of each audio piece may be 2 seconds(s). If each target sound source includes a plurality of audio segments, the time lengths of the plurality of audio segments may be the same.
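Steps S12 and S13 then amount to running the first neural network once per condition vector, each time pairing the same mixed audio with one condition vector. The sketch below assumes a trained PyTorch model `first_nn` whose inputs are a spectrogram tensor and a condition vector tensor; this signature is an assumption, not something fixed by the disclosure.

```python
import torch

def separate_sources(first_nn, mixed_spec, condition_vectors):
    """For each condition vector, form an input data set (mixed audio + condition
    vector) and obtain the corresponding target sound source spectrogram."""
    targets = []
    with torch.no_grad():
        for c in condition_vectors:                              # condition vectors as torch tensors
            est = first_nn(mixed_spec.unsqueeze(0), c.unsqueeze(0))  # one input data set
            targets.append(est.squeeze(0))                       # target sound source for this condition
    return targets                                               # one-to-one with the condition vectors
```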
Fig. 2 is a schematic diagram of a first neural network according to at least one embodiment of the present disclosure.
For example, the first neural network is a U-shaped neural network (U-net), a convolutional neural network, or any other suitable network.
For example, in some examples, the first neural network is a U-Net, and the U-Net includes four encoding layers and four decoding layers, each encoding layer consisting of two convolutional layers and one pooling layer, and each decoding layer consisting of two transposed convolutional layers and one inverse pooling layer. U-Net is a variant of the convolutional neural network and is composed of a plurality of encoding layers and a plurality of decoding layers built from convolutional layers. Each encoding layer halves the size of the feature map but doubles the number of channels, i.e., it encodes the spectrogram into a smaller, deeper feature representation. Each decoding layer decodes the feature map back toward the original size through transposed convolutional layers. In U-Net, a connection is added between the encoding layer and the decoding layer at the same hierarchical level; here, the connection may be, for example, a concatenation (merge), i.e., the feature maps of the same size in the connected encoding layer and decoding layer are concatenated (so that the corresponding feature vectors are merged and the number of channels of that layer is doubled), thereby allowing low-level information to flow directly from the high-resolution input stream to the high-resolution output stream, that is, combining low-level information with the high-level depth representation.
For example, as shown in fig. 2, the four encoding layers are a first encoding layer 20, a second encoding layer 21, a third encoding layer 22, and a fourth encoding layer 23, respectively, and the four decoding layers are a first decoding layer 30, a second decoding layer 31, a third decoding layer 32, and a fourth decoding layer 33, respectively. The first encoding layer 20 includes a convolutional layer CN11, a convolutional layer CN12, and a pooling layer PL11; the second encoding layer 21 includes a convolutional layer CN21, a convolutional layer CN22, and a pooling layer PL12; the third encoding layer 22 includes a convolutional layer CN31, a convolutional layer CN32, and a pooling layer PL13; and the fourth encoding layer 23 includes a convolutional layer CN41, a convolutional layer CN42, and a pooling layer PL14. The first decoding layer 30 includes a transposed convolution layer TC11, a transposed convolution layer TC12, and an inverse pooling layer UP11; the second decoding layer 31 includes a transposed convolution layer TC21, a transposed convolution layer TC22, and an inverse pooling layer UP12; the third decoding layer 32 includes a transposed convolution layer TC31, a transposed convolution layer TC32, and an inverse pooling layer UP13; and the fourth decoding layer 33 includes a transposed convolution layer TC41, a transposed convolution layer TC42, and an inverse pooling layer UP14.
For example, the first encoding layer 20 is connected to the first decoding layer 30, the second encoding layer 21 is connected to the second decoding layer 31, the third encoding layer 22 is connected to the third decoding layer 32, and the fourth encoding layer 23 is connected to the fourth decoding layer 33.
For example, in some examples, the number of channels of one input of the first neural network is 16, as shown in fig. 2, convolutional layer CN11 in the first encoding layer 20 is used to extract features of the input to generate a feature map F11; the convolutional layer CN12 in the first code layer 20 is used for performing a feature extraction operation on the feature map F11 to obtain a feature map F12; the pooling layer PL11 in the first encoding layer 20 is used to perform a downsampling operation on the feature map F12 to obtain the feature map F13.
For example, the feature map F12 may be transmitted to the first decoding layer 30.
For example, the number of channels of the feature map F11, the number of channels of the feature map F12, and the number of channels of the feature map F13 are all the same, for example, 64. For example, in some examples, the dimensions of feature F11 are the same as the dimensions of feature F12, and are both larger than the dimensions of feature F13, e.g., the dimensions of feature F11 are four times the dimensions of feature F13; in other examples, the dimensions of feature F11 are greater than the dimensions of feature F12, the dimensions of feature F12 are greater than the dimensions of feature F13, and the dimensions of feature F12 are four times the dimensions of feature F13, e.g., the dimensions of feature F11 are 570 x 570, the dimensions of feature F12 are 568 x 568, and the dimensions of feature F13 are 284 x 284.
For example, as shown in fig. 2, the convolutional layer CN21 in the second code layer 21 is used to extract the features of the feature map F13 to generate a feature map F21; the convolutional layer CN22 in the second code layer 21 is used for performing a feature extraction operation on the feature map F21 to obtain a feature map F22; the pooling layer PL12 in the second encoding layer 21 is used to perform a downsampling operation on the feature map F22 to obtain the feature map F23.
For example, the feature map F22 may be transmitted to the second decoding layer 31.
For example, the number of channels of the feature map F21, the number of channels of the feature map F22, and the number of channels of the feature map F23 are all the same, for example, 128. For example, in some examples, the dimensions of feature F21 are the same as the dimensions of feature F22, and are both larger than the dimensions of feature F23. For example, the size of feature F21 is four times the size of feature F23; in other examples, the dimensions of feature F21 are greater than the dimensions of feature F22, the dimensions of feature F22 are greater than the dimensions of feature F23, and the dimensions of feature F22 are, for example, four times the dimensions of feature F23.
For example, as shown in FIG. 2, the convolutional layer CN31 in the third code layer 22 is used to extract the features of the feature map F23 to generate a feature map F31; the convolutional layer CN32 in the third coding layer 22 is used for performing a feature extraction operation on the feature map F31 to obtain a feature map F32; the pooling layer PL13 in the third encoding layer 22 is used to perform a downsampling operation on the feature map F32 to obtain the feature map F33.
For example, the feature map F32 may be transmitted to the third decoding layer 32.
For example, the number of channels of the feature map F31, the number of channels of the feature map F32, and the number of channels of the feature map F33 are all the same, for example, 256. For example, in some examples, the dimensions of feature F31 are the same as the dimensions of feature F32, and are both larger than the dimensions of feature F33. For example, the size of feature F31 is four times the size of feature F33; in other examples, the dimensions of feature F31 are greater than the dimensions of feature F32, the dimensions of feature F32 are greater than the dimensions of feature F33, and the dimensions of feature F32 are, for example, four times the dimensions of feature F33.
For example, as shown in fig. 2, the convolutional layer CN41 in the fourth code layer 23 is used to extract the features of the feature map F33 to generate a feature map F41; the convolutional layer CN42 in the fourth code layer 23 is used for performing a feature extraction operation on the feature map F41 to obtain a feature map F42; the pooling layer PL14 in the fourth encoding layer 23 is used to perform a downsampling operation on the feature map F42 to obtain the feature map F43.
For example, the feature map F42 may be transmitted to the fourth decoding layer 33.
For example, the number of channels of the feature map F41, the number of channels of the feature map F42, and the number of channels of the feature map F43 are all the same, for example, 512. For example, in some examples, the dimensions of feature F41 are the same as the dimensions of feature F42, and are both larger than the dimensions of feature F43. For example, the size of feature F41 is four times the size of feature F43; in other examples, the dimensions of feature F41 are greater than the dimensions of feature F42, the dimensions of feature F42 are greater than the dimensions of feature F43, and the dimensions of feature F42 are, for example, four times the dimensions of feature F43.
For example, as shown in fig. 2, the first neural network further comprises an encoding output layer 25 and a decoding input layer 26, the encoding output layer 25 may be connected with the decoding input layer 26, the encoding output layer 25 is further connected with the fourth encoding layer 23, and the decoding input layer 26 is further connected with the fourth decoding layer 33. The encoding output layer 25 includes a convolutional layer CN51, and the decoding input layer 26 includes a convolutional layer CN52. The convolutional layer CN51 in the encoding output layer 25 is used to perform a feature extraction operation on the feature map F43 to generate a feature map F51. The feature map F51 is output to the decoding input layer 26, and the convolutional layer CN52 in the decoding input layer 26 is used to perform a feature extraction operation on the feature map F51 to obtain a feature map F52.
For example, the number of channels of the feature map F51 may be 1024, and the number of channels of the feature map F52 may be 512. For example, in some examples, the dimensions of feature F43 are greater than the dimensions of feature F51, the dimensions of feature F51 are greater than the dimensions of feature F52; for another example, in other examples, the dimensions of feature F43, feature F51, and feature F52 are the same.
For example, as shown in fig. 2, the inverse pooling UP14 of the fourth decoding layer 33 is used to perform an upsampling operation on the feature map F52 to obtain a feature map F53; the signature F53 and the signature F42 transmitted by the fourth encoding layer 23 may be combined, and then the transposed convolutional layer TC41 of the fourth decoding layer 33 performs a deconvolution operation on the combined signature F53 and signature F42 to obtain a signature F61; the transposed convolutional layer TC42 of the fourth decoding layer 33 is used to perform a deconvolution operation on the feature map F61 to obtain and output a feature map F62 to the third decoding layer 32.
For example, the number of channels of feature F42, the number of channels of feature F53, and the number of channels of feature F61 may be the same, e.g., all 512, while the number of channels of feature F62 may be 256. For example, in some examples, the dimensions of feature F42, feature F53, feature F61, and feature F62 may be the same; in other examples, the dimensions of feature F42 are greater than the dimensions of feature F53, the dimensions of feature F53 are greater than the dimensions of feature F61, and the dimensions of feature F61 are greater than the dimensions of feature F62, e.g., the dimensions of feature F42 are 64 x 64, the dimensions of feature F53 are 56 x 56, the dimensions of feature F61 are 54 x 54, and the dimensions of feature F62 are 52 x 52.
For example, as shown in fig. 2, the inverse pooling UP13 of the third decoding layer 32 is used to perform an upsampling operation on the feature map F62 to obtain a feature map F63; the signature F63 and the signature F32 transmitted by the third encoding layer 22 may be combined, and then the transposed convolutional layer TC31 of the third decoding layer 32 performs a deconvolution operation on the combined signature F63 and signature F32 to obtain a signature F71; the transposed convolutional layer TC32 of the third decoding layer 32 is used to perform deconvolution operation on the feature map F71 to obtain and output the feature map F72 to the second decoding layer 31.
For example, the number of channels for profile F32, the number of channels for profile F63, and the number of channels for profile F71 may be the same, e.g., 256 for each, while the number of channels for profile F72 is 128. For example, in some examples, the dimensions of feature F32, feature F63, feature F71, and feature F72 may be the same; in other examples, the dimensions of feature F32 are greater than the dimensions of feature F63, the dimensions of feature F63 are greater than the dimensions of feature F71, and the dimensions of feature F71 are greater than the dimensions of feature F72.
For example, as shown in fig. 2, the inverse pooling UP12 of the second decoding layer 31 is used to perform an upsampling operation on the feature map F72 to obtain a feature map F73; the signature F73 and the signature F22 transmitted by the second coding layer 21 may be combined, and then the transposed convolutional layer TC21 of the second decoding layer 31 performs a deconvolution operation on the combined signature F73 and signature F22 to obtain a signature F81; the transposed convolutional layer TC22 of the second decoding layer 31 is used to perform deconvolution operation on the feature map F81 to obtain and output the feature map F82 to the first decoding layer 30.
For example, the number of channels for profile F22, the number of channels for profile F73, and the number of channels for profile F81 may be the same, e.g., each is 128, while the number of channels for profile F82 is 64. For example, in some examples, the dimensions of feature F22, feature F73, feature F81, and feature F82 may be the same; in other examples, the dimensions of feature F22 are greater than the dimensions of feature F73, the dimensions of feature F73 are greater than the dimensions of feature F81, and the dimensions of feature F81 are greater than the dimensions of feature F82.
For example, as shown in fig. 2, the inverse pooling UP11 of the first decoding layer 30 is used to perform an upsampling operation on the feature map F82 to obtain a feature map F83; the signature F83 and the signature F12 transmitted by the first encoding layer 20 may be combined, and then the transposed convolutional layer TC11 of the first decoding layer 30 performs a deconvolution operation on the combined signature F83 and signature F12 to obtain a signature F91; the transposed convolutional layer TC12 of the first decoding layer 30 is used to perform a deconvolution operation on the feature map F91 to obtain and output a feature map F92.
For example, the number of channels for profile F12, the number of channels for profile F83, and the number of channels for profile F91 may be the same, e.g., 64 for each, while the number of channels for profile F92 is 32. For example, in some examples, the dimensions of feature F12, feature F83, feature F91, and feature F92 may be the same; in other examples, the dimensions of feature F12 are greater than the dimensions of feature F83, the dimensions of feature F83 are greater than the dimensions of feature F91, and the dimensions of feature F91 are greater than the dimensions of feature F92.
For example, as shown in fig. 2, in some examples, the first neural network further includes an output layer including convolutional layer CN6, which convolutional layer CN6 may include one 1 × 1 convolution kernel. Convolutional layer CN6 is used to perform a convolution operation on feature map F92 to obtain the output of the first neural network, and the number of channels of the output of the first neural network may be 1.
It should be noted that, for the detailed description of the structure, function, etc. of the U-net, reference may be made to the related contents of the U-net in the prior art, and the detailed description is not provided herein.
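For illustration, a compressed PyTorch rendering of a U-Net of the kind shown in fig. 2 is given below. It keeps the four encoding/decoding levels, the doubling of channel counts (64, 128, 256, 512, with a 1024-channel bottleneck), the same-level skip connections, and the final 1 × 1 convolution, but it is a sketch: the condition-vector input and the exact feature-map sizes of fig. 2 are omitted, and two 3 × 3 convolutions per level are assumed.

```python
import torch
import torch.nn as nn

def double_conv(c_in, c_out):
    # Two 3x3 convolutions, as in each encoding/decoding layer of fig. 2.
    return nn.Sequential(
        nn.Conv2d(c_in, c_out, 3, padding=1), nn.ReLU(),
        nn.Conv2d(c_out, c_out, 3, padding=1), nn.ReLU())

class UNet(nn.Module):
    """Illustrative 4-level U-Net: each encoding layer halves the feature map
    and doubles the channels; each decoding layer upsamples and concatenates
    the same-level encoder feature map (skip connection)."""
    def __init__(self, in_ch=1, out_ch=1, widths=(64, 128, 256, 512)):
        super().__init__()
        self.encoders = nn.ModuleList()
        c = in_ch
        for w in widths:
            self.encoders.append(double_conv(c, w))
            c = w
        self.pool = nn.MaxPool2d(2)
        self.bottleneck = double_conv(widths[-1], widths[-1] * 2)
        self.upsamplers = nn.ModuleList()
        self.decoders = nn.ModuleList()
        c = widths[-1] * 2
        for w in reversed(widths):
            self.upsamplers.append(nn.ConvTranspose2d(c, w, 2, stride=2))
            self.decoders.append(double_conv(w * 2, w))   # *2 from the skip concatenation
            c = w
        self.out = nn.Conv2d(widths[0], out_ch, 1)        # 1x1 output convolution

    def forward(self, x):
        skips = []
        for enc in self.encoders:
            x = enc(x)
            skips.append(x)            # feature map passed to the same-level decoder
            x = self.pool(x)
        x = self.bottleneck(x)
        for up, dec, skip in zip(self.upsamplers, self.decoders, reversed(skips)):
            x = up(x)
            x = dec(torch.cat([x, skip], dim=1))
        return self.out(x)
```

The input spectrogram's time and frequency dimensions are assumed to be divisible by 16 so that the four pooling/unpooling stages line up with the skip connections.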
Still other embodiments of the present disclosure provide a method for model training of a neural network. Fig. 3 is a schematic flowchart of a model training method for a neural network according to at least one embodiment of the present disclosure.
For example, as shown in fig. 3, a method for training a neural network model provided in an embodiment of the present disclosure includes:
step S20: acquiring a training sample set;
step S21: and training the first neural network to be trained by utilizing the training sample set to obtain the first neural network.
For example, the first neural network obtained by using the model training method provided by the embodiments of the present disclosure may be applied to the sound source separation method described in any of the embodiments above.
For example, in step S20, the training sample set includes a plurality of training data sets, each of the training data sets includes a training mixed audio, a plurality of training audio segments, and a plurality of first training condition vectors, the training mixed audio includes a plurality of training audio segments, and the plurality of first training condition vectors correspond to the plurality of training audio segments one to one.
For example, the plurality of training audio segments are different from each other, and the types of sound sources corresponding to the plurality of training audio segments are different from each other. The time lengths of the plurality of training audio segments are all the same.
For example, the training mixed audio is obtained by mixing a plurality of training audio segments, and the time length of the training mixed audio may be the sum of the time lengths of the plurality of training audio segments.
For example, the plurality of first training condition vectors are different from each other.
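The contents of one training data set can be made concrete with a small container type. The sketch below is an assumption (the field names and the NumPy representation are not specified by the disclosure); following the statement that the training mixed audio's time length may be the sum of the segments' time lengths, the segments are concatenated along time here, although the embodiment does not fix how the mixing is performed.

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class TrainingDataSet:
    """One training data set: a training mixed audio, its training audio
    segments, and the first training condition vectors (one per segment)."""
    mixed_audio: np.ndarray   # shape (total_samples,)
    segments: list            # list of np.ndarray, one per sound source type
    condition_vectors: list   # list of one-hot condition vectors, one per segment

def make_training_data_set(segments, condition_vectors):
    # The mixed audio's duration is the sum of the segment durations
    # (realized here by concatenation along time; an assumption).
    mixed = np.concatenate(segments)
    return TrainingDataSet(mixed, list(segments), list(condition_vectors))
```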
For example, in step S21, the first neural network to be trained includes a loss function.
For example, in some embodiments, step S21 may include: acquiring a current training data set from a training sample set, wherein the current training data set comprises a current training mixed audio and a plurality of current training audio segments, and the current training mixed audio comprises a plurality of current training audio segments; determining a plurality of first current training condition vectors in one-to-one correspondence with the plurality of current training audio segments, wherein the current training data set further comprises the plurality of first current training condition vectors; inputting the current training mixed audio and a plurality of first current training condition vectors into a first neural network to be trained for sound source separation processing to obtain a plurality of first current training target sound sources; calculating a first loss value of a loss function of a first neural network to be trained according to a plurality of first current training target sound sources and a plurality of current training audio segments; and correcting parameters of the first neural network to be trained according to the first loss value, obtaining the trained first neural network when the loss function meets a preset condition, and continuously inputting the current training data set to repeatedly execute the training process when the loss function does not meet the preset condition.
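For illustration, a minimal PyTorch-style sketch of one such training iteration for the first loss value is given below. The use of an L1 loss, the assumption that the separated output and the reference segment have the same shape, and all function and variable names are illustrative assumptions; the loss function of the first neural network to be trained is not limited thereto.

```python
import torch
import torch.nn.functional as F

def training_step_first_loss(model, optimizer, mixed_audio, segments, condition_vectors):
    """One iteration of the first loss value: the current training mixed audio and each
    first current training condition vector are fed to the first neural network to be
    trained, and each separated output is compared with the corresponding segment.

    mixed_audio:       spectrogram of the current training mixed audio
    segments:          spectrograms of the current training audio segments
    condition_vectors: the first current training condition vectors
    """
    optimizer.zero_grad()
    loss = 0.0
    for segment, cond in zip(segments, condition_vectors):
        separated = model(mixed_audio, cond)          # first current training target sound source
        loss = loss + F.l1_loss(separated, segment)   # L1 loss is an assumption for illustration
    loss.backward()
    optimizer.step()  # correct the parameters of the first neural network to be trained
    return loss.item()
```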
For example, the current training mixed audio may be obtained by mixing a plurality of current training audio segments, and the time length of the current training mixed audio may be the sum of the time lengths of the plurality of current training audio segments.
For example, the plurality of current training audio segments are all the same length in time. For example, the temporal length of each current training audio segment may be 2 seconds, 3 seconds, and so on. The time length of each current training audio segment may be set by a user, and the time length of each current training audio segment is not limited by the present disclosure.
For example, in some examples, the plurality of current training audio segments are different from each other, and the sound source types corresponding to the plurality of current training audio segments are different from each other, at this time, the plurality of first current training condition vectors are also different from each other. It should be noted that each current training audio segment corresponds to only one sound source type.
For example, the plurality of first current training target sound sources are different from each other, and the plurality of first current training target sound sources correspond to the plurality of first current training condition vectors one to one.
For example, each current training audio segment may include only one sound event, or may include a plurality of sound events (that is, the current training audio segment is formed by mixing a plurality of sound events), and the sound source types corresponding to the plurality of sound events may be different.
For example, in further embodiments, step S21 further includes: inputting a plurality of current training audio frequency segments and a plurality of first current training condition vectors into a first neural network to be trained for sound source separation processing to obtain a plurality of second current training target sound sources, wherein the plurality of second current training target sound sources correspond to the plurality of current training audio frequency segments one by one; calculating a second loss value of a loss function of the first neural network to be trained according to a plurality of second current training target sound sources and a plurality of current training audio segments; and correcting the parameters of the first neural network to be trained according to the second loss value, obtaining the trained first neural network when the loss function meets the preset condition, and continuously inputting the current training data set to repeatedly execute the training process when the loss function does not meet the preset condition.
For example, the plurality of second current training target sound sources are different from each other.
For example, the plurality of second current training target sound sources are all the same in time length.
For example, each training data set further includes a plurality of second training condition vectors in one-to-one correspondence with the plurality of training audio pieces. The current training data set further includes a plurality of second current training condition vectors, the plurality of current training audio segments and the plurality of second current training condition vectors correspond one-to-one, and the second current training condition vector corresponding to each current training audio segment is different from the first current training condition vector corresponding to each current training audio segment.
For example, in other embodiments, step S21 further includes: inputting a plurality of current training audio frequency segments and a plurality of second current training condition vectors into a first neural network to be trained for sound source separation processing to obtain a plurality of third current training target sound sources, wherein the plurality of third current training target sound sources correspond to the plurality of current training audio frequency segments one by one; calculating a third loss value of a loss function of the first neural network to be trained according to a plurality of third current training target sound sources and the all-zero vector; and correcting the parameters of the first neural network to be trained according to the third loss value, obtaining the trained first neural network when the loss function meets the preset condition, and continuously inputting the current training data set to repeatedly execute the training process when the loss function does not meet the preset condition.
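The three loss values described above can be combined in a single iteration, as in the following sketch; the L1 loss, the equal weighting of the three terms, and all names are assumptions for illustration only.

```python
import torch
import torch.nn.functional as F

def combined_training_step(model, optimizer, mixed_audio, segments,
                           first_conditions, second_conditions):
    """Sketch of one iteration combining the first, second and third loss values."""
    optimizer.zero_grad()
    loss = 0.0
    for seg, c_first, c_second in zip(segments, first_conditions, second_conditions):
        # First loss: separate the segment from the mixed audio, conditioned on c_first.
        loss = loss + F.l1_loss(model(mixed_audio, c_first), seg)
        # Second loss: identity - a segment conditioned on its own vector should be
        # reproduced, which reduces distortion of the separated signal.
        loss = loss + F.l1_loss(model(seg, c_first), seg)
        # Third loss: zero mapping - a segment conditioned on the (different) second
        # condition vector should produce an all-zero output (silence).
        loss = loss + F.l1_loss(model(seg, c_second), torch.zeros_like(seg))
    loss.backward()
    optimizer.step()
    return loss.item()
```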
For example, a plurality of third current training target sound sources are different from each other.
For example, the time lengths of the plurality of third current training target sound sources are all the same, and the time length of the all-zero vector may be the same as the time length of each third current training target sound source.
For example, in embodiments of the present disclosure, in some examples, the predetermined condition corresponds to a minimization of a loss function of the first neural network with a certain number of training data sets input. In other examples, the predetermined condition is that the number of training times or training cycles of the first neural network reaches a predetermined number, which may be millions, as long as the training data set is sufficiently large.
For example, in some embodiments, obtaining the current training mixed audio comprises: respectively acquiring a plurality of current training audio clips; performing joint processing on a plurality of current training audio segments to obtain a first intermediate current training mixed audio; and carrying out spectrum transformation processing on the first intermediate current training mixed audio to obtain the current training mixed audio.
For example, respectively obtaining a plurality of current training audio segments includes: processing the plurality of current training audio clips by using a second neural network to obtain a plurality of current training target sound anchors; based on the current training target sound anchors, the current training audio clips are segmented to obtain current training audio segments corresponding to the current training target sound anchors one to one.
For example, each current training audio segment includes a current training target sound anchor corresponding to each current training audio segment.
For example, the second neural network is a neural network that has been trained. The second neural network may utilize sound event detection techniques to detect the current training audio clip.
For example, a plurality of current training audio clips may be obtained from an AudioSet data set.
For example, the plurality of current training audio clips may be the same length of time, e.g., 10 seconds each.
For example, processing the plurality of current training audio clips with the second neural network to obtain a plurality of current training target sound anchors includes: for each current training audio clip in the plurality of current training audio clips, processing the current training audio clip by using the second neural network to obtain a current training sound anchor vector corresponding to the current training audio clip; determining at least one current training audio segment, of the plurality of current training audio segments, corresponding to the current training audio clip; and, according to the at least one current training audio segment, selecting at least one current training target sound anchor corresponding to the current training audio clip from the current training sound anchor vector, thereby obtaining the plurality of current training target sound anchors.
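For illustration, the selection of a current training target sound anchor from a current training sound anchor vector could look like the following sketch, in which the output shape of the second neural network and all names are assumptions.

```python
import torch

def compute_anchor_vector(sed_model, clip_waveform):
    """Run the (already trained) second neural network on a current training audio clip
    and reduce its time-distributed output o(t) in [0, 1]^K to a per-class anchor:
    the highest probability of each class and the time step at which it occurs."""
    with torch.no_grad():
        o_t = sed_model(clip_waveform)      # assumed shape (T, K): class probabilities over time
    probs, time_steps = o_t.max(dim=0)      # per-class maximum probability and its time index
    return probs, time_steps                # probs can serve as the sound anchor vector

def select_target_anchor(probs, time_steps, target_class):
    """Pick the current training target sound anchor for the sound source type of the clip."""
    return probs[target_class].item(), time_steps[target_class].item()
```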
For example, after processing the plurality of current training audio clips, the second neural network may obtain a plurality of current training sound anchor vectors, and the plurality of current training sound anchor vectors correspond to the plurality of current training audio clips one to one. The plurality of current training sound anchor vectors have the same dimension.
For example, each current training sound anchor vector is a one-dimensional vector.
For example, each current training sound anchor vector may include R current training sound anchors; for example, R may be 527 when the plurality of current training audio clips are all taken from the AudioSet data set.
For example, the plurality of current training audio clips are different from each other, and the plurality of current training audio clips are in one-to-one correspondence with the plurality of current training audio segments and also in one-to-one correspondence with the plurality of current training target sound anchors. That is, only one current training audio segment is truncated from each current training audio clip.
For example, each current training target sound anchor may include a probability value corresponding to the current training target sound anchor and a point in time of the current training target sound anchor on the current training audio clip corresponding to the current training target sound anchor.
For example, the temporal length of each current training audio segment is less than the temporal length of the corresponding current training audio clip.
For example, the plurality of current training audio clips includes a first current training audio clip, the plurality of current training audio segments includes a first current training audio segment, and the first current training audio segment is truncated from the first current training audio clip. The sound source type corresponding to the first current training audio clip is piano sound. The plurality of current training sound anchor vectors includes a first current training sound anchor vector, and the first current training sound anchor vector corresponds to the first current training audio clip. The plurality of current training target sound anchors includes a first current training target sound anchor corresponding to the first current training audio clip. For example, the time length of the first current training audio segment may be denoted as t.
For a first current training audio clip, firstly, processing the first current training audio clip by using a second neural network to obtain a first current training sound anchor vector corresponding to the first current training audio clip; and according to the first current training audio clip, selecting a first current training target sound anchor corresponding to the first current training audio clip from the first current training sound anchor vector, wherein the first current training target sound anchor represents an anchor with the highest probability of occurrence of the piano sound in the first current training audio clip.
Then, based on the first current training target sound anchor, the first current training audio clip is segmented to obtain a first current training audio segment corresponding to the first current training target sound anchor. For example, the first current training audio clip may be truncated toward both sides for a time length of t/2, centered on the first current training target sound anchor, to obtain the first current training audio segment. It should be noted that, if the time length from the first current training target sound anchor to the first endpoint of the first current training audio clip and the time length from the first current training target sound anchor to the second endpoint of the first current training audio clip are both greater than or equal to t/2, the midpoint of the first current training audio segment is the first current training target sound anchor; if the time length from the first current training target sound anchor to the first endpoint of the first current training audio clip is t/3, and t/3 is less than t/2, a length of t/3 is cut from the side of the first current training target sound anchor close to the first endpoint of the first current training audio clip and a length of 2t/3 is cut from the side of the first current training target sound anchor close to the second endpoint of the first current training audio clip, so as to obtain the first current training audio segment; in this case, the midpoint of the first current training audio segment is located on the side of the first current training target sound anchor close to the second endpoint of the first current training audio clip.
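A minimal sketch of this anchor-centered truncation, including the shifting of the window near the endpoints, is given below; it works in sample indices, assumes the clip is at least one segment length long, and all names are illustrative.

```python
def crop_segment_around_anchor(clip, anchor_index, segment_length):
    """Cut a current training audio segment of segment_length samples from a clip,
    centered on the anchor where possible. If the anchor is closer than
    segment_length / 2 to either endpoint, the window is shifted so that the full
    length is still taken, as in the t/3 example above."""
    half = segment_length // 2
    start = anchor_index - half
    # Shift the window when the anchor is too close to the first endpoint ...
    if start < 0:
        start = 0
    # ... or too close to the second endpoint.
    if start + segment_length > len(clip):
        start = len(clip) - segment_length
    return clip[start:start + segment_length]
```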
For example, "jointly processing a plurality of current training audio segments" may mean sequentially arranging the plurality of current training audio segments in time to obtain a complete first intermediate current training mixed audio.
It should be noted that, when a current training audio clip includes a plurality of sound events, and the sound events correspond to different sound source types, the sound events include a target sound event, and the sound source type corresponding to the target sound event is the sound source type corresponding to the current training audio clip, that is, if the sound source type corresponding to the current training audio clip is guitar sound, the target sound event is guitar sound. The current training audio segment is the audio segment with the highest probability of the occurrence of the target sound event (i.e., guitar sound) in the current training audio clip corresponding to the current training audio segment. When a current training audio segment includes only one sound event, the one sound event is the target sound event.
For example, in some embodiments, determining a plurality of first current training condition vectors comprises: respectively processing the plurality of current training audio clips by using the second neural network to obtain a plurality of current training sound anchor vectors in one-to-one correspondence with the plurality of current training audio clips, wherein the plurality of current training sound anchor vectors are used as the plurality of first current training condition vectors.
For example, each element in each first current training condition vector may have a value range of [0, 1].
It should be noted that, in other embodiments, each first current training condition vector may also be a vector composed of 0 or 1, that is, each first current training condition vector includes G current training type probability values, a current training type probability value corresponding to a sound source type corresponding to a current training audio segment corresponding to the first current training condition vector in the G current training type probability values is a target current training type probability value, the target current training type probability value is 1, and the remaining current training type probability values except for the target current training type probability value in the G current training type probability values are all 0.
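For illustration, such a 0/1 (one-hot) first current training condition vector could be constructed as follows; the class index and the value G = 527 in the usage line are purely illustrative.

```python
import torch

def one_hot_condition_vector(target_class, num_classes):
    """Sketch of a first current training condition vector composed of 0s and 1s:
    the type probability value for the sound source type of the corresponding audio
    segment is 1 and all other type probability values are 0."""
    c = torch.zeros(num_classes)
    c[target_class] = 1.0
    return c

# e.g. G = 527 sound source types, target type index 5 (illustrative values only)
c = one_hot_condition_vector(target_class=5, num_classes=527)
```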
For example, the current training mixed audio is represented as a spectrogram.
For example, in some embodiments, spectrally transforming the first intermediate current training mixed audio to obtain the current training mixed audio comprises: and carrying out short-time Fourier transform processing on the first intermediate current training mixed audio to obtain the current training mixed audio.
For example, in further embodiments, spectrally transforming the first intermediate current training mixed audio to obtain the current training mixed audio comprises: performing short-time Fourier transform processing on the first intermediate current training mixed audio to obtain a second intermediate current training mixed audio; and carrying out logarithmic Mel frequency spectrum processing on the second intermediate current training mixed audio to obtain the current training mixed audio.
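A sketch of this spectrum transformation (short-time Fourier transform followed by a logarithmic mel spectrum) using librosa is shown below; the window size, hop length, number of mel bins, and sampling rate are assumptions for illustration.

```python
import numpy as np
import librosa

def current_training_mixed_audio(intermediate_mixed_waveform, sample_rate=32000,
                                 n_fft=1024, hop_length=256, n_mels=64):
    """Sketch of the spectrum transformation of the first intermediate current
    training mixed audio into the current training mixed audio."""
    # Second intermediate current training mixed audio: STFT magnitude spectrogram.
    stft = np.abs(librosa.stft(intermediate_mixed_waveform, n_fft=n_fft, hop_length=hop_length))
    # Current training mixed audio: logarithmic mel spectrum of the STFT.
    mel = librosa.feature.melspectrogram(S=stft ** 2, sr=sample_rate, n_mels=n_mels)
    log_mel = librosa.power_to_db(mel)
    return log_mel
```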
Fig. 4 is a schematic diagram of a first current training audio clip, a first current training audio anchor vector corresponding to the first current training audio clip, a second current training audio clip, and a second current training audio anchor vector corresponding to the second current training audio clip according to at least one embodiment of the present disclosure. Fig. 5 is a schematic diagram of separating a sound source from a mixture of a first current training audio segment and a second current training audio segment according to at least one embodiment of the present disclosure.
For example, in a particular embodiment, as shown in fig. 4, the plurality of current training audio clips includes a first current training audio clip AU1 and a second current training audio clip AU2. The plurality of current training sound anchor vectors include a first current training sound anchor vector Pro1 and a second current training sound anchor vector Pro2. The first current training sound anchor vector Pro1 corresponds to the first current training audio clip AU1, and the second current training sound anchor vector Pro2 corresponds to the second current training audio clip AU2. The plurality of current training audio segments comprises a first current training audio segment s1 and a second current training audio segment s2, the first current training audio segment s1 being truncated from the first current training audio clip AU1, and the second current training audio segment s2 being truncated from the second current training audio clip AU2.
For example, as shown in fig. 4, the time length of the first current training audio clip AU1 and the time length of the second current training audio clip AU2 are both 10 seconds(s).
For example, the first current training audio clip AU1 may include a gunshot, and the sound source type corresponding to the first current training audio segment s1 is gunshot, that is, an audio segment related to the gunshot needs to be cut from the first current training audio clip AU1; the second current training audio clip AU2 may include a bell sound, and the sound source type corresponding to the second current training audio segment s2 is bell sound, that is, the audio segment related to the bell sound needs to be cut from the second current training audio clip AU2.
It should be noted that the first current training sound anchor vector Pro1 in fig. 4 only shows the probability distribution of the gunshot sound, and the second current training sound anchor vector Pro2 only shows the probability distribution of the bell sound.
For example, the first current training target sound anchor in the first current training sound anchor vector Pro1 corresponding to the first current training audio segment s1 is tt1, and the second current training target sound anchor in the second current training sound anchor vector Pro2 corresponding to the second current training audio segment s2 is tt2. According to the first current training target sound anchor tt1, the first current training audio segment s1 can be cut from the first current training audio clip AU1 based on a preset time length (e.g., 2 seconds); according to the second current training target sound anchor tt2, the second current training audio segment s2 can be cut from the second current training audio clip AU2 based on the preset time length.
For example, as shown in fig. 5, in the model training process, the first neural network to be trained may be used to perform sound source separation processing on a current training mixed audio (obtained by mixing the first current training audio segment s1 and the second current training audio segment s2) and a first current training condition vector cj, so as to obtain a first current training target sound source sj. In the example shown in fig. 5, the first current training condition vector cj may be the condition vector corresponding to the first current training audio segment s1. If the training of the first neural network to be trained is completed, the first current training target sound source sj and the first current training audio segment s1 should be the same; if the training of the first neural network to be trained is not completed, a first loss value of the loss function of the first neural network to be trained may be calculated according to the first current training target sound source sj and the first current training audio segment s1, and the parameters of the first neural network to be trained may then be corrected according to the first loss value.
The model training method of the present disclosure is explained below by taking the examples shown in fig. 4 and 5 as examples.
For example, in the embodiments of the present disclosure, one sound separation system may be constructed for separating all types of sound sources. First, audio clips corresponding to two sound source types (e.g., gunshot and bell) are randomly selected from the AudioSet data set, namely the first current training audio clip AU1 and the second current training audio clip AU2 shown in fig. 4. For each current training audio clip (either the first current training audio clip AU1 or the second current training audio clip AU2), a sound event detection system may be applied to detect the current training audio segment that contains a sound event (i.e., the first current training audio segment s1 or the second current training audio segment s2 shown in fig. 4). Then, a first current training condition vector c1 may be set for the first current training audio segment s1, and a first current training condition vector c2 may be set for the second current training audio segment s2. During model training, the sound source separation system can be described as:
sj = f(s1 + s2, cj), where f(s1 + s2, cj) represents the sound source separation system, s1 + s2 represents the current training mixed audio, j is 1 or 2, and sj represents the first current training target sound source corresponding to the first current training condition vector cj. The above equation shows that the separation result sj depends on the input current training mixed audio and the first current training condition vector cj. The first current training condition vector cj should contain information of the sound source to be separated, i.e., the first current training target sound source.
It should be noted that the AudioSet data set is weakly labeled. That is, in the AudioSet data set, each 10-second audio clip (i.e., the first current training audio clip AU1 or the second current training audio clip AU2) is only labeled with the presence or absence of sound events, without the times at which the sound events occur. However, the sound source separation system requires that the spectrogram corresponding to each audio segment used for training (i.e., the first current training audio segment s1 or the second current training audio segment s2 shown in fig. 4) contain a sound event, and the audio clip containing the sound event (i.e., the first current training audio clip AU1 or the second current training audio clip AU2) provides no time information for that audio segment. To address this problem, the weakly labeled AudioSet data set may be used to train a sound event detection system. For a given sound event, the sound event detection system is used to detect the point in time at which the respective sound event occurs in the 10-second audio clip. Then, a corresponding audio segment containing the sound event is cut out based on that point in time to train the sound source separation system.
To train the sound event detection system with the weakly labeled data in the AudioSet data set, a log mel spectrogram is used as the feature of an audio clip, and a neural network is applied to the log mel spectrogram to predict the probability of sound events occurring over time. Specifically, a time-distributed fully-connected layer is applied to the feature map of the last convolutional layer so that a certain number of sound classes can be output from the fully-connected layer, and a sigmoid function (an S-shaped nonlinear growth curve) is then applied to these outputs to predict the probability of a sound event occurring over time (i.e., the time-distributed prediction probability). The time-distributed prediction probability o(t) satisfies o(t) ∈ [0, 1]^K, t = 1, …, T, where T is the number of time steps of the time-distributed fully-connected layer and K represents the number of sound categories. In training, a clip-level probability prediction is obtained by pooling the time-distributed prediction probabilities o(t) over time, for example by taking the maximum over time; that is, the pooling function may be a maximum pooling function. The audio segment selected by the max-pooling function has a higher accuracy and contains sound events rather than unrelated sounds. In addition, a two-class cross-entropy loss function (i.e., a binary cross-entropy loss function) may be employed to calculate the loss value of the sound event detection system.
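The idea described above, namely convolutional feature extraction followed by a time-distributed fully-connected layer, a sigmoid, and max pooling over time, is sketched below in PyTorch. The two-layer convolutional stack is only a placeholder for the actual network, and all sizes and names are assumptions.

```python
import torch
import torch.nn as nn

class SoundEventDetector(nn.Module):
    """Sketch of the sound event detection idea: convolutional features, a
    time-distributed fully-connected layer, a sigmoid giving o(t) in [0, 1]^K,
    and max pooling over time for the clip-level prediction."""
    def __init__(self, num_classes=527, n_mels=64):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(1, 64, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv2d(64, 64, kernel_size=3, padding=1), nn.ReLU(),
        )
        # Time-distributed fully-connected layer over the channel/frequency axes.
        self.fc = nn.Linear(64 * n_mels, num_classes)

    def forward(self, log_mel):                         # log_mel: (batch, 1, T, n_mels)
        h = self.conv(log_mel)                          # (batch, 64, T, n_mels)
        h = h.permute(0, 2, 1, 3).flatten(start_dim=2)  # (batch, T, 64 * n_mels)
        o_t = torch.sigmoid(self.fc(h))                 # time-distributed probabilities o(t)
        clip_prob = o_t.max(dim=1).values               # max pooling over time
        return o_t, clip_prob

# Training would use a binary cross-entropy loss on the clip-level prediction, e.g.:
# loss = nn.functional.binary_cross_entropy(clip_prob, weak_labels)
```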
For example, a 10-second audio clip in the AudioSet data set is used as input to the sound event detection system, resulting in the time-distributed prediction of sound events o(t) (the vector Pro1 or the vector Pro2 shown in fig. 4). For a given sound category, the time step with the highest probability is selected as the anchor point. Then, an audio segment centered on the anchor point (s1 or s2 described above) is selected to train the sound source separation system.
For example, in the embodiments of the present disclosure, a clean sound source is not necessarily required for training the sound source separation system; by correctly setting the condition vectors, the sound source separation system can be trained based on the audio segments obtained from the AudioSet data set (i.e., the first current training audio segment s1 and the second current training audio segment s2). For example, the first current training sound anchor vector Pro1 obtained by processing the first current training audio clip AU1 with the sound event detection system is used as the condition vector of the first current training audio segment s1 (i.e., the first current training condition vector c1 described above), and the second current training sound anchor vector Pro2 obtained by processing the second current training audio clip AU2 with the sound event detection system is used as the condition vector of the second current training audio segment s2 (i.e., the first current training condition vector c2 described above). The first current training sound anchor vector Pro1 may represent the sound events included in the first current training audio clip AU1 and their probabilities of presence, and the second current training sound anchor vector Pro2 may represent the sound events included in the second current training audio clip AU2 and their probabilities of presence. The first current training sound anchor vector Pro1 and the second current training sound anchor vector Pro2 may reflect the sound events in the first current training audio segment s1 and the second current training audio segment s2 better than the labels of the first current training audio clip AU1 or the second current training audio clip AU2. In training, the following regression can be learned for the sound source separation system:
in the above equations (1) to (3), j is 1 or 2. Equation (1) represents learning from the current training mixed audio to the first current training target sound source sj conditioned on the first current training condition vector cj. Equation (2) represents learning the identity mapping, i.e., the sound separation system should learn the output conditioned on itself, whereby the distortion of the separated signal can be reduced. Equation (3) represents a zero mapping, that is, if the system is running with a second current training condition vector c that is different from the first current training condition vector cj-jConditional, then all zero vectors 0 (i.e. no sound) should be output.
For example, a training audio clip of arbitrary length is provided, and first, an audio tagging system is used to predict whether a sound event is present in the training audio clip. Then, a list of sound categories to which the training audio clips correspond is obtained. For each sound k in the candidate list, the training condition vector ck is set to {0, …,0, 1,0, …, 0}, where the k-th element of ck is 1 and the other elements are 0. Based on this training condition vector ck, a clean target sound source can be separated even if the audio segment used for training in the training process is a segment including a plurality of sound events (i.e., an unclean audio segment).
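A sketch of this inference-time procedure, in which a one-hot condition vector ck is built for each candidate sound class k and passed to the separation network, is given below; the tagging model, its output format, and all names are assumptions.

```python
import torch

def separate_sound_class(model, tagging_model, mixed_audio, k, num_classes=527):
    """Sketch: the tagging system predicts which sound classes are present, and a
    one-hot condition vector ck is built for candidate class k and passed to the
    sound source separation network to obtain the clean target sound source."""
    with torch.no_grad():
        presence = tagging_model(mixed_audio)  # probability of each sound class being present
        ck = torch.zeros(num_classes)
        ck[k] = 1.0                            # the k-th element of ck is 1, the others are 0
        target = model(mixed_audio, ck)        # separated target sound source for class k
    return presence, target
```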
For example, in some embodiments, experiments on the above-described sound source separation system are conducted on the AudioSet data set, which is a large-scale audio data set having 527 sound classes. The training set of the AudioSet data set includes 2,063,839 audio clips, together with a balanced subset of 22,050 audio clips, and the evaluation set of the AudioSet data set includes 20,371 audio clips. Most of the audio clips have a duration (i.e., time length) of 10 seconds, and all of the audio clips are converted to mono at a sampling rate of 32 kHz. The sound event detection system is trained on the complete training set; the convolutional neural network used for the sound event detection system includes 13 layers, and a time-distributed fully-connected layer is applied to the last convolutional layer to obtain the time-distributed probability distribution of sound events. The remaining settings of the sound event detection system, such as the kernel sizes and channel numbers of the convolutional layers and the optimizer and learning rate used for training, may be set according to actual needs.
For an audio clip containing a certain sound category, obtaining an anchor point by acquiring the maximum prediction probability of a sound event corresponding to the sound category from a sound event detection system; an audio segment is then determined based on the anchor point (the audio segment comprises 5 adjacent time frames, such that the temporal length of the audio segment is 1.6 seconds).
The inputs to the U-net network are the current training mixed audio (which is a mixture of the first current training audio segment s1 and the second current training audio segment s2 and is represented as a spectrogram, obtainable by applying a short-time Fourier transform to the sound waveform with a window size of 1024 and a hop size of 256) and a first current training condition vector.
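For illustration, the spectrogram input to the U-net could be computed as in the following sketch, using the window size of 1024 and hop size of 256 mentioned above; the use of torch.stft with a Hann window is an implementation assumption.

```python
import torch

def unet_input_spectrogram(waveform, window_size=1024, hop=256):
    """Sketch of turning the mixed waveform (s1 + s2) into the magnitude
    spectrogram fed to the U-net."""
    spec = torch.stft(waveform, n_fft=window_size, hop_length=hop,
                      window=torch.hann_window(window_size), return_complex=True)
    return spec.abs()  # magnitude spectrogram
```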
Some embodiments of the present disclosure also provide a sound source separating apparatus. Fig. 6 is a schematic block diagram of a sound source separation apparatus according to at least one embodiment of the present disclosure.
As shown in fig. 6, the sound source separating device 60 includes a memory 610 and a processor 620. The memory 610 is used for non-transitory storage of computer readable instructions. The processor 620 is configured to execute computer readable instructions, and the computer readable instructions are executed by the processor 620 to perform the sound source separation method provided by any of the embodiments of the present disclosure.
For example, the memory 610 and the processor 620 may be in direct or indirect communication with each other. For example, components such as the memory 610 and the processor 620 may communicate over a network connection. The network may include a wireless network, a wired network, and/or any combination of wireless and wired networks. The network may include a local area network, the Internet, a telecommunications network, an Internet of Things (Internet of Things) based on the Internet and/or a telecommunications network, and/or any combination thereof, and/or the like. The wired network may communicate by using twisted pair, coaxial cable, or optical fiber transmission, for example, and the wireless network may communicate by using 3G/4G/5G mobile communication network, bluetooth, Zigbee, or WiFi, for example. The present disclosure is not limited herein as to the type and function of the network.
For example, the processor 620 may control other components in the sound source separating device 60 to perform desired functions. The processor 620 may be a Central Processing Unit (CPU), a Tensor Processing Unit (TPU), a Graphics Processing Unit (GPU), or the like having data processing capability and/or program execution capability. The Central Processing Unit (CPU) may be of an X86 or ARM architecture, etc. The GPU may be integrated directly onto the motherboard separately, or built into the north bridge chip of the motherboard.
For example, memory 610 may include any combination of one or more computer program products that may include various forms of computer-readable storage media, such as volatile memory and/or non-volatile memory. Volatile memory can include, for example, Random Access Memory (RAM), cache memory (or the like). The non-volatile memory may include, for example, Read Only Memory (ROM), a hard disk, an Erasable Programmable Read Only Memory (EPROM), a portable compact disc read only memory (CD-ROM), USB memory, flash memory, and the like. One or more computer instructions may be stored on the memory 610 and executed by the processor 620 to implement various functions. Various applications and various data, such as a training data set, training reference tags, and various data used and/or generated by the applications, may also be stored in the memory 610.
For example, in some embodiments, the sound source separation device 60 may further include a sound collection device, which may be, for example, a microphone or the like. The sound collection device is used for collecting original mixed audio.
For example, for a detailed description of the processing procedure of the sound source separation method, reference may be made to the related description in the above embodiment of the sound source separation method, and repeated descriptions are omitted.
It should be noted that the sound source separation device provided in the embodiments of the present disclosure is illustrative and not restrictive, and the sound source separation device may further include other conventional components or structures according to practical application needs, for example, in order to implement the necessary function of the sound source separation device, a person skilled in the art may set other conventional components or structures according to a specific application scenario, and the embodiments of the present disclosure are not limited thereto.
For technical effects of the sound source separation device provided by the embodiments of the present disclosure, reference may be made to the corresponding description of the sound source separation method in the foregoing embodiments, and details are not repeated herein.
Some embodiments of the present disclosure also provide a model training device. Fig. 7 is a schematic block diagram of a model training apparatus according to at least one embodiment of the present disclosure.
As shown in FIG. 7, model training apparatus 70 includes a memory 710 and a processor 720. The memory 710 is used for non-transitory storage of computer readable instructions. The processor 720 is configured to execute computer-readable instructions, and the computer-readable instructions are executed by the processor 720 to perform a model training method according to any of the embodiments of the present disclosure.
For example, the processor 720 may control other components in the model training device 70 to perform desired functions. The processor 720 may be a Central Processing Unit (CPU), a Tensor Processing Unit (TPU), a Graphics Processing Unit (GPU), or the like having data processing and/or program execution capabilities. The Central Processing Unit (CPU) may be of an X86 or ARM architecture, etc. The GPU may be integrated directly onto the motherboard separately, or built into the north bridge chip of the motherboard.
For example, the memory 710 may include any combination of one or more computer program products, which may include various forms of computer-readable storage media, such as volatile memory and/or non-volatile memory. Volatile memory can include, for example, Random Access Memory (RAM), cache memory, or the like. The non-volatile memory may include, for example, Read Only Memory (ROM), a hard disk, an Erasable Programmable Read Only Memory (EPROM), a portable compact disc read only memory (CD-ROM), USB memory, flash memory, and the like. One or more computer-readable instructions may be stored on the memory 710 and executed by the processor 720 to implement the various functions of the model training device 70.
It should be noted that, for the detailed description of the process of performing the model training by the model training apparatus 70, reference may be made to the related description in the embodiment of the model training method, and repeated descriptions are omitted here.
Some embodiments of the present disclosure also provide a storage medium. Fig. 8 is a schematic block diagram of a storage medium provided in at least one embodiment of the present disclosure. For example, as shown in FIG. 8, one or more computer readable instructions 801 may be stored non-temporarily on a storage medium 800. For example, some of the computer readable instructions 801, when executed by a computer, may perform one or more steps of a sound source separation method according to the above; another portion of the computer readable instructions 801, when executed by a computer, may perform one or more steps of a method of training a model according to the above.
For example, the storage medium 800 may be applied to the sound source separating device 60 and/or the model training device 70 described above, and may be, for example, the memory 610 in the sound source separating device 60 and/or the memory 710 in the model training device 70.
For example, the description of the storage medium 800 may refer to the description of the memory in the embodiment of the sound source separating device 60 and/or the model training device 70, and the repeated description is omitted.
Referring now to fig. 9, a schematic diagram of an electronic device (e.g., an electronic device may include the sound source separation apparatus described in the above embodiments) 600 suitable for implementing embodiments of the present disclosure is shown. The electronic devices in the embodiments of the present disclosure may include, but are not limited to, mobile terminals such as mobile phones, notebook computers, digital broadcast receivers, PDAs (personal digital assistants), PADs (tablet computers), PMPs (portable multimedia players), in-vehicle terminals (e.g., car navigation terminals), and the like, and fixed terminals such as digital TVs, desktop computers, and the like. The electronic device shown in fig. 9 is only an example, and should not bring any limitation to the functions and the scope of use of the embodiments of the present disclosure.
As shown in fig. 9, the electronic device 600 may include a processing means (e.g., a central processing unit, a graphics processor, etc.) 601 that may perform various appropriate actions and processes according to a program stored in a Read Only Memory (ROM) 602 or a program loaded from a storage means 608 into a Random Access Memory (RAM) 603. In the RAM 603, various programs and data necessary for the operation of the electronic apparatus 600 are also stored. The processing device 601, the ROM 602, and the RAM 603 are connected to each other via a bus 604. An input/output (I/O) interface 605 is also connected to bus 604.
Generally, the following devices may be connected to the I/O interface 605: input devices 606 including, for example, a touch screen, touch pad, keyboard, mouse, camera, microphone, accelerometer, gyroscope, etc.; output devices 607 including, for example, a Liquid Crystal Display (LCD), a speaker, a vibrator, and the like; storage devices 608 including, for example, magnetic tape, hard disk, etc.; and a communication device 609. The communication device 609 may allow the electronic device 600 to communicate with other devices wirelessly or by wire to exchange data. While fig. 9 illustrates an electronic device 600 having various means, it is to be understood that not all illustrated means are required to be implemented or provided. More or fewer devices may alternatively be implemented or provided.
In particular, according to an embodiment of the present disclosure, the processes described above with reference to the flowcharts may be implemented as computer software programs. For example, embodiments of the present disclosure include a computer program product comprising a computer program carried on a non-transitory computer readable medium, the computer program containing program code for performing the method illustrated by the flow chart. In such embodiments, the computer program may be downloaded and installed from a network through the communication device 609, or installed from the storage device 608, or installed from the ROM 602. The computer program, when executed by the processing device 601, performs the above-described functions defined in the methods of the embodiments of the present disclosure.
It should be noted that in the context of this disclosure, a computer-readable medium may be a tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. The computer readable medium may be a computer readable signal medium or a computer readable storage medium or any combination of the two. The computer-readable storage medium may be, for example, but is not limited to: an electrical, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination thereof. More specific examples of the computer readable storage medium may include, but are not limited to: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the present disclosure, a computer readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. In contrast, in the present disclosure, a computer readable signal medium may comprise a propagated data signal with computer readable program code embodied therein, either in baseband or as part of a carrier wave. Such a propagated data signal may take many forms, including, but not limited to, electro-magnetic, optical, or any suitable combination thereof. A computer readable signal medium may also be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device. Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to: electrical wires, optical cables, RF (radio frequency), etc., or any suitable combination of the foregoing.
In some embodiments, the clients and servers may communicate using any currently known or future developed network protocol, such as HTTP (HyperText Transfer Protocol), and may be interconnected with any form or medium of digital data communication (e.g., a communications network). Examples of communication networks include a local area network ("LAN"), a wide area network ("WAN"), an internetwork (e.g., the Internet), and peer-to-peer networks (e.g., ad hoc peer-to-peer networks), as well as any currently known or future developed network.
The computer readable medium may be embodied in the electronic device; or may exist separately without being assembled into the electronic device.
Computer program code for carrying out operations for the present disclosure may be written in any combination of one or more programming languages, including but not limited to object oriented programming languages such as Java, Smalltalk, and C++, and conventional procedural programming languages, such as the "C" programming language or similar programming languages. The program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on the remote computer or server. In the case of a remote computer, the remote computer may be connected to the user's computer through any type of network, including a Local Area Network (LAN) or a Wide Area Network (WAN), or the connection may be made to an external computer (for example, through the Internet using an Internet service provider).
The flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present disclosure. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
The units described in the embodiments of the present disclosure may be implemented by software or hardware. Where the name of a unit does not in some cases constitute a limitation of the unit itself, for example, the first retrieving unit may also be described as a "unit for retrieving at least two internet protocol addresses".
The functions described herein above may be performed, at least in part, by one or more hardware logic components. For example, without limitation, exemplary types of hardware logic components that may be used include: field Programmable Gate Arrays (FPGAs), Application Specific Integrated Circuits (ASICs), Application Specific Standard Products (ASSPs), systems on a chip (SOCs), Complex Programmable Logic Devices (CPLDs), and the like.
According to one or more embodiments of the present disclosure, a sound source separation method includes: acquiring mixed audio; determining a sound source tag group corresponding to the mixed audio; determining a condition vector group according to the sound source label group; and inputting the condition vector group and the mixed audio into a first neural network for sound source separation processing to obtain a target sound source group, wherein the target sound sources in the target sound source group correspond to the condition vectors of the condition vector group one by one.
According to one or more embodiments of the present disclosure, the condition vector group includes a plurality of condition vectors, and inputting the condition vector group and the mixed audio to the first neural network for sound source separation processing to obtain the target sound source group includes: determining a plurality of input data sets according to the mixed audio and a plurality of condition vectors, wherein the plurality of input data sets correspond to the plurality of condition vectors in a one-to-one mode, and each of the plurality of input data sets comprises the mixed audio and one condition vector in the plurality of condition vectors; and respectively carrying out sound source separation processing on the plurality of input data groups by utilizing a first neural network to obtain a target sound source group, wherein the target sound source group comprises a plurality of target sound sources which are in one-to-one correspondence with a plurality of condition vectors, the plurality of input data groups are in one-to-one correspondence with the plurality of target sound sources, and each target sound source in the plurality of target sound sources corresponds to the condition vector in the input data group corresponding to each target sound source.
In accordance with one or more embodiments of the present disclosure, the plurality of condition vectors are different from one another.
According to one or more embodiments of the present disclosure, acquiring mixed audio includes: acquiring original mixed audio; and carrying out spectrum transformation processing on the original mixed audio to obtain the mixed audio.
According to one or more embodiments of the present disclosure, performing a spectral transform process on original mixed audio to obtain mixed audio includes: and carrying out short-time Fourier transform processing on the original mixed audio to obtain the mixed audio.
According to one or more embodiments of the present disclosure, performing a spectral transform process on original mixed audio to obtain mixed audio includes: carrying out short-time Fourier transform processing on the original mixed audio to obtain intermediate mixed audio; the intermediate mixed audio is subjected to logarithmic mel-frequency spectrum processing to obtain mixed audio.
According to one or more embodiments of the present disclosure, determining a sound source tag group corresponding to mixed audio includes: performing sound event detection on the mixed audio by using a second neural network to determine a sound event group included in the mixed audio; a set of sound source tags is determined from the set of sound events.
According to one or more embodiments of the present disclosure, the condition vector group includes at least one condition vector, the sound source tag group includes at least one sound source tag, the at least one condition vector corresponds to the at least one sound source tag one to one, each condition vector in the at least one condition vector includes N type probability values, a type probability value corresponding to a sound source type corresponding to the sound source tag corresponding to each condition vector in the N type probability values is a target type probability value, the target type probability value is 1, and the remaining type probability values except the target type probability value in the N type probability values are all 0, where N is a positive integer.
According to one or more embodiments of the present disclosure, the first neural network is a U-shaped neural network.
According to one or more embodiments of the present disclosure, a model training method of a neural network includes: the method comprises the steps of obtaining a training sample set, wherein the training sample set comprises a plurality of training data sets, each training data set comprises training mixed audio, a plurality of training audio segments and a plurality of first training condition vectors, the training mixed audio comprises a plurality of training audio segments, and the plurality of first training condition vectors correspond to the plurality of training audio segments one to one; training a first neural network to be trained by using a training sample set to obtain the first neural network, wherein the first neural network to be trained comprises a loss function, and training the first neural network to be trained by using the training sample set to obtain the first neural network comprises: acquiring a current training data set from a training sample set, wherein the current training data set comprises a current training mixed audio and a plurality of current training audio segments, and the current training mixed audio comprises a plurality of current training audio segments; determining a plurality of first current training condition vectors which are in one-to-one correspondence with a plurality of current training audio segments, wherein the current training data set further comprises a plurality of first current training condition vectors, and inputting the current training mixed audio and the plurality of first current training condition vectors into a first neural network to be trained for sound source separation processing to obtain a plurality of first current training target sound sources; calculating a first loss value of a loss function of a first neural network to be trained according to a plurality of first current training target sound sources and a plurality of current training audio segments; and correcting parameters of the first neural network to be trained according to the first loss value, obtaining the trained first neural network when the loss function meets a preset condition, and continuously inputting the current training data set to repeatedly execute the training process when the loss function does not meet the preset condition.
According to one or more embodiments of the present disclosure, training the first neural network to be trained by using the training sample set to obtain the first neural network further includes: inputting a plurality of current training audio frequency segments and a plurality of first current training condition vectors into a first neural network to be trained for sound source separation processing to obtain a plurality of second current training target sound sources, wherein the plurality of second current training target sound sources correspond to the plurality of current training audio frequency segments one by one; calculating a second loss value of a loss function of the first neural network to be trained according to a plurality of second current training target sound sources and a plurality of current training audio segments; and correcting the parameters of the first neural network to be trained according to the second loss value, obtaining the trained first neural network when the loss function meets the preset condition, and continuously inputting the current training data set to repeatedly execute the training process when the loss function does not meet the preset condition.
According to one or more embodiments of the present disclosure, each training data set further includes a plurality of second training condition vectors corresponding to the plurality of training audio segments one to one, the current training data set further includes a plurality of second current training condition vectors, the plurality of current training audio segments correspond to the plurality of second current training condition vectors one to one, the second current training condition vector corresponding to each current training audio segment is different from the first current training condition vector corresponding to each current training audio segment, the training of the first neural network to be trained is performed by using the training sample set, so as to obtain the first neural network, further including: inputting a plurality of current training audio frequency segments and a plurality of second current training condition vectors into a first neural network to be trained for sound source separation processing to obtain a plurality of third current training target sound sources, wherein the plurality of third current training target sound sources correspond to the plurality of current training audio frequency segments one by one; calculating a third loss value of a loss function of the first neural network to be trained according to a plurality of third current training target sound sources and the all-zero vector; and correcting the parameters of the first neural network to be trained according to the third loss value, obtaining the trained first neural network when the loss function meets the preset condition, and continuously inputting the current training data set to repeatedly execute the training process when the loss function does not meet the preset condition.
In accordance with one or more embodiments of the present disclosure, obtaining the current training mixed audio includes: respectively acquiring a plurality of current training audio clips; performing joint processing on the plurality of current training audio segments to obtain a first intermediate current training mixed audio; and carrying out spectrum transformation processing on the first intermediate current training mixed audio to obtain the current training mixed audio.
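One plausible reading of the joint processing step is a simple overlay (sum) of the equal-length time-domain segments; the NumPy sketch below is an assumption for illustration, and the subsequent spectrum transformation is shown separately in the STFT/log-Mel sketch further down.

```python
import numpy as np

def mix_segments(segments):
    """Overlay equal-length time-domain segments into one mixture waveform
    (the 'first intermediate current training mixed audio')."""
    segments = [np.asarray(s, dtype=np.float32) for s in segments]
    mixture = np.sum(segments, axis=0)
    # Optional peak normalization to avoid clipping after summation.
    peak = np.max(np.abs(mixture))
    return mixture / peak if peak > 0 else mixture
```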
According to one or more embodiments of the present disclosure, respectively obtaining the plurality of current training audio segments includes: processing a plurality of current training audio clips by using a second neural network to obtain a plurality of current training target sound anchors; and segmenting the plurality of current training audio clips based on the plurality of current training target sound anchors to obtain the plurality of current training audio segments in one-to-one correspondence with the plurality of current training target sound anchors, wherein each current training audio segment includes the current training target sound anchor corresponding to that current training audio segment.
In accordance with one or more embodiments of the present disclosure, processing the plurality of current training audio clips with the second neural network to respectively obtain the plurality of current training target sound anchors includes: for each current training audio clip in the plurality of current training audio clips, processing the current training audio clip using the second neural network to obtain a current training sound anchor vector corresponding to the current training audio clip; determining at least one current training audio segment, among the plurality of current training audio segments, that corresponds to the current training audio clip; and according to the at least one current training audio segment, selecting at least one current training target sound anchor corresponding to the current training audio clip from the current training sound anchor vector, thereby obtaining the plurality of current training target sound anchors.
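A minimal sketch of how the anchor selection and segmentation described above might look, assuming the second neural network is a frame-level sound event detection model whose output for a clip is a matrix of per-frame class probabilities; taking the highest-probability frame of the target class as the sound anchor and cutting a fixed-length window around it are illustrative choices, not requirements of the disclosure.

```python
import numpy as np

def select_anchor_and_segment(clip, frame_probs, class_idx, sr,
                              hop_length=512, seg_seconds=2.0):
    """clip: 1-D waveform; frame_probs: (n_frames, n_classes) output of a
    hypothetical frame-level sound event detector; class_idx: index of the
    target sound class. Returns (anchor_sample, segment) so that the selected
    sound anchor lies inside the returned segment."""
    frame_probs = np.asarray(frame_probs, dtype=np.float32)
    anchor_frame = int(np.argmax(frame_probs[:, class_idx]))  # current training target sound anchor
    anchor_sample = anchor_frame * hop_length
    half = int(seg_seconds * sr / 2)
    start = max(0, anchor_sample - half)
    end = min(len(clip), anchor_sample + half)
    return anchor_sample, clip[start:end]
```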
According to one or more embodiments of the present disclosure, the plurality of current training audio clips are different from each other, and the plurality of current training audio clips correspond one-to-one to the plurality of current training audio segments and also correspond one-to-one to the plurality of current training target sound anchors.
In accordance with one or more embodiments of the present disclosure, determining the plurality of first current training condition vectors includes: respectively processing the plurality of current training audio clips by using the second neural network to obtain a plurality of current training sound anchor vectors in one-to-one correspondence with the plurality of current training audio clips, wherein the plurality of current training sound anchor vectors are used as the plurality of first current training condition vectors.
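As one assumed realization of a sound anchor vector used as a condition vector, the frame-level detector output for a clip can be pooled over time into a single clip-level vector; the averaging below is an illustrative choice rather than a step mandated by the disclosure.

```python
import numpy as np

def clip_condition_vector(frame_probs):
    """Average frame-level event probabilities of shape (n_frames, n_classes)
    into a clip-level sound anchor vector, used here as the first current
    training condition vector for the corresponding segment."""
    return np.mean(np.asarray(frame_probs, dtype=np.float32), axis=0)
```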
According to one or more embodiments of the present disclosure, performing the spectrum transformation processing on the first intermediate current training mixed audio to obtain the current training mixed audio includes: carrying out short-time Fourier transform processing on the first intermediate current training mixed audio to obtain the current training mixed audio.
According to one or more embodiments of the present disclosure, performing the spectrum transformation processing on the first intermediate current training mixed audio to obtain the current training mixed audio includes: performing short-time Fourier transform processing on the first intermediate current training mixed audio to obtain a second intermediate current training mixed audio; and carrying out logarithmic Mel frequency spectrum processing on the second intermediate current training mixed audio to obtain the current training mixed audio.
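The two spectral front ends described above (a plain STFT, or an STFT followed by a logarithmic Mel spectrum) can be sketched with librosa; the FFT size, hop length, and number of Mel bands below are illustrative defaults rather than values given in the disclosure.

```python
import numpy as np
import librosa

def stft_features(mixture, n_fft=1024, hop_length=512):
    """Short-time Fourier transform of the intermediate mixed audio (magnitude)."""
    return np.abs(librosa.stft(mixture, n_fft=n_fft, hop_length=hop_length))

def log_mel_features(mixture, sr, n_fft=1024, hop_length=512, n_mels=64):
    """STFT followed by a logarithmic Mel spectrum, as in the second variant."""
    mel = librosa.feature.melspectrogram(
        y=mixture, sr=sr, n_fft=n_fft, hop_length=hop_length, n_mels=n_mels)
    return librosa.power_to_db(mel)
```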
According to one or more embodiments of the present disclosure, the time lengths of the plurality of current training audio segments are all the same.
According to one or more embodiments of the present disclosure, a sound source separation apparatus includes: a memory for non-transitory storage of computer readable instructions; and a processor for executing the computer readable instructions, wherein the computer readable instructions, when executed by the processor, perform the sound source separation method according to any of the above embodiments.
According to one or more embodiments of the present disclosure, a model training apparatus includes: a memory for non-transitory storage of computer readable instructions; and a processor for executing the computer readable instructions, wherein the computer readable instructions, when executed by the processor, perform the model training method according to any of the above embodiments.
According to one or more embodiments of the present disclosure, a storage medium non-transitorily stores computer readable instructions which, when executed by a computer, can perform the sound source separation method according to any of the above embodiments.
The foregoing description merely illustrates preferred embodiments of the present disclosure and the principles of the technology employed. Those skilled in the art will appreciate that the scope of the disclosure is not limited to technical solutions formed by the particular combinations of features described above, but also covers other technical solutions formed by any combination of the above features or their equivalents without departing from the concept of the disclosure, for example, technical solutions formed by replacing the above features with (but not limited to) features having similar functions disclosed in the present disclosure.
Further, while operations are depicted in a particular order, this should not be understood as requiring that such operations be performed in the particular order shown or in sequential order. Under certain circumstances, multitasking and parallel processing may be advantageous. Likewise, while several specific implementation details are included in the above discussion, these should not be construed as limitations on the scope of the disclosure. Certain features that are described in the context of separate embodiments can also be implemented in combination in a single embodiment. Conversely, various features that are described in the context of a single embodiment can also be implemented in multiple embodiments separately or in any suitable subcombination.
Although the subject matter has been described in language specific to structural features and/or methodological acts, it is to be understood that the subject matter defined in the appended claims is not necessarily limited to the specific features or acts described above. Rather, the specific features and acts described above are disclosed as example forms of implementing the claims.
For the present disclosure, there are also the following points to be explained:
(1) The drawings of the embodiments of the present disclosure relate only to the structures involved in those embodiments; other structures may follow conventional designs.
(2) For clarity, the thicknesses and dimensions of layers or structures may be exaggerated in the drawings used to describe the embodiments of the present disclosure. It will be understood that when an element such as a layer, film, region, or substrate is referred to as being "on" or "under" another element, it can be directly "on" or "under" the other element, or intervening elements may be present.
(3) Without conflict, embodiments of the present disclosure and features of the embodiments may be combined with each other to arrive at new embodiments.
The above description is only for the specific embodiments of the present disclosure, but the scope of the present disclosure is not limited thereto, and the scope of the present disclosure should be subject to the scope of the claims.
Claims (23)
1. A sound source separation method, comprising:
acquiring mixed audio;
determining a sound source tag group corresponding to the mixed audio;
determining a condition vector group according to the sound source tag group;
and inputting the condition vector group and the mixed audio into a first neural network for sound source separation processing to obtain a target sound source group, wherein target sound sources in the target sound source group are in one-to-one correspondence with condition vectors in the condition vector group.
2. The sound source separation method according to claim 1, wherein the condition vector group includes a plurality of condition vectors,
and inputting the condition vector group and the mixed audio into the first neural network for sound source separation processing to obtain the target sound source group comprises:
determining a plurality of input data sets according to the mixed audio and the plurality of condition vectors, wherein the plurality of input data sets correspond to the plurality of condition vectors in a one-to-one manner, and each of the plurality of input data sets comprises the mixed audio and one condition vector of the plurality of condition vectors;
utilizing the first neural network to respectively perform sound source separation processing on the plurality of input data sets to obtain the target sound source group, wherein the target sound source group comprises a plurality of target sound sources in one-to-one correspondence with the plurality of condition vectors, the plurality of input data sets are in one-to-one correspondence with the plurality of target sound sources, and each target sound source in the plurality of target sound sources corresponds to the condition vector in the input data set corresponding to that target sound source.
3. The sound source separation method according to claim 2, wherein the plurality of condition vectors are different from each other.
4. The sound source separation method according to claim 1, wherein acquiring the mixed audio includes:
acquiring original mixed audio;
and carrying out spectrum transformation processing on the original mixed audio to obtain the mixed audio.
5. The sound source separation method according to claim 4, wherein the subjecting the original mixed audio to the spectral transform processing to obtain the mixed audio comprises:
carrying out short-time Fourier transform processing on the original mixed audio to obtain the mixed audio.
6. The sound source separation method according to claim 4, wherein the subjecting the original mixed audio to the spectral transform processing to obtain the mixed audio comprises:
carrying out short-time Fourier transform processing on the original mixed audio to obtain intermediate mixed audio;
and carrying out logarithmic Mel frequency spectrum processing on the intermediate mixed audio to obtain the mixed audio.
7. The sound source separation method according to claim 1, wherein determining the sound source tag group corresponding to the mixed audio comprises:
performing sound event detection on the mixed audio by utilizing a second neural network to determine a sound event group included in the mixed audio;
and determining the sound source tag group according to the sound event group.
8. The sound source separation method according to claim 1, wherein the condition vector group includes at least one condition vector, the sound source tag group includes at least one sound source tag, the at least one condition vector corresponds one-to-one to the at least one sound source tag,
each condition vector in the at least one condition vector comprises N type probability values, wherein, among the N type probability values, the type probability value corresponding to the sound source type indicated by the sound source tag corresponding to each condition vector is a target type probability value, the target type probability value is 1, and the type probability values other than the target type probability value among the N type probability values are all 0, where N is a positive integer (an illustrative sketch of such a one-hot condition vector is given after the claims).
9. The sound source separation method according to any one of claims 1 to 8, wherein the first neural network is a U-shaped neural network.
10. A method of model training of a neural network, comprising:
obtaining a training sample set, wherein the training sample set comprises a plurality of training data sets, each training data set comprises training mixed audio, a plurality of training audio segments and a plurality of first training condition vectors, the training mixed audio comprises the plurality of training audio segments, and the plurality of first training condition vectors are in one-to-one correspondence with the plurality of training audio segments;
training a first neural network to be trained by using the training sample set to obtain a first neural network, wherein the first neural network to be trained comprises a loss function,
wherein, training the first neural network to be trained by using the training sample set to obtain the first neural network comprises:
obtaining a current training data set from the training sample set, wherein the current training data set comprises a current training mixed audio and a plurality of current training audio segments, and the current training mixed audio comprises the plurality of current training audio segments;
determining a plurality of first current training condition vectors in one-to-one correspondence with the plurality of current training audio segments, wherein the current training data set further includes the plurality of first current training condition vectors,
inputting the current training mixed audio and the plurality of first current training condition vectors into the first neural network to be trained for sound source separation processing to obtain a plurality of first current training target sound sources;
calculating a first loss value of a loss function of the first neural network to be trained according to the first current training target sound sources and the current training audio segments;
and correcting parameters of the first neural network to be trained according to the first loss value, obtaining the trained first neural network when the loss function meets a preset condition, and continuously inputting the current training data set to repeatedly execute the training process when the loss function does not meet the preset condition.
11. The model training method of claim 10, wherein training the first neural network to be trained using the training sample set to obtain a first neural network further comprises:
inputting the plurality of current training audio segments and the plurality of first current training condition vectors into the first neural network to be trained for sound source separation processing to obtain a plurality of second current training target sound sources, wherein the plurality of second current training target sound sources are in one-to-one correspondence with the plurality of current training audio segments;
calculating a second loss value of the loss function of the first neural network to be trained according to the second current training target sound sources and the current training audio segments;
and correcting the parameters of the first neural network to be trained according to the second loss value, obtaining the trained first neural network when the loss function meets the preset condition, and continuously inputting the current training data set to repeatedly execute the training process when the loss function does not meet the preset condition.
12. The model training method according to claim 10 or 11, wherein each training data set further includes a plurality of second training condition vectors in one-to-one correspondence with the plurality of training audio pieces,
the current training data set further includes a plurality of second current training condition vectors, the plurality of current training audio segments are in one-to-one correspondence with the plurality of second current training condition vectors, and the second current training condition vector corresponding to each current training audio segment is different from the first current training condition vector corresponding to that current training audio segment,
training the first neural network to be trained by using the training sample set to obtain a first neural network, further comprising:
inputting the plurality of current training audio segments and the plurality of second current training condition vectors into the first neural network to be trained for sound source separation processing to obtain a plurality of third current training target sound sources, wherein the plurality of third current training target sound sources are in one-to-one correspondence with the plurality of current training audio segments;
calculating a third loss value of the loss function of the first neural network to be trained according to the plurality of third current training target sound sources and the all-zero vector;
and correcting the parameters of the first neural network to be trained according to the third loss value, obtaining the trained first neural network when the loss function meets the preset condition, and continuously inputting the current training data set to repeatedly execute the training process when the loss function does not meet the preset condition.
13. The model training method of claim 10, wherein obtaining the current training mixed audio comprises:
respectively acquiring a plurality of current training audio clips;
performing joint processing on the plurality of current training audio segments to obtain a first intermediate current training mixed audio;
and carrying out spectrum transformation processing on the first intermediate current training mixed audio to obtain the current training mixed audio.
14. The model training method of claim 13, wherein obtaining the plurality of current training audio segments, respectively, comprises:
processing the plurality of current training audio clips by using a second neural network to obtain a plurality of current training target sound anchors;
based on the plurality of current training target sound anchors, performing segmentation processing on the plurality of current training audio clips to obtain the plurality of current training audio segments in one-to-one correspondence with the plurality of current training target sound anchors, wherein each current training audio segment includes the current training target sound anchor corresponding to that current training audio segment.
15. The model training method of claim 14, wherein processing the plurality of current training audio clips with the second neural network to obtain the plurality of current training target sound anchors, respectively, comprises:
for each current training audio clip in the plurality of current training audio clips, processing the current training audio clip using the second neural network to obtain a current training sound anchor vector corresponding to the current training audio clip;
determining at least one current training audio segment, among the plurality of current training audio segments, corresponding to the current training audio clip;
and according to the at least one current training audio segment, selecting at least one current training target sound anchor corresponding to the current training audio clip from the current training sound anchor vector, thereby obtaining the plurality of current training target sound anchors.
16. The model training method of claim 14, wherein the plurality of current training audio clips are distinct from one another,
the plurality of current training audio clips are in one-to-one correspondence with the plurality of current training audio segments and are also in one-to-one correspondence with the plurality of current training target sound anchors.
17. The model training method of claim 16, wherein determining the plurality of first current training condition vectors comprises:
processing the plurality of current training audio clips by using the second neural network, respectively, to obtain a plurality of current training sound anchor vectors in one-to-one correspondence with the plurality of current training audio clips, wherein the plurality of current training sound anchor vectors are used as the plurality of first current training condition vectors.
18. The model training method of claim 13, wherein spectrally transforming the first intermediate current training mixed audio to obtain the current training mixed audio comprises:
carrying out short-time Fourier transform processing on the first intermediate current training mixed audio to obtain the current training mixed audio.
19. The model training method of claim 13, wherein spectrally transforming the first intermediate current training mixed audio to obtain the current training mixed audio comprises:
performing short-time Fourier transform processing on the first intermediate current training mixed audio to obtain a second intermediate current training mixed audio;
and carrying out logarithmic Mel frequency spectrum processing on the second intermediate current training mixed audio to obtain the current training mixed audio.
20. The model training method of claim 13, wherein the plurality of current training audio segments are all the same length in time.
21. A sound source separation apparatus comprising:
a memory for non-transitory storage of computer readable instructions; and
a processor for executing the computer readable instructions, which when executed by the processor, perform the sound source separation method according to any one of claims 1-9.
22. A model training apparatus comprising:
a memory for non-transitory storage of computer readable instructions; and
a processor for executing the computer readable instructions, which when executed by the processor, perform the model training method of any one of claims 10-20.
23. A storage medium non-transitorily storing computer-readable instructions which, when executed by a computer, can perform the sound source separation method according to any one of claims 1 to 9.
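As referenced in claim 8, the following is a minimal sketch of such a one-hot condition vector: each sound source tag is mapped to a vector of N type probability values in which only the entry for the labelled sound source type is 1. The label-to-index mapping and the example sound source types are hypothetical and for illustration only.

```python
import numpy as np

# Hypothetical label-to-index mapping; the actual sound source types are
# whatever classes the sound event detection network distinguishes.
SOURCE_TYPES = ["speech", "music", "dog_bark", "car_horn"]  # N = 4 here

def condition_vector(sound_source_tag, source_types=SOURCE_TYPES):
    """Build the N-dimensional condition vector of claim 8: the type probability
    value for the tagged sound source type is 1, all others are 0."""
    vec = np.zeros(len(source_types), dtype=np.float32)
    vec[source_types.index(sound_source_tag)] = 1.0
    return vec

# Example: condition_vector("music") -> array([0., 1., 0., 0.], dtype=float32)
```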
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010136342.1A CN111370019B (en) | 2020-03-02 | 2020-03-02 | Sound source separation method and device, and neural network model training method and device |
Publications (2)
Publication Number | Publication Date |
---|---|
CN111370019A true CN111370019A (en) | 2020-07-03 |
CN111370019B CN111370019B (en) | 2023-08-29 |
Family
ID=71208552
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202010136342.1A Active CN111370019B (en) | 2020-03-02 | 2020-03-02 | Sound source separation method and device, and neural network model training method and device |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN111370019B (en) |
Patent Citations (15)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
WO2007044377A2 (en) * | 2005-10-06 | 2007-04-19 | Dts, Inc. | Neural network classifier for seperating audio sources from a monophonic audio signal |
CN101246690A (en) * | 2007-02-15 | 2008-08-20 | 索尼株式会社 | Sound processing apparatus, sound processing method and program |
JP2011069948A (en) * | 2009-09-25 | 2011-04-07 | Nec Corp | Device, method and program for separating sound source signal |
KR20120040637A (en) * | 2010-10-19 | 2012-04-27 | 한국전자통신연구원 | Apparatus and method for separating sound source |
GB201114737D0 (en) * | 2011-08-26 | 2011-10-12 | Univ Belfast | Method and apparatus for acoustic source separation |
JP2017520784A (en) * | 2014-05-15 | 2017-07-27 | トムソン ライセンシングThomson Licensing | On-the-fly sound source separation method and system |
CN105070304A (en) * | 2015-08-11 | 2015-11-18 | 小米科技有限责任公司 | Method, device and electronic equipment for realizing recording of object audio |
US20180122403A1 (en) * | 2016-02-16 | 2018-05-03 | Red Pill VR, Inc. | Real-time audio source separation using deep neural networks |
EP3264802A1 (en) * | 2016-06-30 | 2018-01-03 | Nokia Technologies Oy | Spatial audio processing for moving sound sources |
CN107731220A (en) * | 2017-10-18 | 2018-02-23 | 北京达佳互联信息技术有限公司 | Audio identification methods, device and server |
WO2019079749A1 (en) * | 2017-10-19 | 2019-04-25 | Syntiant | Systems and methods for customizing neural networks |
CN107942290A (en) * | 2017-11-16 | 2018-04-20 | 东南大学 | Binaural sound sources localization method based on BP neural network |
CN108766440A (en) * | 2018-05-28 | 2018-11-06 | 平安科技(深圳)有限公司 | Speaker's disjunctive model training method, two speaker's separation methods and relevant device |
CN110047512A (en) * | 2019-04-25 | 2019-07-23 | 广东工业大学 | A kind of ambient sound classification method, system and relevant apparatus |
CN110335622A (en) * | 2019-06-13 | 2019-10-15 | 平安科技(深圳)有限公司 | Voice frequency tone color separation method, apparatus, computer equipment and storage medium |
Non-Patent Citations (1)
Title |
---|
易琳; 沈琦; 王锐; 王柯; 彭向阳: "Audio monitoring and fault diagnosis system for power equipment in unattended substations" (无人值守变电站电力设备音频监测及故障诊断系统), 计算机测量与控制 (Computer Measurement & Control), no. 11 *
Cited By (15)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111724807A (en) * | 2020-08-05 | 2020-09-29 | 字节跳动有限公司 | Audio separation method and device, electronic equipment and computer readable storage medium |
CN111724807B (en) * | 2020-08-05 | 2023-08-11 | 字节跳动有限公司 | Audio separation method, device, electronic equipment and computer readable storage medium |
CN112071330B (en) * | 2020-09-16 | 2022-09-20 | 腾讯科技(深圳)有限公司 | Audio data processing method and device and computer readable storage medium |
CN112071330A (en) * | 2020-09-16 | 2020-12-11 | 腾讯科技(深圳)有限公司 | Audio data processing method and device and computer readable storage medium |
CN112165591A (en) * | 2020-09-30 | 2021-01-01 | 联想(北京)有限公司 | Audio data processing method and device and electronic equipment |
CN112165591B (en) * | 2020-09-30 | 2022-05-31 | 联想(北京)有限公司 | Audio data processing method and device and electronic equipment |
CN113593600B (en) * | 2021-01-26 | 2024-03-15 | 腾讯科技(深圳)有限公司 | Mixed voice separation method and device, storage medium and electronic equipment |
CN113593600A (en) * | 2021-01-26 | 2021-11-02 | 腾讯科技(深圳)有限公司 | Mixed voice separation method and device, storage medium and electronic equipment |
CN113113040A (en) * | 2021-03-22 | 2021-07-13 | 北京小米移动软件有限公司 | Audio processing method and device, terminal and storage medium |
CN113129919A (en) * | 2021-04-17 | 2021-07-16 | 上海麦图信息科技有限公司 | Air control voice noise reduction method based on deep learning |
CN113241091A (en) * | 2021-05-28 | 2021-08-10 | 思必驰科技股份有限公司 | Sound separation enhancement method and system |
WO2023024501A1 (en) * | 2021-08-24 | 2023-03-02 | 北京百度网讯科技有限公司 | Audio data processing method and apparatus, and device and storage medium |
CN115132183A (en) * | 2022-05-25 | 2022-09-30 | 腾讯科技(深圳)有限公司 | Method, apparatus, device, medium, and program product for training audio recognition model |
CN115132183B (en) * | 2022-05-25 | 2024-04-12 | 腾讯科技(深圳)有限公司 | Training method, device, equipment, medium and program product of audio recognition model |
CN118447867A (en) * | 2023-12-28 | 2024-08-06 | 荣耀终端有限公司 | Music separation method, music separation device, electronic apparatus, and storage medium |
Also Published As
Publication number | Publication date |
---|---|
CN111370019B (en) | 2023-08-29 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN111370019B (en) | Sound source separation method and device, and neural network model training method and device | |
CN108989882B (en) | Method and apparatus for outputting music pieces in video | |
CN112183120A (en) | Speech translation method, device, equipment and storage medium | |
CN111724807B (en) | Audio separation method, device, electronic equipment and computer readable storage medium | |
CN111883107B (en) | Speech synthesis and feature extraction model training method, device, medium and equipment | |
CN112153460B (en) | Video dubbing method and device, electronic equipment and storage medium | |
CN111444379B (en) | Audio feature vector generation method and audio fragment representation model training method | |
CN113555032B (en) | Multi-speaker scene recognition and network training method and device | |
US20240029709A1 (en) | Voice generation method and apparatus, device, and computer readable medium | |
CN113205820B (en) | Method for generating voice coder for voice event detection | |
WO2023030235A1 (en) | Target audio output method and system, readable storage medium, and electronic apparatus | |
CN113257283B (en) | Audio signal processing method and device, electronic equipment and storage medium | |
US20230298611A1 (en) | Speech enhancement | |
CN113205793A (en) | Audio generation method and device, storage medium and electronic equipment | |
WO2024114303A1 (en) | Phoneme recognition method and apparatus, electronic device and storage medium | |
WO2024018429A1 (en) | Audio signal processing method, audio signal processing apparatus, computer device and storage medium | |
WO2023226572A1 (en) | Feature representation extraction method and apparatus, device, medium and program product | |
CN111653261A (en) | Speech synthesis method, speech synthesis device, readable storage medium and electronic equipment | |
CN111312224A (en) | Training method and device of voice segmentation model and electronic equipment | |
CN104715756A (en) | Audio data processing method and device | |
CN117373468A (en) | Far-field voice enhancement processing method, far-field voice enhancement processing device, computer equipment and storage medium | |
JP7376896B2 (en) | Learning device, learning method, learning program, generation device, generation method, and generation program | |
Xu et al. | Meta learning based audio tagging. | |
CN117316160B (en) | Silent speech recognition method, silent speech recognition apparatus, electronic device, and computer-readable medium | |
CN117649846B (en) | Speech recognition model generation method, speech recognition method, device and medium |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||