CN111243620A - Voice separation model training method and device, storage medium and computer equipment


Info

Publication number
CN111243620A
Authority
CN
China
Prior art keywords
audio
model
training
coding
target
Prior art date
Legal status
Granted
Application number
CN202010013978.7A
Other languages
Chinese (zh)
Other versions
CN111243620B (en)
Inventor
王珺
林永业
苏丹
俞栋
Current Assignee
Tencent Technology Shenzhen Co Ltd
Original Assignee
Tencent Technology Shenzhen Co Ltd
Priority date
Filing date
Publication date
Application filed by Tencent Technology Shenzhen Co Ltd
Priority to CN202010013978.7A priority Critical patent/CN111243620B/en
Publication of CN111243620A publication Critical patent/CN111243620A/en
Priority to PCT/CN2020/120815 priority patent/WO2021139294A1/en
Priority to EP20912595.4A priority patent/EP4002362B1/en
Priority to US17/672,565 priority patent/US11908455B2/en
Application granted granted Critical
Publication of CN111243620B publication Critical patent/CN111243620B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 21/00 Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L 21/02 Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L 21/0272 Voice signal separating
    • G10L 21/0308 Voice signal separating characterised by the type of parameter measurement, e.g. correlation techniques, zero crossing techniques or predictive techniques
    • G10L 15/00 Speech recognition
    • G10L 15/06 Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L 15/063 Training
    • G10L 15/04 Segmentation; Word boundary detection
    • G10L 15/05 Word boundary detection
    • G10L 15/08 Speech classification or search
    • G10L 15/16 Speech classification or search using artificial neural networks
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02T CLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T 10/00 Road transport of goods or passengers
    • Y02T 10/10 Internal combustion engine [ICE] based vehicles
    • Y02T 10/40 Engine management systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Artificial Intelligence (AREA)
  • Quality & Reliability (AREA)
  • Signal Processing (AREA)
  • Evolutionary Computation (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Machine Translation (AREA)

Abstract

The application relates to a speech separation model training method and apparatus, a computer-readable storage medium and a computer device. The method includes: acquiring a first audio and a second audio, the first audio containing a target audio and having corresponding annotated audio, and the second audio containing noise audio; acquiring a coding model, an extraction model and an initial estimation model; performing unsupervised training on the coding model, the extraction model and the estimation model according to the second audio, and adjusting the model parameters of the extraction model and the estimation model; performing supervised training on the coding model and the extraction model according to the first audio and its annotated audio, and adjusting the model parameters of the coding model; and continuing the unsupervised training and the supervised training so that the two overlap, ending the training when a training stop condition is met. The provided scheme reduces the cost of model training.

Description

Voice separation model training method and device, storage medium and computer equipment
Technical Field
The present application relates to the field of computer technologies, and in particular, to a method and an apparatus for training a speech separation model, a storage medium, and a computer device.
Background
Speech, as the acoustic carrier of language, is one of the most natural and efficient ways for humans to exchange information. During voice communication, people inevitably suffer interference from environmental noise or from other speakers, so the captured audio is not the pure speech of the target speaker. In recent years, many speech separation models have been trained to separate the target speaker's speech from mixed audio. However, current speech separation models are usually trained by supervised learning, which requires manually collecting or labeling high-quality training samples, making the training process expensive.
Disclosure of Invention
Therefore, it is necessary to provide a speech separation model training method, apparatus, storage medium and computer device that address the technical problem of the high cost of existing model training methods.
A method of speech separation model training, comprising:
acquiring a first audio and a second audio; the first audio comprises a target audio and a corresponding annotated audio; the second audio comprises noise audio;
acquiring a coding model, an extraction model and an initial estimation model;
unsupervised training of the coding model, the extraction model and the estimation model according to the second audio, and adjustment of model parameters of the extraction model and the estimation model;
carrying out supervised training on the coding model and the extraction model according to the first audio and the labeled audio corresponding to the first audio, and adjusting model parameters of the coding model;
continuing the unsupervised training and the supervised training so that the two overlap, and ending the training when a training stop condition is met;
wherein the output of the coding model is the input of the extraction model; the output of the coding model and the output of the extraction model are together the input of the estimation model; the coding model and the extraction model are jointly used for speech separation.
A speech separation model training apparatus comprising:
the acquisition module is used for acquiring a first audio and a second audio; the first audio comprises a target audio and a corresponding annotated audio; the second audio comprises noise audio; acquiring a coding model, an extraction model and an initial estimation model;
a first training module, configured to perform unsupervised training on the coding model, the extraction model, and the estimation model according to the second audio, and adjust model parameters of the extraction model and the estimation model;
the second training module is used for carrying out supervised training on the coding model and the extraction model according to the first audio and the labeled audio corresponding to the first audio, and adjusting the model parameters of the coding model;
an overlapping module, configured to continue the unsupervised training and the supervised training so that they overlap, and to end the training when a training stop condition is met;
wherein the output of the coding model is the input of the extraction model; the output of the coding model and the output of the extraction model are together the input of the estimation model; the coding model and the extraction model are jointly used for speech separation.
A computer-readable storage medium, in which a computer program is stored which, when being executed by a processor, causes the processor to carry out the steps of the above-mentioned speech separation model training method.
A computer device comprising a memory and a processor, the memory storing a computer program which, when executed by the processor, causes the processor to carry out the steps of the above-mentioned speech separation model training method.
The speech separation model training method, apparatus, computer-readable storage medium and computer device provide a training scheme in which unsupervised learning and supervised learning overlap. Starting from a pre-trained coding model and extraction model, the estimation model and unlabeled training samples are used to train the coding model, the extraction model and the estimation model without supervision, optimizing the model parameters of the extraction model and the estimation model; labeled training samples are then used to train the coding model and the extraction model with supervision, optimizing the model parameters of the coding model; and the unsupervised training and the supervised training are overlapped until training ends. In this way, the representation capability learned by unsupervised learning and the discrimination capability learned by supervised learning optimize each other across iterations, so the trained coding model and extraction model separate speech more effectively, and only a small number of labeled samples are needed during training, greatly reducing the cost.
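For readers who prefer a procedural view, the following Python sketch illustrates how the overlapped training described above could be organized; the function and object names (encoder, abstractor, estimator, unsupervised_step, supervised_step, stop_condition) are illustrative assumptions rather than names used in this application.

```python
# Hypothetical sketch of the overlapped (alternating) training scheme described above.
# Assumes a pre-trained `encoder` and `abstractor`, an initial `estimator`, and data
# loaders `unlabeled_loader` (second audio) and `labeled_loader` (first audio + labels).

def train_overlapped(encoder, abstractor, estimator,
                     unlabeled_loader, labeled_loader,
                     unsupervised_step, supervised_step,
                     max_rounds=100, stop_condition=None):
    for round_idx in range(max_rounds):
        # Unsupervised phase: fix the encoder, update abstractor + estimator (S106).
        for second_audio in unlabeled_loader:
            unsupervised_step(encoder, abstractor, estimator, second_audio)

        # Supervised phase: fix the abstractor, update the encoder (S108).
        for first_audio, annotated_audio in labeled_loader:
            supervised_step(encoder, abstractor, first_audio, annotated_audio)

        # End training once the stop condition (e.g. convergence) is met.
        if stop_condition is not None and stop_condition(round_idx):
            break
    return encoder, abstractor  # jointly used for speech separation
```

The sketch only fixes which components are updated in each phase; the losses used inside each step are described in the detailed embodiments below.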
Drawings
FIG. 1 is a schematic flow chart diagram illustrating a method for training a speech separation model in one embodiment;
FIG. 2 is a diagram illustrating a model structure of a speech separation model training method according to an embodiment;
FIG. 3 is a schematic diagram of an embodiment of an unsupervised training process;
FIG. 4 is a schematic flow chart of supervised training in one embodiment;
FIG. 5 is a diagram of an application environment of a speech separation scenario in one embodiment;
FIG. 6 is a flow diagram illustrating speech separation in one embodiment;
FIG. 7 is a block diagram showing the structure of a speech separation model training apparatus according to an embodiment;
FIG. 8 is a block diagram showing the construction of a speech separation model training apparatus according to another embodiment;
FIG. 9 is a block diagram showing the construction of a speech separation model training apparatus according to another embodiment;
FIG. 10 is a block diagram showing a configuration of a computer device according to an embodiment.
Detailed Description
In order to make the objects, technical solutions and advantages of the present application more apparent, the present application is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the present application and are not intended to limit the present application.
Artificial Intelligence (AI) comprises theories, methods, techniques and application systems that use a digital computer, or a machine controlled by a digital computer, to simulate, extend and expand human intelligence, perceive the environment, acquire knowledge, and use that knowledge to obtain the best results. In other words, artificial intelligence is a comprehensive branch of computer science that attempts to understand the essence of intelligence and to produce new intelligent machines that can react in a manner similar to human intelligence. Artificial intelligence studies the design principles and implementation methods of various intelligent machines, so that machines have the capabilities of perception, reasoning and decision making.
Artificial intelligence technology is a comprehensive discipline covering a wide range of fields, involving both hardware-level and software-level technologies. Basic artificial intelligence technologies generally include sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, big data processing, operation/interaction systems, mechatronics, and the like. Artificial intelligence software technologies mainly include computer vision, speech processing, natural language processing, and machine learning/deep learning.
The key technologies of Speech Technology include automatic speech recognition (ASR), text-to-speech synthesis (TTS) and voiceprint recognition. Enabling computers to listen, see, speak and feel is the development direction of future human-computer interaction, and voice is expected to become one of the most promising modes of human-computer interaction.
Machine Learning (ML) is a multi-disciplinary field involving probability theory, statistics, approximation theory, convex analysis, algorithmic complexity theory and other disciplines. It studies how computers simulate or implement human learning behavior in order to acquire new knowledge or skills and to reorganize existing knowledge structures so as to continuously improve their performance. Machine learning is the core of artificial intelligence and the fundamental way to make computers intelligent, and it is applied in all fields of artificial intelligence. Machine learning and deep learning generally include techniques such as artificial neural networks, belief networks, reinforcement learning, transfer learning, inductive learning and learning from instruction.
With the research and progress of artificial intelligence technology, the artificial intelligence technology is developed and applied in a plurality of fields, such as common smart homes, smart wearable devices, virtual assistants, smart speakers, smart marketing, unmanned driving, automatic driving, unmanned aerial vehicles, robots, smart medical care, smart customer service, and the like.
The scheme provided by the embodiments of the application relates to artificial intelligence technologies such as speech processing and machine learning/deep learning, and is described in detail by the following embodiments.
In the embodiments of the application, the coding model and the extraction model obtained after the overlapped unsupervised and supervised training is completed can be used jointly for speech separation. Speech separation refers to separating the target audio from mixed audio. The mixed audio may be the target speaker's voice mixed with noise, or the target speaker's voice mixed with other speakers' voices. The target audio here may be the target speaker's voice; speech separation is thus the task of separating the pure target speaker's voice from audio in which that voice is mixed with noise.
As shown in FIG. 1, in one embodiment, a method of speech separation model training is provided. The embodiment is mainly illustrated by applying the method to computer equipment. The computer device may be a terminal or a server. Referring to fig. 1, the method for training the speech separation model specifically includes the following steps:
s102, acquiring a first audio and a second audio; the first audio comprises a target audio and a corresponding annotated audio; the second audio includes noise audio.
Here, the first audio and the second audio are both audio used as model training data. The target audio is the audio to be separated, as the separation target, from the first audio. The annotated audio is audio used as a model training label; the annotated audio of a first audio containing a target audio is the clean target audio. The first audio is mixed audio and may also include noise audio. Noise audio is a concept defined relative to the target audio: any sound signal other than the target audio is noise audio. The target audio may specifically be the voice of a speaker, the melody of a musical instrument being played, or the like. The noise audio may specifically be interfering sounds, ambient sounds, or non-target voices or melodies.
For example, when a voice collection device collects the speech of the target speaker in the far field, ambient sounds and the speech of other speakers are also present. The collected audio then includes the target speaker's voice, environmental sound signals, other speakers' voices, and so on. This collected audio can be used as the first audio: the target audio it contains is the target speaker's voice, and the noise audio it contains is the environmental sound signals and the other speakers' voices.
Since the first audio includes both the target audio and the noise audio and has annotated audio, the computer device can use the first audio and its corresponding annotated audio to train the model with supervision.
The second audio may be a single (unmixed) audio or a mixed audio. When the second audio is a single audio, it is clean noise audio; in that case the noise audio is background audio or interference audio, which can be considered free of speaker voices. When the second audio is mixed audio, it includes a target audio and noise audio.
Because the second audio includes noise audio and has no annotated audio, the computer device can use the second audio to train the model without supervision.
In a specific embodiment, the first audio and the second audio are single channel audio; the first audio is mixed audio including a target audio; the marked audio of the first audio is pure target audio; the second audio includes clean noise audio and mixed audio including the noise audio.
Specifically, the first audio and the second audio are both audio collected by a single microphone, that is, single-channel audio. The first audio is mixed audio including a target audio. The labeled audio of the first audio is a pure target audio. The second audio may be clean noise audio or mixed audio including noise audio.
For example, the target audio is the voice of the target speaker, the noise audio is the ambient sound of a public place such as a train station or a shopping mall, the first audio is far-field recorded speaker speech, and the annotated audio of the first audio may be near-field recorded speaker speech. The second audio may be far-field recorded speaker speech, or background sound recorded when no speaker is talking.
In one embodiment, the computer device can mix the clean target speaker's voice with other speakers' voices or with ambient background sounds to obtain a first audio; thus, the pure speech of the target speaker can be used as the annotation audio of the first audio obtained by mixing. Here, the pure speech of the target speaker may be recorded in a quiet environment or recorded by a near-field microphone.
In one specific embodiment, the computer device can simultaneously use the far-field microphone and the near-field microphone to collect the voice of the target speaker, and use the audio collected by the far-field microphone as the first audio and the audio collected by the near-field microphone as the labeled audio of the first audio. It can be understood that the far-field microphone is far from the target speaker, and it can be considered that the far-field microphone collects not only the voice of the target speaker but also the environmental background sound and/or the voice of other speakers when collecting the voice, that is, it is considered that the far-field microphone collects the mixed audio including the voice of the target speaker, which can be used as the first audio. The near-field microphone is close to the target speaker, for example, near the mouth of the target speaker, and the voice of the target speaker collected by the near-field microphone can be considered to be pure, that is, the voice can be used as the labeled audio of the first audio.
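As a minimal illustration of how such training audio could be assembled, the sketch below mixes a clean target recording with interference at an assumed 0 dB signal-to-noise ratio to obtain a first audio and its annotation, and gathers unlabeled mixtures and clean noise as second audio; the helper name mix_at_snr, the placeholder signals and the mixing ratio are assumptions, not part of this application.

```python
import numpy as np

def mix_at_snr(target, noise, snr_db=0.0):
    """Scale `noise` so the mixture has the requested SNR, then add it to `target`."""
    n = min(len(target), len(noise))
    target, noise = target[:n], noise[:n]
    target_power = np.mean(target ** 2) + 1e-12
    noise_power = np.mean(noise ** 2) + 1e-12
    scale = np.sqrt(target_power / (noise_power * 10 ** (snr_db / 10.0)))
    return target + scale * noise

# First audio: a mixture containing the target speaker, annotated by the clean target speech.
clean_target = np.random.randn(16000 * 3)   # placeholder for near-field / studio target speech
interference = np.random.randn(16000 * 3)   # placeholder for ambient noise or other speakers
first_audio = mix_at_snr(clean_target, interference, snr_db=0.0)
annotation = clean_target                   # label used for supervised training

# Second audio: unlabeled far-field mixtures and/or clean noise, used for unsupervised training.
second_audio = [np.random.randn(16000 * 3), interference]
```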
S104, acquiring a coding model, an extraction model and an initial estimation model; wherein, the output of the coding model is the input of the extraction model; the output of the coding model and the output of the extraction model are jointly the input of the estimation model; the coding model and the extraction model are used together for speech separation.
Here, an encoding model (Encoder) is a machine learning model for mapping low-dimensional data to high-dimensional data; the dimensionality of the former is lower than that of the latter, hence the terms low-dimensional and high-dimensional data. An extraction model (Abstractor) is a machine learning model used to build abstract representations from its input. An estimation model (Estimator) is a machine learning model used to estimate the mutual information between two inputs.
For example, referring to fig. 2, the connection relationship between the coding model (Encoder), the extraction model (Abstractor) and the estimation model (Estimator) may specifically be: the output of the coding model is the input of the extraction model, and the output of the coding model together with the output of the extraction model is the input of the estimation model. The input of the coding model is the time-frequency points of labeled and unlabeled mixed speech (labeled & unlabeled speech) and of clean noise signals (noise). The coding model maps the time-frequency points of the input domain to an embedding space to obtain embedding features; the extraction model then extracts abstract features of the target speaker's voice from the embedded features. The output of the estimation model is a mutual information estimate.
The coding model and the extraction model are used together for speech separation. That is, the coding model and the extraction model are used together to separate the target audio from the mixed audio. The coding model and the extraction model are components of a voice separation model, and the voice separation model comprises a coding model and an extraction model.
It will be appreciated that in most industrial applications where speech enhancement and separation are actually deployed, annotated audio often covers only a small portion of the application scenarios, while a large amount of data is unlabeled. Besides the efficiency of acquiring training data, supervised learning that relies solely on labeled data also has robustness and generalization problems: for example, speech features learned from one noisy environment using supervised learning alone are often not suitable for another background noise environment. Thus, in the embodiments provided herein, the computer device can use a large amount of unlabeled audio together with the estimation model, exploiting the robustness and generalization of the representation capability learned by unsupervised learning to optimize the discrimination capability of supervised learning, and using the discrimination capability learned by supervised learning to optimize the representation capability of unsupervised learning. The discrimination capability learned by supervised training is the ability to discriminate the target audio within the mixed audio.
The computer device may obtain an initial coding model, an extraction model, and an estimation model, and perform subsequent training on the models, so that the trained coding model and extraction model can be jointly applied to speech separation.
In one embodiment, the computer device may use annotated audio to pre-train the coding model and the extraction model with supervision; the specific pre-training process may refer to the description in subsequent embodiments. The computer device can thus obtain a pre-trained coding model and extraction model together with an initial estimation model, and then train these models further so that they achieve higher accuracy.
In a specific embodiment, the encoding model (Encoder) and the extraction model (Abstractor) may adopt a Bi-directional Long Short-Term Memory (BiLSTM) structure, a Convolutional Neural Network (CNN) structure, or a combination with other network structures, such as a time-delay network or a gated convolutional neural network. The model type and topology are not specifically limited in this application, and various other effective model structures may be substituted. The estimation model (Estimator) may use a weighting matrix to compute the inner product between its two inputs.
And S106, carrying out unsupervised training on the coding model, the extraction model and the estimation model according to the second audio, and adjusting model parameters of the extraction model and the estimation model.
Among them, the unsupervised training may also be called unsupervised learning, which is a way for a machine learning model to learn based on unlabeled sample data.
In one embodiment, performing unsupervised training on the coding model, the extraction model and the estimation model according to the second audio and adjusting the model parameters of the extraction model and the estimation model includes: encoding the audio features of the second audio through the coding model to obtain the embedded features of the second audio; extracting the embedded features of the second audio through the extraction model to obtain the abstract features of the target audio included in the second audio; processing, through the estimation model, the embedded features of the second audio and the abstract features of the target audio included in the second audio to obtain the mutual information estimation features between the second audio and those abstract features; constructing an unsupervised training loss function from the mutual information estimation features; and fixing the model parameters of the coding model and adjusting the model parameters of the extraction model and the estimation model in the direction that minimizes the unsupervised training loss function.
The audio features are data obtained by processing the physical information of the audio, such as spectral information. The audio features may specifically be time-frequency features, Gammatone power spectrum features, spectral amplitude features, Mel Frequency Cepstrum Coefficient (MFCC) features, and the like, where Gammatone features simulate the filtering performed by the human cochlea.
In one embodiment, the computer device may perform a short-time Fourier transform on the second audio to obtain the time-frequency points of the second audio, and take the time-frequency features formed by these time-frequency points as the audio features of the second audio.
Specifically, the computer device may perform a short-time Fourier transform (STFT) on the second audio to obtain its short-time Fourier spectrum, an element of the input domain χ ⊆ R^{T×F}, where T denotes the number of frames in the time dimension, F denotes the number of frequency bands in the frequency dimension, and R denotes the real numbers.
In the embodiment of the present application, the short-time Fourier spectrum of the second audio is used as the input data (training samples) of the coding model. A set of unlabeled training samples derived from a set of unlabeled second audio can then be represented as {X^{(1)}, X^{(2)}, ..., X^{(L)}} ∈ χ. Each training sample is a set of time-frequency points of the input space, X = {X_{t,f}}_{t=1,...,T; f=1,...,F}, where X_{t,f} denotes the time-frequency point of the f-th frequency band in the t-th frame. The time-frequency feature formed by these time-frequency points is a real matrix of dimension T × F.
For a mixed audio in which a target audio and a noise audio are mixed, the time-frequency points X of the mixed audio can be considered to be formed by adding the time-frequency points x of the target audio and the time-frequency points e of the noise audio, i.e., X = x + e.
In addition, a set of unsupervised training samples derived from clean noise audio can be represented as {X^{(L+1)}, X^{(L+2)}, ..., X^{(L+U)}} ∈ χ.
In a specific embodiment, the audio sampling rate is 16 kHz, i.e., 16k sample points per second. The short-time Fourier transform uses a 25 ms STFT window length, a 10 ms window shift and 257 frequency bands. That is, when the audio is framed, the frame length is 25 ms and the window shift is 10 ms, which gives the number of frames T, and F = 257.
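The time-frequency features described in this specific embodiment can be computed, for example, as in the following sketch, which uses librosa for the STFT and assumes a 512-point FFT, inferred from the 257 frequency bands; the application itself does not state the FFT size.

```python
import numpy as np
import librosa  # assumed available for the STFT

def stft_features(waveform, sample_rate=16000):
    """Return the T x F short-time Fourier magnitude spectrum used as model input."""
    assert sample_rate == 16000
    spec = librosa.stft(waveform.astype(np.float32),
                        n_fft=512,        # 512-point FFT -> 257 frequency bands (assumption)
                        win_length=400,   # 25 ms window at 16 kHz
                        hop_length=160)   # 10 ms window shift
    return np.abs(spec).T                 # shape (T, F) with F = 257
```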
In one embodiment, the computer device may map the low-dimensional audio features to a high-dimensional embedding space (Embedding Space) through the coding model, resulting in embedding features.
Specifically, the computer device may input a time-frequency point matrix (time-frequency feature) of the second audio, which is obtained by performing a short-time fourier transform on the second audio, into the coding model. And the coding model performs nonlinear operation on the input, embeds the input into an embedding space with D dimension, and obtains the embedding characteristic of the second audio in the embedding space.
For example, with continued reference to fig. 2, the coding model Encoder is defined as E_θ: χ → ν, where θ is the model parameter of the coding model, D is the dimension of the embedding space, and E_θ denotes the computation that maps the input domain χ to the high-dimensional embedding space ν. The embedded features obtained by mapping the time-frequency features formed by a group of time-frequency points of the input space form a real matrix of dimension T × F × D.
It should be noted that the input domain χ ⊆ R^{T×F} represents the short-time Fourier spectra of the audio, with T the number of frames in the time dimension and F the number of frequency bands in the frequency dimension. The input of the coding model is a group of time-frequency points (T × F) belonging to the input domain; this group can also be divided by frame into T subgroups of time-frequency points (1 × F), i.e., the time-frequency points of each audio frame. Correspondingly, the embedded features of the output domain ν include an embedded feature υ_t for each frame, i.e., each frame of the second audio corresponds to one embedded feature.
In a specific embodiment, the coding model may be a 4-layer BiLSTM with 600 nodes per hidden layer. The BiLSTM is followed by a fully connected layer that maps the 600-dimensional hidden vector to a 257 × 40-dimensional embedding space, where 257 is the number of STFT frequency bands, i.e., F, and 40 is the embedding space dimension, D.
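A possible reading of this specific embodiment is sketched below in PyTorch; treating "600 nodes per hidden layer" as 300 units per BiLSTM direction is an assumption of the sketch, as are all layer and class names.

```python
import torch
import torch.nn as nn

class Encoder(nn.Module):
    """Sketch of the coding model: 4-layer BiLSTM followed by a fully connected layer
    that maps each frame to an F x D = 257 x 40 embedding (reshaped to T x F x D)."""
    def __init__(self, num_bands=257, embed_dim=40, hidden=300, layers=4):
        super().__init__()
        # 300 units per direction gives a 600-dimensional hidden vector per frame,
        # which is one possible reading of "600 nodes per hidden layer".
        self.blstm = nn.LSTM(input_size=num_bands, hidden_size=hidden,
                             num_layers=layers, bidirectional=True, batch_first=True)
        self.proj = nn.Linear(2 * hidden, num_bands * embed_dim)
        self.num_bands, self.embed_dim = num_bands, embed_dim

    def forward(self, spectrogram):             # spectrogram: (batch, T, F)
        h, _ = self.blstm(spectrogram)           # (batch, T, 600)
        v = self.proj(h)                         # (batch, T, F * D)
        return v.view(spectrogram.size(0), -1, self.num_bands, self.embed_dim)  # (batch, T, F, D)
```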
In one embodiment, extracting the embedded features of the second audio through the extraction model to obtain the abstract features of the target audio included in the second audio includes: processing the embedded features of the second audio through the first hidden layer of the extraction model to obtain the prediction probability that each time-frequency point of the second audio is a time-frequency point of the target audio; and computing, through the second hidden layer of the extraction model, the embedded features of the time-frequency points and their prediction probabilities in time order to construct the global abstract features of the target audio included in the second audio.
Wherein, the hidden layer is a term in the network model, and is an intermediate layer relative to the input layer and the output layer. The hidden layer comprises model parameters obtained by training the network model. The hidden layer of the extraction model here is an intermediate layer with respect to the input layer of the extraction model and the output layer of the extraction model. All intermediate layers between the input layer and the output layer of the extraction model can be collectively referred to as hidden layers, or the intermediate layers can be divided, i.e. more than one hidden layers, such as a first hidden layer or a second hidden layer. The hidden layer of the extraction model may include more than one network structure. Each network structure may include one or more network layers. The hidden layer of the extraction model may be understood and described herein as a "black box".
Specifically, the first hidden layer of the extraction model processes the embedded features of the second audio to obtain the prediction probability that each time-frequency point of the second audio is predicted as a time-frequency point of the target audio. The second hidden layer of the extraction model then computes, in time order, the embedded feature and the prediction probability of each time-frequency point to construct the global abstract features of the target audio included in the second audio.
For example, with continued reference to fig. 2, the extraction model Abstractor is defined as A_φ: ν → C, where φ is the model parameter of the extraction model, and A_φ denotes the operation that converts the embedded feature υ into a probability matrix p and then obtains the abstract feature c from υ and p. Here p is a real matrix of dimension T × F, and c is a real vector of dimension D × 1 (or 1 × D).
It should be noted that the input of the coding model is a group of time-frequency points (T × F) belonging to the input domain, and p is a real matrix of dimension T × F. Thus p can be regarded as a probability matrix composed of the prediction probabilities corresponding to each of the T × F time-frequency points, where the prediction probability represents the probability that a time-frequency point is predicted to be a time-frequency point of the target audio.
In a specific embodiment, the extraction model may compute the global abstract feature by the following formula:
c = Σ_{t,f} (p_{t,f} ⊙ υ_{t,f}) / Σ_{t,f} p_{t,f}    (1)
where c ∈ C is the global abstract feature of the target audio included in the second audio, υ ∈ ν is the embedded feature, p ∈ P is the prediction probability, t denotes the frame index, f denotes the frequency band index, and ⊙ denotes the element-wise product.
In one embodiment, the extraction model may weight formula (1) with a binary threshold matrix to reduce the impact of low-energy noise, as follows:
c = Σ_{t,f} (w_{t,f} p_{t,f} ⊙ υ_{t,f}) / Σ_{t,f} (w_{t,f} p_{t,f})    (2)
where w ∈ R^{T×F} denotes the binary threshold matrix
w_{t,f} = 1 if the energy of X_{t,f} exceeds a threshold, and w_{t,f} = 0 otherwise.    (3)
it should be noted that, for simplicity of representation, the Embedding dimension index subscripts of c and υ are omitted from the formulas provided in the embodiments of the present application.
For example, as shown in fig. 3, in the unsupervised training phase, the time-frequency points of the second audio, X = {X_{t,f}}_{t=1,...,T; f=1,...,F}, are input into the encoding (Encoder) model, which outputs the embedded features {υ_t}_{t=1,...,T} corresponding to the frames of the second audio. {υ_t}_{t=1,...,T} is input into the extraction (Abstractor) model, which produces as an intermediate result the prediction probabilities {p_{t,f}}_{t=1,...,T; f=1,...,F} corresponding to the time-frequency points of the second audio and outputs the global abstract feature c of the second audio. {υ_t}_{t=1,...,T} and c are then input together into the estimation (Estimator) model, and the unsupervised loss function (Unsupervised Loss) can be constructed based on the output of the estimation model.
In a specific embodiment, the extraction model may employ an autoregressive model, constructing a global abstract feature (which may be long-term, i.e., of lower temporal resolution) in time order from the local embedding features (the embedded feature of the current frame of the second audio); alternatively, the extraction model may use a recurrent model or a summary function to construct the global abstract feature from the local embedding features.
In a specific example, the extraction model (Abstractor) includes a fully connected layer that maps the 257 × 40-dimensional hidden vector to 600 dimensions, followed by a 2-layer BiLSTM with 600 nodes per hidden layer.
In this embodiment, the extraction model extracts global, long-term stable and slowly varying (low temporal resolution) abstract features from the embedded features through unsupervised learning, so that the characteristics of the target audio hidden in the second audio can be described more accurately, making subsequent speech separation with the extraction model more accurate.
It should be noted that the encoding model is to encode all input information into the embedded features, and the extraction model is to extract only abstract features of target information hidden in the input data, that is, abstract features of target audio included in the second audio.
In one embodiment, the computer device may use the estimation model to estimate, from the embedded features of the second audio and the abstract features of the target audio included in the second audio, the mutual information estimation features between the second audio and those abstract features.
Here, the mutual information estimation feature is a feature related to mutual information. Mutual Information is an information measure that can be viewed as the amount of information one variable contains about another variable. Since mutual information generally cannot be computed exactly, the mutual information estimation feature in this embodiment can be regarded as an estimate of the mutual information between the second audio and the abstract features of the target audio it contains.
Specifically, the estimation model may combine the embedded features of the second audio with the abstract features of the target audio included in the second audio and then perform an operation on them to obtain the mutual information estimation features between the second audio and those abstract features. The combination may specifically be concatenation, i.e., concatenating the embedded features of the second audio with the abstract features of the target audio included in the second audio.
For example, with continued reference to fig. 2, the estimation model Estimator is defined as T_ω: ν × C → R, where ω is the model parameter of the estimation model and T_ω denotes the operation that estimates the mutual information estimation feature MI between the second audio and the abstract feature c of the target audio it contains. Specifically, MI = T_ω(υ, c) = g(υ, c), where g denotes a function that combines the embedded feature υ and the abstract feature c, and MI is a real number.
In one embodiment, the estimation model Estimator specifically uses a weighting matrix ω ∈ R^{40×40} to compute the inner product: T_ω(υ, c) = υ^T ω c.
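A minimal sketch of such a bilinear estimator is given below; the initialization scale and the module name are assumptions.

```python
import torch
import torch.nn as nn

class Estimator(nn.Module):
    """Sketch of the mutual-information estimator: a bilinear score
    T_w(v, c) = v^T * omega * c with a learnable 40 x 40 weighting matrix."""
    def __init__(self, embed_dim=40):
        super().__init__()
        self.omega = nn.Parameter(torch.randn(embed_dim, embed_dim) * 0.01)

    def forward(self, v, c):
        # v: (N, D) embedded features of time-frequency points, c: (D,) abstract feature.
        return v @ self.omega @ c   # (N,) one real-valued score per time-frequency point
```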
In this embodiment, the mutual information between the mixed audio and the abstract features of the target audio it contains is estimated with the help of the estimation model; an unsupervised learning loss function can then be constructed from this mutual information based on its physical meaning, so that model training can be performed with the constructed loss function.
In one embodiment, constructing the unsupervised training loss function from the mutual information estimation features includes: selecting, according to the prediction probability of each time-frequency point, first time-frequency points predicted as positive samples; acquiring second time-frequency points as negative samples, the second time-frequency points being drawn from the noise proposal distribution obeyed by the time-frequency points of clean noise audio; and constructing the unsupervised training loss function from the mutual information estimation features corresponding to the first time-frequency points and those corresponding to the second time-frequency points.
It should be noted that, in general, the speech separation task may be regarded as a binary classification task: each time-frequency point of the audio to be separated is classified either as a positive sample, i.e., a time-frequency point of the target audio, or as a negative sample, i.e., a time-frequency point that is not the target audio. In the embodiments of the present application, a probability threshold may be set in advance, and a time-frequency point is classified as a positive sample when its prediction probability reaches or exceeds this threshold.
In addition, the time-frequency points of clean noise audio follow the noise proposal distribution. For a probability distribution p(x) that cannot be sampled directly, a tractable probability distribution q(x) can be constructed such that k·q(x) ≥ p(x) for all x; p(x) is then sampled using rejection sampling, and this q(x) is called the proposal distribution. The noise proposal distribution can thus be viewed as a proposal distribution for the probability distribution that the noise audio obeys. The computer device may therefore draw the second time-frequency points from the noise proposal distribution as negative samples, and then construct the unsupervised training loss function from the mutual information estimation features corresponding to the first time-frequency points and those corresponding to the second time-frequency points.
In one particular embodiment, the unsupervised training loss function is as follows:
L = -E_{p(x,c)} [ log ( f_Θ(x, c) / ( f_Θ(x, c) + Σ_{x'} f_Θ(x', c) ) ) ]    (4)
where f_Θ(x, c) = exp(T_ω(E_θ(x), c)), and c is the abstract representation of the target audio included in the second audio. x denotes the time-frequency points predicted as positive samples, and p(x, c) denotes the joint distribution of x and c; the computer device can use the intermediate output of the extraction model, i.e., the prediction probability p of each time-frequency point, as an estimate of p(x, c). x' denotes time-frequency points drawn from the proposal distribution of clean noise audio as negative samples. E_p(z) denotes the expectation of a variable z subject to the distribution p.
In further embodiments, x' may also be drawn from a set of time-frequency points consisting of the time-frequency points predicted as non-target audio and the time-frequency points of clean noise audio.
In the embodiments of the present application, this unsupervised training loss function may be named ImNICE (InfoMax Noise-Interference Contrastive Estimation), i.e., contrastive estimation against noise and interference based on maximizing mutual information.
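Under the reconstruction of formula (4) given above, the ImNICE loss can be sketched as follows; pairing each positive time-frequency point with its own set of negatives is an assumption of this sketch.

```python
import torch

def imnice_loss(scores_pos, scores_neg):
    """Contrastive loss in the spirit of formula (4).

    scores_pos: (N,) estimator scores T_w(E(x), c) for positive time-frequency points.
    scores_neg: (N, K) scores for K negative points drawn from the noise proposal
                distribution, paired with each positive (pairing is an assumption).
    """
    all_scores = torch.cat([scores_pos.unsqueeze(1), scores_neg], dim=1)  # (N, 1 + K)
    # Since f = exp(T), log(f / (f + sum f')) = T_pos - logsumexp over all scores.
    log_ratio = scores_pos - torch.logsumexp(all_scores, dim=1)
    return -log_ratio.mean()
```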
In this embodiment, the estimation (Estimator) model, which may also be called a mutual information estimation (MI Estimator) model, is used to estimate the mutual information between two inputs.
It will be appreciated that the joint probability distribution p(x, c) needed in the unsupervised training loss function can be estimated from the intermediate output of the extraction model, which is trained in the pre-training phase and in the subsequent supervised training phase. In this way, supervised training efficiently provides unsupervised learning with a reliable estimate of the joint probability distribution p(x, c): the probability distribution obeyed by the prediction probabilities that the intermediate layer of the extraction model outputs for each time-frequency point can be used as the estimate of p(x, c).
In one embodiment, after pre-training the coding model and the extraction model, the computer device may input the time-frequency points of the second audio into the coding model and the extraction model to obtain the prediction probability corresponding to each time-frequency point, and divide the time-frequency points, according to these prediction probabilities, into those predicted as target audio and those not predicted as target audio. The time-frequency points predicted as target audio are taken as positive samples, and negative samples are selected from the time-frequency points that are not target audio and the time-frequency points of clean noise audio. The probability distribution obeyed by the prediction probabilities output by the intermediate layer of the extraction model is used as the estimate of p(x, c) in subsequent unsupervised learning. In this way, sample division and determination of the joint probability distribution take place outside the unsupervised training iterations, which reduces the amount of computation per iteration, but convergence may be slower.
In another embodiment, the computer device may, within each unsupervised training iteration, input the time-frequency points of the second audio into the coding model and the extraction model to obtain the prediction probability corresponding to each time-frequency point, and divide the time-frequency points, according to these prediction probabilities, into those predicted as target audio and those not predicted as target audio. The time-frequency points predicted as target audio are taken as positive samples, and negative samples are selected from the time-frequency points that are not target audio and the time-frequency points of clean noise audio. The probability distribution obeyed by the prediction probabilities output by the intermediate layer of the extraction model is used as the estimate of p(x, c) in that iteration. In this way, sample division and determination of the joint probability distribution take place inside the unsupervised training iterations, which improves the convergence speed but adds computation to each iteration.
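A sketch of this sample division is shown below, using the 0.5 probability threshold and the 63 negatives per positive mentioned in the specific embodiment that follows; the tensor shapes and the helper name are assumptions.

```python
import torch

def split_samples(v, p, noise_v, prob_threshold=0.5, negatives_per_positive=63):
    """Divide time-frequency points into positives and negatives as described above.

    v: (T, F, D) embeddings of the second audio; p: (T, F) prediction probabilities;
    noise_v: (M, D) embeddings of clean-noise time-frequency points.
    """
    flat_v, flat_p = v.reshape(-1, v.size(-1)), p.reshape(-1)
    positives = flat_v[flat_p >= prob_threshold]                      # predicted target points
    negative_pool = torch.cat([flat_v[flat_p < prob_threshold], noise_v], dim=0)
    idx = torch.randint(len(negative_pool),
                        (len(positives), negatives_per_positive))
    return positives, negative_pool[idx]                              # (P, D), (P, K, D)
```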
In this embodiment, the unsupervised training function is constructed with the help of the physical meaning of mutual information, and unsupervised learning makes use of the discrimination capability learned in supervised learning, so that unsupervised and supervised learning are effectively combined and promote each other, improving the efficiency and effect of model training.
Further, the computer device may fix model parameters of the coding model, and adjust the model parameters of the extraction model and the estimation model in a direction that minimizes an unsupervised training loss function.
In the above embodiment, a large amount of unlabeled second audio is used for unsupervised training. The coding model parameters are fixed and not updated in the unsupervised learning stage; only the model parameters of the extraction model and the estimation model are updated. The abstract features can therefore be computed in the stable, discriminative embedded feature space constructed in the preceding pre-training stage, and the capability obtained by supervised learning is used to optimize the extraction capability of the unsupervised process, so that robust and generalizable abstract features of the hidden target information are extracted from the interfered mixed signal.
In a specific embodiment, the computer device may set the batch size to 32, the initial learning rate to 0.0001, the weight decay coefficient of the learning rate to 0.8, the number of nodes of the output layer of the coding model (Encoder) to 40, the number of randomly down-sampled frames per audio segment to 32, and the number of negative samples corresponding to each positive sample in the ImNICE loss of formula (4) to 63. The probability threshold for the positive-sample prediction probability is 0.5.
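These hyperparameters could be wired up as in the following sketch; the choice of the Adam optimizer and of an exponential learning-rate schedule are assumptions, since the application only states the learning-rate values.

```python
import torch

# Sketch of the unsupervised update (S106) with the hyperparameters listed above.
# Adam and ExponentialLR are assumptions; the application names only the values.
def make_unsupervised_optimizer(encoder, abstractor, estimator, lr=1e-4):
    for param in encoder.parameters():      # coding-model parameters stay fixed here
        param.requires_grad_(False)
    params = list(abstractor.parameters()) + list(estimator.parameters())
    optimizer = torch.optim.Adam(params, lr=lr)
    scheduler = torch.optim.lr_scheduler.ExponentialLR(optimizer, gamma=0.8)  # lr decay 0.8
    return optimizer, scheduler

def unsupervised_step(loss_imnice, optimizer):
    optimizer.zero_grad()
    loss_imnice.backward()
    optimizer.step()
```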
And S108, performing supervised training on the coding model and the extraction model according to the first audio and the labeled audio corresponding to the first audio, and adjusting model parameters of the coding model.
Here, supervised training may also be called supervised learning, which is a way for a machine learning model to learn based on labeled sample data. In the embodiments of the application, supervised learning and unsupervised learning share the same encoding (Encoder) model and extraction (Abstractor) model.
In one embodiment, the supervised training of the coding model and the extraction model according to the first audio and the annotated audio corresponding to the first audio includes: encoding the audio features of the first audio through the coding model to obtain the embedded features of the first audio; extracting the embedded features of the first audio through the extraction model to obtain the abstract features of the target audio included in the first audio; constructing a supervised training loss function from the annotated audio of the first audio, the embedded features of the first audio, and the abstract features of the target audio included in the first audio; and fixing the model parameters of the extraction model and adjusting the model parameters of the coding model in the direction that minimizes the supervised training loss function.
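The direction of this supervised update can be sketched as follows; the mask-style spectral reconstruction loss shown is only a hypothetical placeholder, since the concrete supervised loss is constructed from the annotated audio, the embedded features and the abstract features as described in the following embodiments.

```python
import torch

# Sketch of the supervised direction (S108): the extraction model is frozen and only
# the coding model is updated. The loss below is a hypothetical placeholder; the
# application builds its supervised loss from the annotated audio, the embedded
# features and the abstract features without giving its closed form here.
def supervised_step(encoder, abstractor, mixture_spec, target_spec, optimizer):
    for param in abstractor.parameters():
        param.requires_grad_(False)          # extraction-model parameters stay fixed

    v = encoder(mixture_spec)                # (batch, T, F, D) embedded features
    p, c_t = abstractor(v)                   # prediction probabilities and abstract features (assumed interface)

    estimate = p * mixture_spec              # placeholder mask-style reconstruction of the target spectrum
    loss = torch.mean((estimate - target_spec) ** 2)

    optimizer.zero_grad()                    # optimizer is assumed to hold the encoder parameters
    loss.backward()
    optimizer.step()
    return loss.item()
```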
In one embodiment, the computer device may perform a Fourier transform on the first audio to obtain the audio features of the first audio: for example, performing a short-time Fourier transform on the first audio to obtain its time-frequency points, and taking the time-frequency features formed by these time-frequency points as the audio features of the first audio.
Specifically, the computer device may perform a short-time Fourier transform (STFT) on the first audio to obtain its short-time Fourier spectrum, an element of the input domain χ ⊆ R^{T×F}, where T denotes the number of frames in the time dimension, F denotes the number of frequency bands in the frequency dimension, and R denotes the real numbers.
In the embodiment of the present application, the short-time Fourier spectrum of the first audio is used as the input data (training samples) of the coding model. A set of labeled training samples derived from a set of labeled first audio can then be represented as {X^{(L+U+1)}, X^{(L+U+2)}, ..., X^{(L+U+N)}} ∈ χ. Each training sample is a set of time-frequency points of the input space, X = {X_{t,f}}_{t=1,...,T; f=1,...,F}, where X_{t,f} denotes the time-frequency point of the f-th frequency band in the t-th frame. The time-frequency feature formed by these time-frequency points is a real matrix of dimension T × F.
In further embodiments, the computer device may also calculate a Gammatone power spectral feature, a spectral magnitude feature, or a Mel Frequency Cepstrum Coefficient (MFCC) feature, etc., of the first audio as the audio feature of the first audio.
In one embodiment, the computer device may map the low-dimensional audio features to a high-dimensional embedding space (Embedding Space) through the coding model, resulting in embedding features.
Specifically, the computer device may input a time-frequency point matrix (time-frequency feature) of the first audio, which is obtained by performing a short-time fourier transform on the first audio, into the coding model. And the coding model carries out nonlinear operation on the input, and embeds the input into an embedding space with D dimension to obtain the embedding characteristic of the first audio in the embedding space.
For example, the coding model Encoder is defined as E_θ: χ → ν, where θ is the model parameter of the coding model, D is the dimension of the embedding space, and E_θ denotes the computation that maps the input domain χ to the high-dimensional embedding space ν. The embedded features obtained by mapping the time-frequency features formed by a group of time-frequency points of the input space form a real matrix of dimension T × F × D. It should be noted that the input domain χ ⊆ R^{T×F} represents the short-time Fourier spectra of the audio, with T the number of frames in the time dimension and F the number of frequency bands in the frequency dimension. The input of the coding model is a group of time-frequency points (T × F) belonging to the input domain; this group can also be divided by frame into T subgroups of time-frequency points (1 × F), i.e., the time-frequency points of each audio frame. Correspondingly, the embedded features of the output domain ν include an embedded feature υ_t for each frame, i.e., each frame of the first audio corresponds to one embedded feature, which may also be referred to as a time-varying embedded feature.
In one embodiment, the computer device may process the embedded features through the first hidden layer of the extraction model to obtain the prediction probability that each time-frequency point of the first audio is a time-frequency point of the target audio, and compute, through the second hidden layer of the extraction model, the embedded features of the time-frequency points and their prediction probabilities to construct the time-varying abstract features of the target audio included in the first audio.
Specifically, the first hidden layer of the extraction model processes the embedded features of the first audio to obtain the prediction probability that each time-frequency point of the first audio is predicted as a time-frequency point of the target audio. The second hidden layer of the extraction model then computes the embedded feature and the prediction probability of each time-frequency point to construct the time-varying abstract features of the target audio included in the first audio.
For example, the extraction model Abstractor may be written as

$$A_\psi: \nu \to p \to c$$

where ψ is the model parameter of the extraction model, and $A_\psi$ denotes the operation of converting the embedded feature υ into a probability matrix p and then obtaining the abstract feature c from the embedded feature υ and the probability matrix p. Here p is a real matrix of dimension T × F, and c is a real vector of dimension D × 1 (or 1 × D). It should be noted that the input of the coding model is a set of T × F time-frequency points belonging to the input domain, and p is a real matrix of dimension T × F; p may therefore be regarded as the probability matrix composed of the prediction probabilities corresponding to each of the T × F time-frequency points, where each prediction probability represents the probability that the corresponding time-frequency point is predicted to be a time-frequency point of the target audio.
In one particular embodiment, the extraction model may compute the time-varying abstract feature by the following formula:

$$c_t = \frac{\sum_f \upsilon_{t,f} \odot p_{t,f}}{\sum_f p_{t,f}} \tag{5}$$

where $c_t \in C$ is the abstract feature of the t-th frame of the target audio included in the first audio, i.e. the time-varying abstract feature of the target audio included in the first audio; $\upsilon_t \in \nu$ is the embedded feature; $p_t \in P$ is the prediction probability; t denotes the frame index and f denotes the frequency-band index; and ⊙ denotes the element-wise product.
In one embodiment, the extraction model may multiply equation (5) by a binary threshold matrix to reduce the impact of low-energy noise, as follows:

$$c_t = \frac{\sum_f w_{t,f}\,\upsilon_{t,f} \odot p_{t,f}}{\sum_f w_{t,f}\,p_{t,f}} \tag{6}$$

where $w \in \mathbb{R}^{T \times F}$ is the binary threshold matrix, defined by the same equation (3) as in the foregoing embodiment.
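The following sketch shows one way the extraction model's two stages could be realised: a single linear "first hidden layer" produces the probability matrix p, and c_t is then formed as the (optionally threshold-weighted) probability-weighted average of equations (5)/(6). The class name, layer choice, and sizes are illustrative assumptions.

```python
import torch
import torch.nn as nn

class Abstractor(nn.Module):
    def __init__(self, embed_dim: int = 40):
        super().__init__()
        self.prob_layer = nn.Linear(embed_dim, 1)   # "first hidden layer"

    def forward(self, v: torch.Tensor, w: torch.Tensor = None):
        # v: (batch, T, F, D) embeddings; w: optional (batch, T, F) binary thresholds
        p = torch.sigmoid(self.prob_layer(v)).squeeze(-1)       # (batch, T, F)
        weight = p if w is None else p * w                       # eq. (6) variant
        num = torch.einsum('btf,btfd->btd', weight, v)           # sum_f weight * v
        den = weight.sum(dim=-1, keepdim=True).clamp(min=1e-8)   # sum_f weight
        c = num / den                                             # (batch, T, D)
        return p, c

abstractor = Abstractor()
p, c = abstractor(torch.randn(2, 100, 257, 40))
print(p.shape, c.shape)  # (2, 100, 257) and (2, 100, 40)
```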
It should be noted that, for simplicity of representation, the Embedding dimension index subscripts of c and υ are omitted from the formulas provided in the embodiments of the present application.
For example, as shown in fig. 4, in the supervised training phase, the time-frequency points of the first audio $X = \{X_{t,f}\}_{t=1,\ldots,T;\,f=1,\ldots,F}$ are input into the encoding (Encoder) model, which outputs the embedded features $\{\upsilon_t\}_{t=1,\ldots,T}$ corresponding to each frame of the first audio. The embedded features $\{\upsilon_t\}_{t=1,\ldots,T}$ are input into the extraction (Abstractor) model, which produces as an intermediate result the prediction probabilities $\{p_{t,f}\}_{t=1,\ldots,T;\,f=1,\ldots,F}$ for the time-frequency points of the first audio and outputs the time-varying abstract features $\{c_t\}_{t=1,\ldots,T}$ of the first audio. A supervised loss function (Supervised Loss) can then be constructed based on $\{X_{t,f}\}$, $\{\upsilon_t\}$, and $\{c_t\}$.
In a specific embodiment, the extraction model may adopt an autoregressive model to construct the time-varying abstract feature from the local embedding features (the embedded feature of each audio frame); alternatively, the extraction model may adopt a recurrent model or a summary function to construct the time-varying abstract feature from the local embedding features.
In this embodiment, through supervised learning the extraction model extracts time-varying abstract features with high temporal resolution from the embedded features, which allows the spectrum of the target audio in the mixed audio to be reconstructed more accurately for supervised learning.
In one embodiment, the computer device may determine a spectral mask of the target audio included in the first audio based on the embedded features of the first audio and the abstract features of the target audio included in the first audio; reconstruct the target audio based on the spectral mask; and construct a supervised training loss function, according to the difference between the reconstructed target audio and the labeled audio of the first audio, with which the coding model and the extraction model are trained.
Here, a spectrum mask (Mask) is used to separate the spectrum of an audio component contained in the mixed audio from the mixed audio. For example, suppose a mixed audio (mixed speech) contains the speech of a target speaker, denoted speech 1, and speech 1 corresponds to a spectrum mask (mask1, abbreviated M1); then the spectrum corresponding to speech 1 can be obtained by multiplying the spectrum of the mixed audio by M1.
Specifically, in supervised training the computer device may use a reconstruction-type objective function as the supervised training loss function. Training with such an objective ensures, to some extent, that the intermediate learned features are an encoding of the target audio. This also explains why incorporating the discriminative learning capability of supervised training makes it possible to efficiently estimate reliable joint probability distributions for unsupervised training.
In a specific embodiment, the supervised training loss function may specifically be the MSE (Mean Square Error) between the estimated spectrum of the target audio and the spectrum of the labeled audio:

$$\mathcal{L}_{\mathrm{MSE}}(\theta, \psi) = \big\lVert \hat{x} - x \big\rVert^{2} \tag{7}$$

where θ and ψ are the model parameters, $c_t$ is the time-varying abstract feature computed by equation (6), $\upsilon_t$ is the time-varying embedded feature, $\hat{x}$ is the spectrum of the reconstructed target audio (obtained by applying the spectral mask determined from $\upsilon_t$ and $c_t$ to the mixture spectrum), and x is the spectrum of the labeled audio. Supervised learning based on this MSE loss function can effectively use labeled training data to regularize the discriminative embedding feature space.
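A minimal sketch of this supervised loss follows; the inner-product-plus-sigmoid form of the spectral mask is an assumed instantiation (the patent only states that the mask is determined from the embedded and abstract features), and the function builds on the hypothetical `Encoder`/`Abstractor` sketches above.

```python
import torch

def supervised_mse_loss(v: torch.Tensor, c: torch.Tensor,
                        mix_spec: torch.Tensor, clean_spec: torch.Tensor) -> torch.Tensor:
    # v: (batch, T, F, D), c: (batch, T, D), mix_spec/clean_spec: (batch, T, F)
    mask = torch.sigmoid(torch.einsum('btfd,btd->btf', v, c))  # spectral mask in [0, 1]
    est_spec = mask * mix_spec                                  # reconstructed target spectrum
    return ((est_spec - clean_spec) ** 2).mean()                # MSE between spectra, eq. (7)
```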
In further embodiments, the supervised training loss function may adopt other reconstruction-type objective functions, such as a scale-invariant signal-to-noise ratio (SI-SNR) objective.
Further, the computer device may fix the model parameters of the extraction model and the estimation model, and adjust the model parameters of the coding model in the direction that minimizes the supervised training loss function.
In the above embodiment, the labeled first audio is used for supervised training; the model parameters of the extraction model and the estimation model are fixed and not updated in this supervised learning stage, and only the model parameters of the coding model are updated, so that the discriminative embedding feature space can be further fine-tuned based on the more robust and general abstract features obtained in the preceding unsupervised training stage.
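The parameter-freezing pattern described here can be sketched as follows; `encoder`, `abstractor`, `estimator`, `supervised_mse_loss`, and the optimizer are the hypothetical objects from the earlier sketches, not the patent's own implementation.

```python
def supervised_step(encoder, abstractor, estimator, batch, optimizer_theta):
    # Fix psi (Abstractor) and omega (Estimator); the unsupervised step would
    # re-enable them and freeze theta instead.
    for m in (abstractor, estimator):
        for param in m.parameters():
            param.requires_grad_(False)
    mix_spec, clean_spec = batch                 # labeled first audio and its clean target
    v = encoder(mix_spec)                        # embeddings (theta is trainable)
    p, c = abstractor(v)                         # abstract features (frozen parameters)
    loss = supervised_mse_loss(v, c, mix_spec, clean_spec)
    optimizer_theta.zero_grad()
    loss.backward()
    optimizer_theta.step()                       # update theta only
    return loss.item()
```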
S110: continue performing the unsupervised training and the supervised training so that they are carried out in an overlapping manner, and end the training once the training stop condition is met.
It can be understood that, on the one hand, supervised learning can effectively use labeled data to learn a regularized, discriminative embedding feature space, but it is limited by issues such as data efficiency, robustness, and generalization; on the other hand, unsupervised learning is a powerful learning method that improves robustness and generalization using unlabeled data. The embodiments of this application therefore provide a model training mode of overlapping supervised-unsupervised learning (ASU), in which the two update mechanisms, supervised and unsupervised, of a shared network model within the same architecture are carried out alternately and in an overlapping manner.
Specifically, a coding (Encoder) model and an extraction (Abstractor) model are first obtained in the pre-training stage, and a relatively stable, discriminative embedding (Embedding) feature space is constructed by the coding (Encoder) model; the subsequent unsupervised learning and supervised learning processes then proceed in an overlapping manner until the model converges.
In the unsupervised learning stage, the model parameters of the coding (Encoder) model are fixed and not updated; only the model parameters of the extraction (Abstractor) model and the estimation (Estimator) model are updated, so that the abstract features are computed based on the stable, discriminative embedding (Embedding) feature space constructed in the previous stage. In the supervised learning stage, the extraction (Abstractor) model and the estimation (Estimator) model are fixed and not updated; only the model parameters of the coding (Encoder) model are updated, so that the discriminative embedding (Embedding) feature space is further fine-tuned based on the more robust and general abstract features obtained in the previous stage.
In one embodiment, the computer device may set aside a portion of the first audio as test data and stop training when the MSE loss on the test data does not improve for a predetermined number of consecutive iterations. According to actual training and testing, the time required for the overlapping training of the unsupervised and supervised learning phases in the overlapping supervised-unsupervised learning (ASU) process is far shorter than the pre-training time, because the overlapping phase mainly fine-tunes the pre-trained models and therefore converges quickly.
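Putting the two stages together, a minimal sketch of the overlapping training schedule with this early-stopping rule might look as follows; `unsupervised_step`, `supervised_step`, and `evaluate_mse` are hypothetical helpers standing in for the updates described in this document.

```python
def overlap_training(encoder, abstractor, estimator,
                     labeled_loader, unlabeled_loader, test_loader,
                     opt_theta, opt_psi_omega, patience: int = 3):
    best_mse, stall = float('inf'), 0
    while stall < patience:
        for batch in unlabeled_loader:           # unsupervised phase: update psi, omega
            unsupervised_step(encoder, abstractor, estimator, batch, opt_psi_omega)
        for batch in labeled_loader:             # supervised phase: update theta
            supervised_step(encoder, abstractor, estimator, batch, opt_theta)
        mse = evaluate_mse(encoder, abstractor, test_loader)
        if mse < best_mse:
            best_mse, stall = mse, 0
        else:
            stall += 1                           # no improvement on held-out data
```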
In one embodiment, an intuitive salience-guided selection mechanism may be employed: in the model training stage, the speaker voice with the largest energy is selected as the target audio, and in the model usage stage the model automatically selects and tracks the target speaker voice with the largest energy without any target cue being provided. This training mode can be replaced by alternative schemes, a typical one being Permutation Invariant Training (PIT). PIT computes the abstract features corresponding to the target speaker's speech and the interfering signals, and determines the correct output permutation as the one with the lowest value of the reconstruction objective over all possible permutations:

$$\pi^{*} = \arg\min_{\pi \in \mathcal{P}} \sum_{s} \mathcal{L}_{\mathrm{MSE}}\big(\hat{x}^{(\pi(s))}, x^{(s)}\big)$$

where $\mathcal{P}$ denotes the set of permutations over the assignment between model outputs and reference signals.
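For reference, a minimal sketch of the PIT alternative mentioned above is given below; the per-source MSE criterion and the tensor shapes are assumptions. The function simply scores every output-to-reference assignment and keeps the lowest loss.

```python
import itertools
import torch

def pit_mse_loss(est_specs: torch.Tensor, ref_specs: torch.Tensor) -> torch.Tensor:
    # est_specs, ref_specs: (num_sources, T, F) estimated and reference spectra
    num_sources = est_specs.size(0)
    best = None
    for perm in itertools.permutations(range(num_sources)):
        loss = sum(((est_specs[p] - ref_specs[s]) ** 2).mean()
                   for s, p in enumerate(perm))
        best = loss if best is None else torch.minimum(best, loss)
    return best  # loss of the best permutation
```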
the model training method provides a model training mode combining unsupervised learning and supervised learning in an overlapping mode, combines an estimation model on the basis of a pre-trained coding model and an extraction model, and optimizes model parameters of the extraction model and the estimation model by using an unsupervised training sample unsupervised training coding model, the extraction model and the estimation model; and optimizing the model parameters of the coding model by using the marked training sample supervised training coding model and the extraction model, and overlapping the unsupervised training and the supervised training until finishing the training. Therefore, the representation capability learned by unsupervised learning and the distinguishing capability learned by supervised learning are mutually optimized in iteration, so that the effect of the trained coding model and the trained extraction model is better when speech is separated, only a small amount of labeled samples are needed in the model training process, and the cost is greatly reduced.
In one embodiment, the speech separation model training method further comprises pre-training the coding model and the extraction model. This specifically includes: performing a Fourier transform on the first audio to obtain the audio features of the first audio; encoding the audio features through the coding model to obtain the embedded features of the first audio; extracting the embedded features through the extraction model to obtain the abstract features of the target audio included in the first audio; and constructing a supervised training loss function, according to the labeled audio of the first audio, the embedded features of the first audio, and the abstract features of the target audio included in the first audio, to pre-train the coding model and the extraction model.
In the embodiment of this application, the coding model and the extraction model are pre-trained in a supervised manner. The supervised pre-training process is similar to that of S108, except that in the pre-training phase the model parameters of both the coding model and the extraction model are updated.
In one embodiment, fourier transforming the first audio to obtain audio features of the first audio comprises: carrying out short-time Fourier transform on the first audio to obtain time-frequency points of the first audio; and acquiring time-frequency characteristics formed by the time-frequency points as audio characteristics of the first audio.
In particular, the computer device may perform a short-time Fourier transform (STFT) on the first audio, resulting in the short-time Fourier spectrum of the first audio $X \in \mathbb{R}^{T \times F}$, where T denotes the number of frames in the time dimension, F denotes the number of frequency bands in the frequency dimension, and ℝ denotes the set of real numbers.
In the embodiment of the present application, the short-time Fourier spectrum of the first audio is used as the input data (training samples) of the coding model. A set of labeled training samples derived from a set of labeled first audios may then be represented as $\{X^{(L+U+1)}, X^{(L+U+2)}, \ldots, X^{(L+U+N)}\} \in \chi$. Each training sample is a set of time-frequency points of the input space: $X = \{X_{t,f}\}_{t=1,\ldots,T;\,f=1,\ldots,F}$, where $X_{t,f}$ denotes the time-frequency point of the f-th frequency band in the t-th frame. The time-frequency feature formed by these time-frequency points is a real matrix of dimension T × F.
In one embodiment, the computer device may map the low-dimensional audio features to a high-dimensional Embedding Space (Embedding Space) through the coding model, resulting in embedded (Embedding) features.
Specifically, the computer device may input the time-frequency point matrix (time-frequency feature) of the first audio, obtained by performing a short-time Fourier transform on the first audio, into the coding model. The coding model performs a nonlinear operation on this input and embeds it into a D-dimensional embedding space, yielding the embedded feature of the first audio in the embedding space.
For example, the coding model Encoder may be written as

$$E_\theta: \chi \to \nu, \qquad \nu \subseteq \mathbb{R}^{T \times F \times D}$$

where θ is the model parameter of the coding model, D is the dimension of the embedding space, and $E_\theta$ denotes the computation that maps the input domain χ to the high-dimensional embedding space ν. The embedded feature obtained by mapping the time-frequency feature formed by a group of time-frequency points of the input space is a real matrix of dimension T × F × D. It should be noted that the input domain $\chi \subseteq \mathbb{R}^{T \times F}$ represents the short-time Fourier spectrum of the audio, where T denotes the number of frames in the time dimension and F denotes the number of frequency bands in the frequency dimension. The input of the coding model is a group of T × F time-frequency points belonging to the input domain; this group can also be divided into T subgroups by frame, each subgroup (1 × F) being the time-frequency points of one frame of audio. Accordingly, the embedded feature ν of the output domain may consist of one embedded feature $\upsilon_t$ per frame of audio, i.e. each frame of the first audio corresponds to an embedded feature, which may also be referred to as a time-varying embedded feature.
In one embodiment, extracting the embedded features through the extraction model to obtain the abstract features of the target audio included in the first audio includes: processing the embedded features through a first hidden layer of the extraction model to obtain the prediction probability that each time-frequency point of the first audio is a time-frequency point of the target audio; and computing, through a second hidden layer of the extraction model, over the embedded features and the prediction probabilities of the time-frequency points to construct the time-varying abstract feature of the target audio included in the first audio.
Specifically, the first hidden layer of the extraction model processes the embedded features of the first audio to obtain the probability with which each time-frequency point of the first audio is predicted to be a time-frequency point of the target audio. The second hidden layer of the extraction model then computes over the embedded features and prediction probabilities of the time-frequency points to construct the time-varying abstract feature of the target audio included in the first audio.
For example, the extraction model Abstractor may be written as

$$A_\psi: \nu \to p \to c$$

where ψ is the model parameter of the extraction model, and $A_\psi$ denotes the operation of converting the embedded feature υ into a probability matrix p and then obtaining the abstract feature c from the embedded feature υ and the probability matrix p. Here p is a real matrix of dimension T × F, and c is a real vector of dimension D × 1 (or 1 × D). It should be noted that the input of the coding model is a set of T × F time-frequency points belonging to the input domain, and p is a real matrix of dimension T × F; p may therefore be regarded as the probability matrix composed of the prediction probabilities corresponding to each of the T × F time-frequency points, where each prediction probability represents the probability that the corresponding time-frequency point is predicted to be a time-frequency point of the target audio.
In a specific embodiment, the extraction model may calculate the time-varying abstract features by equation (5) or equation (6) above.
In a specific embodiment, the extraction model may adopt an autoregressive model to construct the time-varying abstract feature from the local embedding features (the embedded feature of the current audio frame); alternatively, the extraction model may adopt a recurrent model or a summary function to construct the time-varying abstract feature from the local embedding features.
In this embodiment, through supervised learning the extraction model extracts time-varying abstract features with high temporal resolution from the embedded features, which allows the spectrum of the target audio in the mixed audio to be reconstructed more accurately for supervised learning.
In one embodiment, constructing the supervised training loss function to pre-train the coding model and the extraction model based on the labeled audio of the first audio, the embedded features of the first audio, and the abstract features of the target audio included in the first audio comprises: determining a spectral mask of the target audio included in the first audio according to the embedded features of the first audio and the abstract features of the target audio included in the first audio; reconstructing the target audio based on the spectral mask; and constructing the supervised training loss function, according to the difference between the reconstructed target audio and the labeled audio of the first audio, to pre-train the coding model and the extraction model.
Here, what is reconstructed is the spectrum of the target audio. Specifically, in supervised training the computer device may use a reconstruction-type objective function as the supervised training loss function. Training with such an objective ensures, to some extent, that the intermediate learned features are an encoding of the target audio. This also explains why incorporating the discriminative learning capability of supervised training makes it possible to efficiently estimate reliable joint probability distributions for unsupervised training.
In a specific embodiment, the supervised training loss function may specifically be the MSE (Mean Square Error) between the estimated spectrum of the target audio and the spectrum of the labeled audio, as in equation (7) above.
In further embodiments, the supervised training loss function may adopt other reconstruction-type objective functions, such as a scale-invariant signal-to-noise ratio (SI-SNR) objective.
Further, the computer device may adjust the model parameters of the coding model and the extraction model in the direction that minimizes the supervised training loss function.
In the above embodiment, the labeled first audio is used for supervised pre-training of the coding model and the extraction model. The coding model thereby constructs a relatively stable, discriminative embedding feature space, and based on this stable, discriminative embedding feature space, reliable joint probability distributions can be estimated effectively for the subsequent unsupervised learning.
In a specific embodiment, the labeled first audio consists of labeled mixture signals (mixture samples), and the unlabeled second audio includes unlabeled mixture signals and clean noise signals.
When jointly training the coding model, the extraction model, and the estimation model, the computer device may obtain the spectra of the labeled mixed signals $\{X^{(L+U+1)}, X^{(L+U+2)}, \ldots, X^{(L+U+N)}\} \in \chi$, the spectra of the unlabeled mixed signals $\{X^{(1)}, X^{(2)}, \ldots, X^{(L)}\} \in \chi$, and the spectra of the clean noise signals $\{X^{(L+1)}, X^{(L+2)}, \ldots, X^{(L+U)}\} \in \chi$. The time-frequency points of these spectra, $X = \{X_{t,f}\}_{t=1,\ldots,T;\,f=1,\ldots,F}$, are used as input data, where T is the number of frames in the time dimension, F is the number of frequency bands in the frequency dimension, and $X_{t,f}$ denotes the time-frequency point of the f-th band in the t-th frame. For example, the mixed signals may be sampled at 16 kHz, and the spectra may be computed with an STFT window length of 25 ms, a window shift of 10 ms, and 257 STFT frequency bands.
For example, the batch size may be set to 32, the initial learning rate to 0.0001, and the decay coefficient of the learning rate to 0.8.
In the pre-training stage, the computer device may divide the labeled mixed signals into one or more batches according to the batch size. The time-frequency points of each batch of mixed signals are input into the coding model and the extraction model, the supervised loss $\mathcal{L}_{\mathrm{MSE}}(\theta, \psi)$ of equation (7) is computed, and the model parameters (θ, ψ) of the coding model Encoder and the extraction model Abstractor are updated until the model converges.
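A minimal sketch of one such pre-training update, reusing the hypothetical `supervised_mse_loss` from above and updating θ and ψ jointly, could look like this:

```python
def pretrain_step(encoder, abstractor, batch, optimizer_theta_psi):
    mix_spec, clean_spec = batch                 # one batch of labeled mixed signals
    v = encoder(mix_spec)                        # embeddings (theta trainable)
    p, c = abstractor(v)                         # abstract features (psi trainable)
    loss = supervised_mse_loss(v, c, mix_spec, clean_spec)   # eq. (7)
    optimizer_theta_psi.zero_grad()
    loss.backward()
    optimizer_theta_psi.step()                   # update theta and psi together
    return loss.item()
```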
The computer device then calculates the prediction probabilities p with the pre-trained extraction model, and divides the time-frequency points of the labeled mixed signals and the unlabeled mixed signals into time-frequency-point positive samples and time-frequency-point negative samples:

$$X_{t,f} \in \begin{cases} \text{positive samples}, & p_{t,f} > \Gamma^{+} \\ \text{negative samples}, & p_{t,f} < \Gamma^{-} \end{cases}$$

where $\Gamma^{+}$ and $\Gamma^{-}$ are probability thresholds, e.g. $\Gamma^{+} = 0.5$.
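The thresholding rule above can be sketched as follows; Γ⁺ = 0.5 follows the example, while the default value of Γ⁻ is an assumption since the document does not specify it.

```python
import torch

def split_samples(p: torch.Tensor, gamma_pos: float = 0.5, gamma_neg: float = 0.5):
    # p: (T, F) prediction probabilities for one mixed signal
    pos_mask = p > gamma_pos                      # time-frequency-point positive samples
    neg_mask = p < gamma_neg                      # time-frequency-point negative samples
    pos_idx = pos_mask.nonzero(as_tuple=False)    # (N_pos, 2) indices (t, f)
    neg_idx = neg_mask.nonzero(as_tuple=False)    # (N_neg, 2) indices (t, f)
    return pos_idx, neg_idx
```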
In the overlapping training phase, the computer device may divide the mixed signals into one or more batches according to the batch size. For each time-frequency-point positive sample, K time-frequency-point negative samples are randomly drawn from the joint set of noise and interference time-frequency points. The unsupervised loss of equation (4) is then computed to update the model parameters (ψ, ω) of the extraction model Abstractor and the mutual information estimation model MI-Estimator, and the supervised loss of equation (7) is computed to update the model parameter θ of the coding model Encoder, until the model converges.
For example, the number of nodes in the output layer of the Encoder may be set to 40, the number of randomly down-sampled frames for each segment of mixed signal to 32, and the number of negative samples corresponding to each positive sample in equation (4) to 63. The division into time-frequency-point positive and negative samples may be performed outside or inside the iterations of the overlapping training stage; the former requires less computation per iteration but may converge more slowly, while the latter requires more computation per iteration but converges faster. When the MSE loss of the model does not improve for 3 consecutive training iterations, the training may be considered to have converged and ends.
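The per-positive-sample negative sampling described above (K = 63 in this example) might be sketched as follows; the surrounding computation of the equation (4) loss is omitted.

```python
import torch

def sample_negatives(neg_pool: torch.Tensor, num_pos: int, k: int = 63) -> torch.Tensor:
    # neg_pool: (N_neg, 2) indices of all negative time-frequency points
    idx = torch.randint(neg_pool.size(0), (num_pos, k))   # sample with replacement
    return neg_pool[idx]                                   # (num_pos, K, 2)
```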
In one embodiment, the speech separation model training method further includes a model usage step, which specifically includes: obtaining mixed audio to be subjected to voice separation; processing the audio features of the mixed audio through the coding model obtained after the overlapping unsupervised and supervised training has ended, to obtain the embedded features of the mixed audio; processing the embedded features of the mixed audio through the extraction model obtained after the overlapping unsupervised and supervised training has ended, to obtain the abstract features of the target audio included in the mixed audio; and reconstructing the target audio in the mixed audio according to the embedded features of the mixed audio and the abstract features of the target audio included in the mixed audio.
Wherein the mixed audio to be subjected to voice separation is audio mixed with the target audio. The target audio may specifically be a target speaker voice. The mixed audio may specifically be audio recorded in a conversation scene of more than one speaker, or speaker voice recorded in a noisy environment, etc.
For example, FIG. 5 shows a schematic diagram of a speech separation scenario in one embodiment. Referring to FIG. 5, more than one speaker is included in the diagram. When these speakers are in conversation, audio is collected through a far-field microphone, resulting in mixed audio. The far-field microphone transmits the collected audio data to the computer equipment, and the computer equipment obtains the mixed audio to be subjected to voice separation.
Referring to fig. 6, specifically, after the computer device obtains the mixed audio to be subjected to voice separation, the computer device may perform short-time fourier transform on the mixed audio to obtain a short-time fourier spectrum of the mixed audio; and then inputting the time-frequency points of the short-time Fourier spectrum into a coding model obtained after the overlapping of the unsupervised training and the supervised training is finished, outputting the embedded characteristics of the mixed audio by the coding model, inputting the embedded characteristics of the mixed audio into an extraction model obtained after the overlapping of the unsupervised training and the supervised training is finished, and outputting the abstract characteristics of the target audio included in the mixed audio by the extraction model. The computer equipment generates a spectrum mask of the target audio in the mixed audio according to the embedded characteristic of the mixed audio and the abstract characteristic of the target audio included in the mixed audio; and then, obtaining the frequency spectrum of the target audio according to the short-time Fourier spectrum of the mixed audio, thereby separating the target audio.
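A minimal sketch of this separation pipeline, reusing the hypothetical `stft_feature`, `Encoder`, and `Abstractor` sketches from earlier and the same assumed mask construction, is shown below; phase handling and the inverse STFT needed to recover a waveform are omitted.

```python
import torch

@torch.no_grad()
def separate(waveform: torch.Tensor, encoder, abstractor) -> torch.Tensor:
    mix_spec = stft_feature(waveform).unsqueeze(0)            # (1, T, F) magnitude spectrum
    v = encoder(mix_spec)                                     # (1, T, F, D) embeddings
    _, c = abstractor(v)                                      # (1, T, D) abstract features
    mask = torch.sigmoid(torch.einsum('btfd,btd->btf', v, c)) # spectral mask of target audio
    return (mask * mix_spec).squeeze(0)                       # separated target spectrum (T, F)
```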
In this embodiment, the coding model and the extraction model obtained after the overlapping unsupervised and supervised training can effectively extract robust, generalizable features of the hidden signal from the mixed signal, which makes it easier to separate the hidden signal from the mixed signal.
In addition, the present application was tested against other existing methods that use unsupervised learning under various interference environments and various signal-to-noise-ratio conditions, including background-music noise, interference from other speakers, and background noise at 0 dB-20 dB. The test results show that the model training mode provided by this application outperforms these existing methods in speech separation performance, including metrics such as Perceptual Evaluation of Speech Quality (PESQ), Short-Time Objective Intelligibility (STOI), and Signal-to-Distortion Ratio (SDR), as well as in stability. Moreover, the model training method provided by this application automatically learns the characteristics of the target audio included in the mixed audio (e.g., the characteristics of the target speaker's voice hidden in the mixed signal), and requires no additional Permutation Invariant Training (PIT) processing, speaker-tracking mechanism, or expert-defined processing and tuning.
In the embodiments of this application, the coding model and extraction model trained with the provided model training method can effectively learn robust, generalizable features of the hidden signal from an interfered mixed signal. In addition, the embodiments of this application can exploit large amounts of unlabeled data from real industrial application scenarios; the greater the mismatch between the training data scenario and the scenario in which the model is actually used, the more pronounced the advantage of the overlapping supervised and unsupervised training mode provided by the embodiments of this application.
The coding model and the extraction model trained in the embodiments of this application can be applied well to single-channel speech separation and can effectively address the classic cocktail-party problem.
It should be understood that, although the steps in the flowcharts of the above embodiments are shown in sequence as indicated by the arrows, these steps are not necessarily executed in that order. Unless explicitly stated otherwise, the order of execution of these steps is not strictly limited, and they may be performed in other orders. Moreover, at least some of the steps in the above embodiments may include multiple sub-steps or stages, which are not necessarily performed at the same moment but may be performed at different moments, and which are not necessarily performed in sequence but may be performed in turn or alternately with other steps, or with at least some of the sub-steps or stages of other steps.
As shown in FIG. 7, in one embodiment, a speech separation model training apparatus 700 is provided. Referring to fig. 7, the speech separation model training apparatus 700 includes: an acquisition module 701, a first training module 702, a second training module 703, and an overlap module 704.
An obtaining module 701, configured to obtain a first audio and a second audio; the first audio comprises a target audio and a corresponding annotated audio; the second audio comprises noise audio; acquiring a coding model, an extraction model and an initial estimation model; wherein, the output of the coding model is the input of the extraction model; the output of the coding model and the output of the extraction model are jointly the input of the estimation model; the coding model and the extraction model are jointly used for speech separation.
A first training module 702, configured to perform unsupervised training on the coding model, the extraction model, and the estimation model according to the second audio, and adjust model parameters of the extraction model and the estimation model.
The second training module 703 is configured to perform supervised training on the coding model and the extraction model according to the first audio and the labeled audio corresponding to the first audio, and adjust model parameters of the coding model.
And an overlap module 704, configured to continue the unsupervised training and the supervised training, so that the unsupervised training and the supervised training overlap each other, and the training is ended until the training stop condition is met.
As shown in fig. 8, in one embodiment, the speech separation model training apparatus 700 further includes: the pre-training module 705 is configured to perform fourier transform on the first audio to obtain audio features of the first audio; coding the audio features through a coding model to obtain embedded features of the first audio; extracting the embedded features through an extraction model to obtain abstract features of a target audio included in the first audio; and constructing a supervised training loss function pre-training coding model and an extraction model according to the labeled audio of the first audio, the embedded characteristic of the first audio and the abstract characteristic of the target audio included in the first audio.
In one embodiment, the pre-training module 705 is further configured to perform short-time fourier transform on the first audio to obtain time-frequency points of the first audio; and acquiring time-frequency characteristics formed by the time-frequency points as audio characteristics of the first audio.
In one embodiment, the pre-training module 705 is further configured to process the embedded feature by extracting a first hidden layer of the model, so as to obtain a prediction probability that a time-frequency point of the first audio is a time-frequency point of the target audio; and calculating the embedding characteristics of the time frequency points and the prediction probability of the time frequency points by extracting a second hidden layer of the model, and constructing the time-varying abstract characteristics of the target audio included in the first audio.
In one embodiment, the pre-training module 705 is further configured to determine a spectral mask of the target audio included in the first audio according to the embedded features of the first audio and the abstract features of the target audio included in the first audio; reconstructing the target audio based on the spectral mask; and constructing a pre-training coding model and an extraction model with a supervised training loss function according to the difference between the reconstructed target audio and the labeled audio of the first audio.
In one embodiment, the first training module 702 is further configured to encode the audio feature of the second audio through the encoding model to obtain an embedded feature of the second audio; extracting the embedded features of the second audio through the extraction model to obtain abstract features of the target audio included in the second audio; processing the embedded characteristic of the second audio and the abstract characteristic of the target audio included in the second audio through an estimation model to obtain mutual information estimation characteristics between the second audio and the abstract characteristic of the target audio included in the second audio; constructing an unsupervised training loss function according to the mutual information estimation characteristics; and fixing the model parameters of the coding model, and adjusting the model parameters of the extraction model and the estimation model according to the direction of the minimum unsupervised training loss function.
In one embodiment, the first training module 702 is further configured to process the embedded feature of the second audio by extracting the first hidden layer of the model, so as to obtain a prediction probability that the time-frequency point of the second audio is the time-frequency point of the target audio; and calculating the embedded characteristics of the time frequency points and the prediction probability of the time frequency points according to the time sequence by extracting a second hidden layer of the model, and constructing the global abstract characteristics of the target audio included in the second audio.
In one embodiment, the first training module 702 is further configured to partition a first time-frequency point predicted as a positive sample according to the prediction probability of each time-frequency point; acquiring a second time frequency point serving as a negative sample; the second time frequency point is taken from the noise proposed distribution obeyed by the time frequency point of the pure noise audio frequency; and constructing an unsupervised training loss function according to the mutual information estimation characteristics corresponding to the first time frequency point and the mutual information estimation characteristics corresponding to the second time frequency point.
In one embodiment, the second training module 703 is further configured to encode the audio feature of the first audio through the coding model to obtain an embedded feature of the first audio; extracting the embedded characteristics of the first audio through an extraction model to obtain abstract characteristics of a target audio included in the first audio; constructing a supervised training loss function according to the labeled audio of the first audio, the embedded characteristic of the first audio and the abstract characteristic of the target audio included in the first audio; and fixing the model parameters of the extracted model, and adjusting the model parameters of the coding model according to the direction of the minimum supervised training loss function.
As shown in fig. 9, in one embodiment, the speech separation model training apparatus 700 further includes: a use module 706, configured to obtain mixed audio to be subjected to voice separation; processing the audio features of the mixed audio through an encoding model obtained after the unsupervised training and the supervised training are overlapped to obtain the embedded features of the mixed audio; processing the embedded characteristics of the mixed audio through an extraction model obtained after the overlapping of the unsupervised training and the supervised training is finished to obtain the abstract characteristics of the target voice included in the mixed audio; and reconstructing the target voice in the mixed audio according to the embedded characteristic of the mixed audio and the abstract characteristic of the target audio included in the mixed audio.
In one embodiment, the first audio and the second audio are single channel audio; the first audio is mixed audio including a target audio; the marked audio of the first audio is pure target audio; the second audio includes clean noise audio and mixed audio including the noise audio.
FIG. 10 is a diagram illustrating an internal structure of a computer device in one embodiment. The computer device may specifically be the terminal 110 or the server 120 in fig. 3. As shown in fig. 10, the computer device includes a processor, a memory, and a network interface connected by a system bus. Wherein the memory includes a non-volatile storage medium and an internal memory. The non-volatile storage medium of the computer device stores an operating system and may also store a computer program that, when executed by the processor, causes the processor to implement a speech separation model training method. The internal memory may also have stored therein a computer program that, when executed by the processor, causes the processor to perform a speech separation model training method. Those skilled in the art will appreciate that the architecture shown in fig. 10 is merely a block diagram of some of the structures associated with the disclosed aspects and is not intended to limit the computing devices to which the disclosed aspects apply, as particular computing devices may include more or less components than those shown, or may combine certain components, or have a different arrangement of components.
In one embodiment, the speech separation model training apparatus provided herein may be implemented in the form of a computer program that is executable on a computer device such as the one shown in fig. 10. The memory of the computer device may store various program modules constituting the speech separation model training apparatus, such as an acquisition module 701, a first training module 702, a second training module 703 and an overlap module 704 shown in fig. 7. The program modules constitute computer programs that cause the processors to execute the steps of the speech separation model training methods of the embodiments of the present application described in the present specification.
For example, the computer device shown in fig. 10 may perform the acquiring of the first audio and the second audio by the acquiring module 701 in the speech separation model training apparatus shown in fig. 7; the first audio comprises a target audio and a corresponding annotated audio; the second audio comprises noise audio; acquiring a coding model, an extraction model and an initial estimation model; wherein, the output of the coding model is the input of the extraction model; the output of the coding model and the output of the extraction model are jointly the input of the estimation model; the coding model and the extraction model are jointly used for speech separation. The steps of unsupervised training of the coding model, the extraction model and the estimation model based on the second audio, adjusting the model parameters of the extraction model and the estimation model are performed by the first training module 702. The second training module 703 is used to perform supervised training on the coding model and the extraction model according to the first audio and the labeled audio corresponding to the first audio, and adjust the model parameters of the coding model. The step of continuing the unsupervised training and the supervised training to overlap the unsupervised training and the supervised training until the training stop condition is satisfied is performed by the overlap module 704.
In one embodiment, a computer device is provided, comprising a memory and a processor, the memory storing a computer program which, when executed by the processor, causes the processor to perform the steps of the above-described speech separation model training method. Here, the steps of the speech separation model training method may be the steps of the speech separation model training methods of the above embodiments.
In one embodiment, a computer-readable storage medium is provided, in which a computer program is stored, which, when being executed by a processor, causes the processor to carry out the steps of the above-mentioned speech separation model training method. Here, the steps of the speech separation model training method may be the steps of the speech separation model training methods of the above embodiments.
It will be understood by those skilled in the art that all or part of the processes of the methods of the embodiments described above can be implemented by a computer program, which can be stored in a non-volatile computer-readable storage medium, and can include the processes of the embodiments of the methods described above when the program is executed. Any reference to memory, storage, database, or other medium used in the embodiments provided herein may include non-volatile and/or volatile memory, among others. Non-volatile memory can include read-only memory (ROM), Programmable ROM (PROM), Electrically Programmable ROM (EPROM), Electrically Erasable Programmable ROM (EEPROM), or flash memory. Volatile memory can include Random Access Memory (RAM) or external cache memory. By way of illustration and not limitation, RAM is available in a variety of forms such as Static RAM (SRAM), Dynamic RAM (DRAM), Synchronous DRAM (SDRAM), Double Data Rate SDRAM (DDRSDRAM), Enhanced SDRAM (ESDRAM), Synchronous Link DRAM (SLDRAM), Rambus Direct RAM (RDRAM), direct bus dynamic RAM (DRDRAM), and memory bus dynamic RAM (RDRAM).
The technical features of the above embodiments can be arbitrarily combined, and for the sake of brevity, all possible combinations of the technical features in the above embodiments are not described, but should be considered as the scope of the present specification as long as there is no contradiction between the combinations of the technical features.
The above-mentioned embodiments only express several embodiments of the present application, and the description thereof is more specific and detailed, but not construed as limiting the scope of the present application. It should be noted that, for a person skilled in the art, several variations and modifications can be made without departing from the concept of the present application, which falls within the scope of protection of the present application. Therefore, the protection scope of the present patent shall be subject to the appended claims.

Claims (15)

1. A method of speech separation model training, comprising:
acquiring a first audio and a second audio; the first audio comprises a target audio and a corresponding annotated audio; the second audio comprises noise audio;
acquiring a coding model, an extraction model and an initial estimation model;
unsupervised training of the coding model, the extraction model and the estimation model according to the second audio, and adjustment of model parameters of the extraction model and the estimation model;
carrying out supervised training on the coding model and the extraction model according to the first audio and the labeled audio corresponding to the first audio, and adjusting model parameters of the coding model;
continuing the unsupervised training and the supervised training to overlap the unsupervised training and the supervised training until a training stop condition is met and ending the training;
wherein the output of the coding model is the input of the extraction model; the output of the coding model and the output of the extraction model are together the input of the estimation model; the coding model and the extraction model are jointly used for speech separation.
2. The method of claim 1, further comprising:
performing Fourier transform on the first audio to obtain audio characteristics of the first audio;
coding the audio features through a coding model to obtain embedded features of the first audio;
extracting the embedded features through an extraction model to obtain abstract features of target audio included in the first audio;
and constructing a supervised training loss function to pre-train the coding model and the extraction model according to the labeled audio of the first audio, the embedded characteristic of the first audio and the abstract characteristic of the target audio included in the first audio.
3. The method of claim 2, wherein the fourier transforming the first audio to obtain the audio features of the first audio comprises:
carrying out short-time Fourier transform on the first audio to obtain a time-frequency point of the first audio;
and acquiring time-frequency characteristics formed by the time-frequency points as audio characteristics of the first audio.
4. The method according to claim 3, wherein the extracting the embedded feature through the extraction model to obtain the abstract feature of the target audio included in the first audio comprises:
processing the embedded features through a first hidden layer of the extraction model to obtain the prediction probability of the time-frequency point of the first audio frequency as the time-frequency point of the target audio frequency;
and calculating the embedding characteristics of the time frequency points and the prediction probability of the time frequency points through a second hidden layer of the extraction model, and constructing the time-varying abstract characteristics of the target audio included in the first audio.
5. The method of claim 2, wherein constructing the supervised training loss function pre-training the coding model and the extraction model based on the annotated audio of the first audio, the embedded features of the first audio, and the abstract features of the target audio included in the first audio comprises:
determining a spectral mask of a target audio included in the first audio according to the embedded characteristic of the first audio and the abstract characteristic of the target audio included in the first audio;
reconstructing the target audio based on the spectral mask;
constructing a supervised training loss function pre-training the coding model and the extraction model according to a difference between the reconstructed target audio and the annotated audio of the first audio.
6. The method of claim 1, wherein the unsupervised training of the coding model, the extraction model, and the estimation model based on the second audio, the adjusting of model parameters of the extraction model and the estimation model, comprises:
coding the audio features of the second audio through the coding model to obtain embedded features of the second audio;
extracting the embedded features of the second audio through the extraction model to obtain abstract features of the target audio included in the second audio;
processing the embedded characteristic of the second audio and the abstract characteristic of the target audio included in the second audio through the estimation model to obtain mutual information estimation characteristics between the second audio and the abstract characteristic of the target audio included in the second audio;
constructing an unsupervised training loss function according to the mutual information estimation characteristics;
and fixing the model parameters of the coding model, and adjusting the model parameters of the extraction model and the estimation model according to the direction of minimizing the unsupervised training loss function.
7. The method according to claim 6, wherein the extracting the embedded feature of the second audio through the extraction model to obtain the abstract feature of the target audio included in the second audio comprises:
processing the embedded characteristics of the second audio through a first hidden layer of the extraction model to obtain the prediction probability of the time frequency point of the second audio as the time frequency point of the target audio;
and calculating the embedded characteristics of the time frequency points and the prediction probability of the time frequency points according to time sequence through a second hidden layer of the extraction model, and constructing the global abstract characteristics of the target audio included in the second audio.
8. The method of claim 7, wherein constructing an unsupervised training loss function based on the mutual information estimation features comprises:
dividing a first time frequency point predicted as a positive sample according to the prediction probability of each time frequency point;
acquiring a second time frequency point serving as a negative sample; the second time frequency point is taken from the noise proposed distribution obeyed by the time frequency point of the pure noise audio frequency;
and constructing an unsupervised training loss function according to the mutual information estimation characteristics corresponding to the first time frequency point and the mutual information estimation characteristics corresponding to the second time frequency point.
9. The method of claim 1, wherein the supervised training of the coding model and the extraction model according to the first audio and the labeled audio corresponding to the first audio, and the adjusting of the model parameters of the coding model comprises:
coding the audio features of the first audio through the coding model to obtain embedded features of the first audio;
extracting the embedded features of the first audio through the extraction model to obtain abstract features of a target audio included in the first audio;
constructing a supervised training loss function according to the labeled audio of the first audio, the embedded characteristic of the first audio and the abstract characteristic of the target audio included in the first audio;
and fixing the model parameters of the extraction model, and adjusting the model parameters of the coding model according to the direction of minimizing the supervised training loss function.
10. The method according to any one of claims 1 to 9, further comprising:
acquiring mixed audio to be subjected to voice separation;
processing the audio features of the mixed audio through a coding model obtained after the unsupervised training and the supervised training are overlapped to obtain the embedded features of the mixed audio;
processing the embedded features of the mixed audio through an extraction model obtained after the unsupervised training and the supervised training are overlapped to obtain abstract features of target voice included in the mixed audio;
and reconstructing the target voice in the mixed audio according to the embedded characteristics of the mixed audio and the abstract characteristics of the target audio included in the mixed audio.
11. The method of any of claims 1-9, wherein the first audio and the second audio are single channel audio; the first audio is mixed audio including target audio; the marked audio of the first audio is the pure target audio; the second audio includes clean noise audio and mixed audio including noise audio.
12. A speech separation model training apparatus comprising:
the acquisition module is used for acquiring a first audio and a second audio; the first audio comprises a target audio and a corresponding annotated audio; the second audio comprises noise audio; acquiring a coding model, an extraction model and an initial estimation model;
a first training module, configured to perform unsupervised training on the coding model, the extraction model, and the estimation model according to the second audio, and adjust model parameters of the extraction model and the estimation model;
the second training module is used for carrying out supervised training on the coding model and the extraction model according to the first audio and the labeled audio corresponding to the first audio, and adjusting the model parameters of the coding model;
an overlapping module, configured to continue the unsupervised training and the supervised training, so that the unsupervised training and the supervised training are overlapped, and the training is ended until a training stop condition is met;
wherein the output of the coding model is the input of the extraction model; the output of the coding model and the output of the extraction model are together the input of the estimation model; the coding model and the extraction model are jointly used for speech separation.
13. The apparatus of claim 12, further comprising:
the pre-training module is used for carrying out Fourier transform on the first audio to obtain audio features of the first audio; coding the audio features through a coding model to obtain embedded features of the first audio; extracting the embedded features through an extraction model to obtain abstract features of target audio included in the first audio; and constructing a supervised training loss function to pre-train the coding model and the extraction model according to the labeled audio of the first audio, the embedded characteristic of the first audio and the abstract characteristic of the target audio included in the first audio.
14. A computer-readable storage medium, storing a computer program which, when executed by a processor, causes the processor to carry out the steps of the method according to any one of claims 1 to 11.
15. A computer device comprising a memory and a processor, the memory storing a computer program that, when executed by the processor, causes the processor to perform the steps of the method according to any one of claims 1 to 11.
CN202010013978.7A 2020-01-07 2020-01-07 Voice separation model training method and device, storage medium and computer equipment Active CN111243620B (en)

Priority Applications (4)

Application Number Priority Date Filing Date Title
CN202010013978.7A CN111243620B (en) 2020-01-07 2020-01-07 Voice separation model training method and device, storage medium and computer equipment
PCT/CN2020/120815 WO2021139294A1 (en) 2020-01-07 2020-10-14 Method and apparatus for training speech separation model, storage medium, and computer device
EP20912595.4A EP4002362B1 (en) 2020-01-07 2020-10-14 Method and apparatus for training speech separation model, storage medium, and computer device
US17/672,565 US11908455B2 (en) 2020-01-07 2022-02-15 Speech separation model training method and apparatus, storage medium and computer device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010013978.7A CN111243620B (en) 2020-01-07 2020-01-07 Voice separation model training method and device, storage medium and computer equipment

Publications (2)

Publication Number Publication Date
CN111243620A true CN111243620A (en) 2020-06-05
CN111243620B CN111243620B (en) 2022-07-19

Family

ID=70879794

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010013978.7A Active CN111243620B (en) 2020-01-07 2020-01-07 Voice separation model training method and device, storage medium and computer equipment

Country Status (4)

Country Link
US (1) US11908455B2 (en)
EP (1) EP4002362B1 (en)
CN (1) CN111243620B (en)
WO (1) WO2021139294A1 (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116778006B (en) * 2023-06-25 2024-04-02 北京百度网讯科技有限公司 Modeling method and device for picture encoder, electronic equipment and storage medium

Family Cites Families (20)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP2028651A1 (en) * 2007-08-24 2009-02-25 Sound Intelligence B.V. Method and apparatus for detection of specific input signal contributions
US9812150B2 (en) * 2013-08-28 2017-11-07 Accusonus, Inc. Methods and systems for improved signal decomposition
US10375143B2 (en) * 2016-08-26 2019-08-06 Cisco Technology, Inc. Learning indicators of compromise with hierarchical models
US9886954B1 (en) * 2016-09-30 2018-02-06 Doppler Labs, Inc. Context aware hearing optimization engine
US20190014137A1 (en) * 2017-07-10 2019-01-10 ZingBox, Inc. IoT DEVICE SECURITY
CN108021931A (en) * 2017-11-20 2018-05-11 阿里巴巴集团控股有限公司 A kind of data sample label processing method and device
JP6902485B2 (en) * 2018-02-20 2021-07-14 日本電信電話株式会社 Audio signal analyzers, methods, and programs
CN109036454A (en) * 2018-06-06 2018-12-18 安徽继远软件有限公司 The isolated method and system of the unrelated single channel recording of speaker based on DNN
US11170761B2 (en) * 2018-12-04 2021-11-09 Sorenson Ip Holdings, Llc Training of speech recognition systems
CN109801644B (en) * 2018-12-20 2021-03-09 北京达佳互联信息技术有限公司 Separation method, separation device, electronic equipment and readable medium for mixed sound signal
US10896679B1 (en) * 2019-03-26 2021-01-19 Amazon Technologies, Inc. Ambient device state content display
US11586930B2 (en) * 2019-04-16 2023-02-21 Microsoft Technology Licensing, Llc Conditional teacher-student learning for model training
US11398238B2 (en) * 2019-06-07 2022-07-26 Lg Electronics Inc. Speech recognition method in edge computing device
US11574026B2 (en) * 2019-07-17 2023-02-07 Avanade Holdings Llc Analytics-driven recommendation engine
CN110473566A (en) * 2019-07-25 2019-11-19 深圳壹账通智能科技有限公司 Audio separation method, device, electronic equipment and computer readable storage medium
CN110428842A (en) 2019-08-13 2019-11-08 广州国音智能科技有限公司 Speech model training method, device, equipment and computer readable storage medium
CN110634502B (en) * 2019-09-06 2022-02-11 南京邮电大学 Single-channel voice separation algorithm based on deep neural network
US11410644B2 (en) * 2019-10-18 2022-08-09 Invoca, Inc. Generating training datasets for a supervised learning topic model from outputs of a discovery topic model
KR20210074632A (en) * 2019-12-12 2021-06-22 엘지전자 주식회사 Phoneme based natural language processing
CN111243620B (en) * 2020-01-07 2022-07-19 腾讯科技(深圳)有限公司 Voice separation model training method and device, storage medium and computer equipment

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20180366107A1 (en) * 2017-06-16 2018-12-20 Baidu Online Network Technology (Beijing) Co., Ltd. Method and device for training acoustic model, computer device and storage medium
US20190147333A1 (en) * 2017-11-15 2019-05-16 Palo Alto Research Center Incorporated System and method for semi-supervised conditional generative modeling using adversarial networks

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
MAX W. Y. LAM ET AL: "MIXUP-BREAKDOWN: A CONSISTENCY TRAINING METHOD FOR IMPROVING GENERALIZATION OF SPEECH SEPARATION MODELS", 《ARXIV:1910.13253V2》 *
YI LUO ET AL: "Conv-TasNet: Surpassing Ideal Time–Frequency Magnitude Masking for Speech Separation", 《IEEE/ACM TRANSACTIONS ON AUDIO, SPEECH, AND LANGUAGE PROCESSING》 *

Cited By (28)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2021139294A1 (en) * 2020-01-07 2021-07-15 腾讯科技(深圳)有限公司 Method and apparatus for training speech separation model, storage medium, and computer device
US11908455B2 (en) 2020-01-07 2024-02-20 Tencent Technology (Shenzhen) Company Limited Speech separation model training method and apparatus, storage medium and computer device
US11741984B2 (en) 2020-06-12 2023-08-29 Academia Sinica Method and apparatus and telephonic system for acoustic scene conversion
TWI811692B (en) * 2020-06-12 2023-08-11 中央研究院 Method and apparatus and telephony system for acoustic scene conversion
CN111816208A (en) * 2020-06-17 2020-10-23 厦门快商通科技股份有限公司 Voice separation quality evaluation method and device and computer storage medium
CN111899759B (en) * 2020-07-27 2021-09-03 北京嘀嘀无限科技发展有限公司 Method, device, equipment and medium for pre-training and model training of audio data
CN111899759A (en) * 2020-07-27 2020-11-06 北京嘀嘀无限科技发展有限公司 Method, device, equipment and medium for pre-training and model training of audio data
CN113516996B (en) * 2021-01-08 2024-01-26 腾讯科技(深圳)有限公司 Voice separation method, device, computer equipment and storage medium
CN113516996A (en) * 2021-01-08 2021-10-19 腾讯科技(深圳)有限公司 Voice separation method and device, computer equipment and storage medium
CN112786068B (en) * 2021-01-12 2024-01-16 普联国际有限公司 Audio sound source separation method, device and storage medium
CN112786068A (en) * 2021-01-12 2021-05-11 普联国际有限公司 Audio source separation method and device and storage medium
CN113035169A (en) * 2021-03-12 2021-06-25 北京帝派智能科技有限公司 Voice synthesis method and system capable of training personalized tone library on line
CN113035169B (en) * 2021-03-12 2021-12-07 北京帝派智能科技有限公司 Voice synthesis method and system capable of training personalized tone library on line
CN112951203B (en) * 2021-04-25 2023-12-29 平安创科科技(北京)有限公司 Speech synthesis method, device, electronic equipment and storage medium
CN112951203A (en) * 2021-04-25 2021-06-11 平安科技(深圳)有限公司 Speech synthesis method, speech synthesis device, electronic equipment and storage medium
CN113270090A (en) * 2021-05-19 2021-08-17 平安科技(深圳)有限公司 Combined model training method and device based on ASR model and TTS model
CN113506566A (en) * 2021-06-22 2021-10-15 荣耀终端有限公司 Sound detection model training method, data processing method and related device
CN113470679A (en) * 2021-07-09 2021-10-01 平安科技(深圳)有限公司 Voice awakening method and device based on unsupervised learning, electronic equipment and medium
CN113470679B (en) * 2021-07-09 2024-01-12 平安科技(深圳)有限公司 Voice awakening method and device based on unsupervised learning, electronic equipment and medium
CN114443891A (en) * 2022-01-14 2022-05-06 北京有竹居网络技术有限公司 Encoder generation method, fingerprint extraction method, medium, and electronic device
CN114443891B (en) * 2022-01-14 2022-12-06 北京有竹居网络技术有限公司 Encoder generation method, fingerprint extraction method, medium, and electronic device
CN114446316A (en) * 2022-01-27 2022-05-06 腾讯科技(深圳)有限公司 Audio separation method, and training method, device and equipment of audio separation model
CN114446316B (en) * 2022-01-27 2024-03-12 腾讯科技(深圳)有限公司 Audio separation method, training method, device and equipment of audio separation model
CN115132183A (en) * 2022-05-25 2022-09-30 腾讯科技(深圳)有限公司 Method, apparatus, device, medium, and program product for training audio recognition model
CN115132183B (en) * 2022-05-25 2024-04-12 腾讯科技(深圳)有限公司 Training method, device, equipment, medium and program product of audio recognition model
CN115409073A (en) * 2022-10-31 2022-11-29 之江实验室 I/Q signal identification-oriented semi-supervised width learning method and device
CN116913259B (en) * 2023-09-08 2023-12-15 中国电子科技集团公司第十五研究所 Voice recognition countermeasure method and device combined with gradient guidance
CN116913259A (en) * 2023-09-08 2023-10-20 中国电子科技集团公司第十五研究所 Voice recognition countermeasure method and device combined with gradient guidance

Also Published As

Publication number Publication date
EP4002362A4 (en) 2022-11-30
EP4002362B1 (en) 2023-11-29
WO2021139294A1 (en) 2021-07-15
US11908455B2 (en) 2024-02-20
US20220172708A1 (en) 2022-06-02
EP4002362A1 (en) 2022-05-25
CN111243620B (en) 2022-07-19

Similar Documents

Publication Publication Date Title
CN111243620B (en) Voice separation model training method and device, storage medium and computer equipment
CN111261146B (en) Speech recognition and model training method, device and computer readable storage medium
JP7337953B2 (en) Speech recognition method and device, neural network training method and device, and computer program
Xu et al. Convolutional gated recurrent neural network incorporating spatial features for audio tagging
Feng et al. Speech feature denoising and dereverberation via deep autoencoders for noisy reverberant speech recognition
CN112071330B (en) Audio data processing method and device and computer readable storage medium
CN111899757B (en) Single-channel voice separation method and system for target speaker extraction
CN104541324A (en) A speech recognition system and a method of using dynamic bayesian network models
CN112349297A (en) Depression detection method based on microphone array
CN111899758B (en) Voice processing method, device, equipment and storage medium
CN112735435A (en) Voiceprint open set identification method with unknown class internal division capability
CN115862684A (en) Audio-based depression state auxiliary detection method for dual-mode fusion type neural network
CN114360502A (en) Processing method of voice recognition model, voice recognition method and device
Hadjahmadi et al. Robust feature extraction and uncertainty estimation based on attractor dynamics in cyclic deep denoising autoencoders
Cheng et al. DNN-based speech enhancement with self-attention on feature dimension
Soni et al. State-of-the-art analysis of deep learning-based monaural speech source separation techniques
Jin et al. Speech separation and emotion recognition for multi-speaker scenarios
CN113113041A (en) Voice separation method based on time-frequency cross-domain feature selection
Li et al. A Convolutional Neural Network with Non-Local Module for Speech Enhancement.
Zezario et al. Speech enhancement with zero-shot model selection
Nayem et al. Incorporating Embedding Vectors from a Human Mean-Opinion Score Prediction Model for Monaural Speech Enhancement.
KR20190135916A (en) Apparatus and method for determining user stress using speech signal
CN115881157A (en) Audio signal processing method and related equipment
Zhao et al. Time Domain Speech Enhancement using self-attention-based subspace projection
CN117095674B (en) Interactive control method and system for intelligent doors and windows

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
REG Reference to a national code

Ref country code: HK

Ref legal event code: DE

Ref document number: 40024902

Country of ref document: HK

GR01 Patent grant