CN115116448A - Voice extraction method, neural network model training method, device and storage medium - Google Patents

Voice extraction method, neural network model training method, device and storage medium

Info

Publication number
CN115116448A
CN115116448A (application CN202211037918.4A)
Authority
CN
China
Prior art keywords
voice
target speaker
speaker
neural network
aliasing
Prior art date
Legal status
Granted
Application number
CN202211037918.4A
Other languages
Chinese (zh)
Other versions
CN115116448B (en)
Inventor
刘文璟
谢川
谭斌
展华益
Current Assignee
Sichuan Qiruike Technology Co Ltd
Original Assignee
Sichuan Qiruike Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by Sichuan Qiruike Technology Co Ltd filed Critical Sichuan Qiruike Technology Co Ltd
Priority to CN202211037918.4A priority Critical patent/CN115116448B/en
Publication of CN115116448A publication Critical patent/CN115116448A/en
Application granted granted Critical
Publication of CN115116448B publication Critical patent/CN115116448B/en
Legal status: Active

Classifications

    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L17/00: Speaker identification or verification
    • G10L17/04: Training, enrolment or model building
    • G10L17/02: Preprocessing operations, e.g. segment selection; Pattern representation or modelling, e.g. based on linear discriminant analysis [LDA] or principal components; Feature selection or extraction
    • G10L17/18: Artificial neural networks; Connectionist approaches

Abstract

The invention discloses a voice extraction method, a neural network model training method, a device and a storage medium, wherein the method comprises the following steps: acquiring aliasing voice data of multiple speakers to be extracted and voiceprint registration voice data of a target speaker; inputting the aliasing voice data of the multiple speakers into a voice coding network to acquire a time-series representation of the aliasing voice; inputting the voiceprint registration voice data of the target speaker into a speaker coding network to obtain the voiceprint features of the target speaker; inputting the time-series representation of the aliasing voice and the voiceprint features of the target speaker together into a speaker extraction network, and extracting the voice time-series representation belonging to the target speaker from the multi-speaker aliasing voice data; and inputting the extracted voice time-series representation of the target speaker into a voice decoding network to restore a time-domain voice signal of the target speaker. The method can accurately and effectively extract the voice of the target speaker from the aliasing voice of multiple speakers.

Description

Voice extraction method, neural network model training method, device and storage medium
Technical Field
The invention relates to the technical field of voice separation, in particular to a voice extraction method, a neural network model training method, a device and a storage medium.
Background
The cocktail party problem was first posed in 1953 by the British cognitive scientist Cherry while studying the attention-selection mechanism. It attempts to explore the logic behind how humans understand the speech of a target speaker under interference from other speakers or noise, and thereby to model an intelligent machine able to filter out the signal of the target speaker. Colloquially, the cocktail party problem concerns the human ability of auditory selection in complex auditory environments: a person can easily focus on a sound stimulus of interest and ignore other background sounds, whereas computational auditory models are heavily affected by noise. How to design an auditory model that can flexibly adapt to the cocktail party environment is an important problem in the field of computational hearing, and it has great research significance and application value for a series of important tasks such as speech recognition, speaker recognition and speech separation.
With the vigorous development of artificial intelligence, speech separation, represented by the cocktail party problem, has made tremendous progress driven by the popularization of deep learning. However, in most practical scenarios, current speech separation technology is limited by the number of speakers, noise interference, and the generalization ability of models, and its performance is not satisfactory. Target speaker voice extraction technology directionally extracts the voice of a specified target speaker under the guidance of an additional voiceprint feature cue. It is not limited by the number of speakers, generalizes well, is robust to noisy environments, and is suitable for application scenarios such as homes and conferences where the registration voice of the target speaker can be obtained.
Early target speaker voice extraction technology used speaker adaptation: the magnitude spectrum features of the target speaker's voiceprint registration voice are converted into the weight parameters of an adaptation layer through an auxiliary network, and the output of the adaptation layer is obtained by weighting the outputs of its sublayers, so that the speech model can adapt to the speaker. For example, CN 112331181 A provides a method for extracting a target speaker's voice under multi-speaker conditions that obtains adaptive parameters to dynamically adjust the output and thereby extract the voice of the target speaker.
Target speaker voice extraction based on deep learning is the current mainstream. Most schemes perform feature processing in the frequency domain and then reconstruct the time-domain voice signal; for example, CN 113990344 A provides a method, a device and a medium for separating multi-person voice based on voiceprint features, which uses the short-time Fourier transform to extract spectral features of the voice.
In the process of extracting the voice of the target speaker, the modal fusion between the voiceprint feature vector of the target speaker and the voice representation is a key problem. Because the feature forms of the two modalities are inconsistent, the commonly adopted fusion method first expands the voiceprint feature vector, through a specific transformation, to the same form as the voice representation, and then performs feature fusion using simple operations such as concatenation. For example, CN 105489226 A provides a method for separating the voice of a specific speaker based on a dual-path self-attention mechanism, which uses concatenation to fuse the speaker coding features and the voice features.
The current methods for extracting the voice of a target speaker have the following problems:
1) Mainstream methods operate in the frequency domain, which carries the potential problem of unstable spectral phase estimation, and the quality of the extracted target speaker voice suffers accordingly.
2) The mainstream methods for fusing the voiceprint feature vector with the voice representation are based on simple operations such as concatenation; the correlation between the two modalities is not fully exploited, and the specific information of each modality is lost to some extent during fusion.
Disclosure of Invention
The invention provides a voice extraction method, a neural network model training method, a device and a storage medium, which are used to solve the problems in the related art of the poor performance of frequency-domain target speaker voice extraction methods and the insufficient fusion of the voiceprint feature vector with the voice representation.
The technical scheme adopted by the invention is as follows:
according to a first aspect of the present disclosure, there is provided a speech extraction method, including:
acquiring aliasing voice data of multiple speakers to be extracted and voiceprint registration voice data of a target speaker; the aliasing voice data of the multiple speakers comprises the voice of the target speaker;
inputting the aliasing voice data of a plurality of speakers to be extracted into a voice coding network in a trained preset neural network model, and acquiring the time sequence representation of the aliasing voice;
inputting the voiceprint registration voice data of the target speaker into a speaker coding network in a trained preset neural network model to obtain the voiceprint characteristics of the target speaker;
simultaneously inputting the time-series representation of the aliasing voice and the voiceprint features of the target speaker into a speaker extraction network in the trained preset neural network model, and extracting the voice time-series representation belonging to the target speaker from the aliasing voice data of the multiple speakers;
and inputting the extracted representation of the target speaker voice time sequence into a voice decoding network in a trained preset neural network model, and restoring a time domain voice signal of the target speaker.
Further, the method for constructing the voice coding network comprises: extracting the time-series representation by adopting a one-dimensional convolutional encoder or a self-supervised pre-training model.
Further, a method for constructing the speaker coding network comprises the following steps:
acquiring a time sequence representation of the voiceprint registration voice data of the target speaker by adopting the voice coding network;
modeling the time dependence of the time-series representation by adopting a convolutional or recurrent neural network;
and extracting the voiceprint feature vector of the target speaker from the modeled time-series representation by adopting a pooling layer based on a self-attention mechanism.
Further, the method for constructing the speaker extraction network comprises the following steps:
performing feature fusion on the voiceprint feature vector of the target speaker and the corresponding voice time sequence representation input by adopting a gating convolution fusion method;
modeling the time dependency relationship of the time series representation obtained after the feature fusion, and outputting the time series representation after modeling;
in the speaker extraction network, the feature fusion and the time dependency modeling are connected in series as one stage and the processing is repeated over a plurality of stages; only the feature fusion of the first stage takes the time-series representation of the aliasing voice as input, and the voice time-series representation required by the feature fusion of each subsequent stage is the voice time-series representation output by the previous stage after its time dependency modeling;
and converting the voice time series representation output of the final stage into a mask, and multiplying the mask and the time series representation of the aliasing voice point by point to extract the time series representation of the voice of the target speaker.
Further, the feature fusion step may be omitted in the last stage, so that only the time dependency modeling process is performed.
Further, the method for gated convolution fusion includes:
performing a zero-bias one-dimensional convolution operation between the time-series representation input and the convolution kernel of the information branch to obtain an output signal of the information branch;
convolving the time-series representation input with the convolution kernel of the gating branch, adding a bias term obtained by converting the voiceprint feature vector of the target speaker through a linear layer, and then performing normalization and activation function processing to obtain an output signal of the gating branch;
and multiplying the output signals of the gating branch and the information branch point by point, and then combining the product with the time-series representation input through a residual connection to obtain the time-series representation after feature fusion.
Further, the sequence modeling of the time-series representation is processed by a time convolutional network, a dual-path recurrent neural network, or Transformers.
Further, the voice decoding network is realized by a one-dimensional deconvolution layer or a fully-connected linear layer.
According to a second aspect of the disclosure, there is provided a neural network model training method, including:
acquiring aliasing voice training sample data of multiple speakers and voiceprint registration voice training sample data of a target speaker; the multi-speaker aliasing voice training sample data is obtained by mixing target speaker voice serving as a real label with randomly selected non-target speaker voice;
inputting multi-speaker aliasing voice training sample data into a voice coding network in a preset neural network model, and acquiring time sequence representation of aliasing voice;
inputting voiceprint registration voice training sample data of a target speaker into a speaker encoder network in a preset neural network model to obtain voiceprint characteristics of the target speaker;
simultaneously inputting the time-series representation of the aliasing voice and the voiceprint characteristics of the target speaker into a speaker extraction network in a preset neural network model, and extracting the voice time-series representation belonging to the target speaker from the aliasing voice data of multiple speakers;
inputting the extracted voice time sequence representation of the target speaker into a voice decoding network in a preset neural network model, and restoring a time domain voice signal of the target speaker;
calculating a loss function between the target speaker voice extracted by the preset neural network model and the target speaker voice serving as a real label, updating the parameters of the preset neural network model through training based on gradient back propagation of the loss function, ending the training process after the loss function is fully converged, and determining the preset neural network model obtained after training as the trained preset neural network model.
According to a third aspect of the present disclosure, there is provided a speech extraction device including:
the acquisition module is used for acquiring the aliasing voice data of multiple speakers to be extracted and the voiceprint registration voice data of a target speaker; the aliasing voice data of the multiple speakers comprises the voice of the target speaker;
the voice coding network module is used for inputting the aliasing voice data of the multiple speakers to be extracted into a voice coding network in a trained preset neural network model and acquiring the time sequence representation of the aliasing voice;
the speaker coding network module is used for inputting the voiceprint registration voice data of the target speaker into the speaker coding network in the trained preset neural network model to acquire the voiceprint characteristics of the target speaker;
the speaker extraction network module is used for simultaneously inputting the time-series representation of the aliasing voice and the voiceprint characteristics of the target speaker into a speaker extraction network in a trained preset neural network model and extracting the voice time-series representation belonging to the target speaker from the aliasing voice data of multiple speakers;
and the voice decoding network module is used for inputting the extracted voice time sequence representation of the target speaker into a voice decoding network in a trained preset neural network model and restoring a time domain voice signal of the target speaker.
According to a fourth aspect of the present disclosure, there is provided a neural network model training apparatus, including:
the acquisition module is used for acquiring aliasing voice training sample data of multiple speakers and voiceprint registration voice training sample data of a target speaker; the multi-speaker aliasing voice training sample data is obtained by mixing target speaker voice serving as a real label with randomly selected non-target speaker voice;
the voice coding network module is used for inputting aliased voice training sample data of multiple speakers into a voice coding network in a preset neural network model and acquiring a time sequence representation of the aliased voice;
the speaker encoder network module is used for inputting voiceprint registration voice training sample data of a target speaker into a speaker encoder network in a preset neural network model to acquire voiceprint characteristics of the target speaker;
the speaker extraction network module is used for simultaneously inputting the time-series representation of the aliasing voice and the voiceprint characteristics of the target speaker into a speaker extraction network in a preset neural network model and extracting the voice time-series representation belonging to the target speaker from the aliasing voice data of multiple speakers;
the voice decoding network module is used for inputting the extracted voice time sequence representation of the target speaker into a voice decoding network in a preset neural network model and restoring a time domain voice signal of the target speaker;
and the loss function calculation module is used for calculating a loss function between the target speaker voice extracted by the preset neural network model and the target speaker voice serving as a real label, updating the parameters of the preset neural network model through training based on gradient back propagation of the loss function, ending the training process after the loss function is fully converged, and determining the preset neural network model obtained after training as the trained preset neural network model.
According to a fifth aspect of the present disclosure, there is provided a computer-readable storage medium, wherein a computer program is stored on the computer-readable storage medium, and a processor executes the computer program to implement the voice extraction method according to the first aspect.
The beneficial effects of the invention are:
1) The voice features are extracted and encoded in the time domain, which avoids the potential influence of problems such as unstable spectral phase estimation in frequency-domain methods.
2) A gated convolution fusion technique is adopted to fuse the voiceprint features of the target speaker with the voice features; through global conditional modeling and a gating mechanism, the features of the two modalities are fully fused and the specific information of each modality is effectively retained, thereby improving the quality of the extracted target speaker voice.
3) Through this innovative feature fusion scheme, the invention makes full use of the voiceprint feature cue of the target speaker and can accurately and effectively extract the voice of the target speaker from the aliasing voice of multiple speakers.
Drawings
FIG. 1 is a flow chart illustrating the steps of a speech extraction method disclosed in the present invention;
FIG. 2 is a block diagram of a speech extraction method according to the present disclosure;
FIG. 3 is a block diagram of a speaker extraction network according to the present disclosure;
FIG. 4 is a flowchart illustrating the steps of a gated convolution fusion method according to the present disclosure;
FIG. 5 is a block diagram of a gated convolution fusion method according to the present disclosure;
FIG. 6 is a flow chart illustrating the steps of a neural network model training method disclosed in the present invention;
FIG. 7 is a block diagram of a speech extraction apparatus according to the present disclosure;
fig. 8 is a block diagram of a neural network model training apparatus according to the present disclosure.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention clearer, the present invention will be described in further detail below with reference to the accompanying drawings, but embodiments of the present invention are not limited thereto.
Example 1:
as shown in fig. 1 and fig. 2, the speech extraction method provided in this embodiment includes the following steps:
s1.1, acquiring aliasing voice data of multiple speakers to be extracted and voiceprint registration voice data of a target speaker; the multi-speaker aliasing voice data comprises the target speaker voice.
Specifically, taking a sampling rate of 16 kHz as an example, a voice segment of arbitrary length to be extracted and a voiceprint registration voice segment of the specified target speaker are collected. The voiceprint registration voice of the target speaker is clean voice of the target speaker used for voiceprint registration.
S1.2, inputting the aliasing voice data of the multiple speakers to be extracted into a voice coding network in a trained preset neural network model, and acquiring the time sequence representation of the aliasing voice.
Specifically, methods including at least a one-dimensional convolutional encoder or a self-supervised pre-training model can be employed to extract the time-series representation. The one-dimensional convolutional encoder can be implemented with a one-dimensional convolutional layer (1-D CNN) and a rectified linear unit (ReLU) layer, where the convolution kernel size is L, the stride is L/2, the number of input channels is 1, and the number of output channels is D. The self-supervised pre-training approach can use open-source pre-trained models such as Wav2vec2 or HuBERT in their standard configurations to extract the time-series representation.
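The following is a minimal PyTorch sketch of the one-dimensional convolutional encoder variant described above; the class name and the concrete values L=32 and D=256 are illustrative assumptions, since the description only fixes the stride L/2 and the channel numbers 1 and D.

```python
import torch
import torch.nn as nn

class ConvEncoder(nn.Module):
    """Waveform -> time-series representation via a 1-D convolution and ReLU.
    Only the relations stride = L/2, in_channels = 1, out_channels = D come from
    the description; the concrete L and D values here are assumptions."""
    def __init__(self, L: int = 32, D: int = 256):
        super().__init__()
        self.conv = nn.Conv1d(1, D, kernel_size=L, stride=L // 2, bias=False)
        self.relu = nn.ReLU()

    def forward(self, wav: torch.Tensor) -> torch.Tensor:
        # wav: (batch, samples) -> (batch, D, frames)
        return self.relu(self.conv(wav.unsqueeze(1)))
```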
S1.3, inputting the voiceprint registration voice data of the target speaker into the speaker coding network in the trained preset neural network model, and acquiring the voiceprint features of the target speaker.
S1.3.1, adopting the voice coding network to obtain the time series representation of the voice print registration voice data of the target speaker.
Specifically, the time series representation of the voiceprint registration voice data of the target speaker can be extracted by directly adopting the voice coding network in S1.2.
S1.3.2, modeling the time dependence of the time-series representation by adopting a convolutional or recurrent neural network;
specifically, the time dependence of time series characterization can be modeled by stacking multiple layers of convolutional networks with residual Connection (CNN) or bidirectional long-term memory networks (BiLSTM). On one hand, a convolutional network with n being more than or equal to 5 layers can be adopted for modeling, wherein the number of input and output channels of the network in the first layer is (D, O), the number of input and output channels in the network in the middle layer is (O, O), the number of the last 3 layers is (O, P), the number of the last 2 layers is (P, P), and the number of the last layer is (P, H). In addition, the convolution networks except the first layer and the last layer adopt a residual error connection mode, and the operation of layer normalization is added before the first layer. On the other hand, n layers of the BilSTM network with the input dimension of D and the hidden dimension of H can be adopted for modeling, and then the processing is carried out through a ReLU activation function and a full connection layer with the input dimension of H.
S1.3.3, extracting the voiceprint feature vector of the target speaker from the modeled time-series representation by using a pooling layer based on a self-attention mechanism.
Specifically, the pooling layer based on the self-attention mechanism consists of a feed-forward network and a pooling network. The feed-forward network consists of two fully connected layers whose input and output channels are (H, H) and (H, 1), respectively. The pooling network first computes attention coefficients in a masked manner, then uses a softmax function to obtain probability weights over all time steps and performs the pooling operation as a weighted average, and finally obtains the voiceprint feature vector of the target speaker after processing by a fully connected layer and a tanh activation function.
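The pooling layer can be sketched as follows; the class name and the embedding dimension are assumptions, and no activation is inserted between the two fully connected scoring layers because none is specified above.

```python
class AttentivePooling(nn.Module):
    """Self-attention pooling: a two-layer feed-forward scorer (H, H) and (H, 1),
    masked softmax over time, weighted-average pooling, then a fully connected
    layer with tanh producing the voiceprint feature vector."""
    def __init__(self, H: int = 256, emb_dim: int = 256):
        super().__init__()
        self.ff1 = nn.Linear(H, H)
        self.ff2 = nn.Linear(H, 1)
        self.out = nn.Linear(H, emb_dim)

    def forward(self, x: torch.Tensor, mask: torch.Tensor = None) -> torch.Tensor:
        # x: (batch, frames, H); mask: (batch, frames) bool, True = valid frame
        scores = self.ff2(self.ff1(x)).squeeze(-1)            # (batch, frames)
        if mask is not None:
            scores = scores.masked_fill(~mask, float("-inf"))
        weights = torch.softmax(scores, dim=-1)               # per-frame probabilities
        pooled = torch.sum(weights.unsqueeze(-1) * x, dim=1)  # weighted average
        return torch.tanh(self.out(pooled))                   # voiceprint feature vector
```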
As shown in fig. 3, S1.4, the time-series representation of the aliasing voice and the voiceprint features of the target speaker are simultaneously input into the speaker extraction network in the trained preset neural network model, and the voice time-series representation belonging to the target speaker in the multi-speaker aliasing voice data is extracted.
S1.4.1, fusing the voiceprint feature vector of the target speaker with the corresponding voice time-series representation input by means of the gated convolution fusion method.
S1.4.2, modeling the time dependence of the time-series representation obtained after feature fusion, and outputting the modeled time-series representation.
Specifically, the time dependence of the time-series representation can be modeled at least with a time convolutional network (TCN), a dual-path recurrent neural network (dual-path RNN), or Transformers. For example, a time convolutional network typically consists of 8 stacked time-domain convolutional layers. Each time-domain convolutional layer performs a feature-dimension transformation through a 1×1 convolution and then a one-dimensional convolution along the time dimension with kernel size K=3 and stride S=1, the dilation factor of the x-th layer being set to 2^(x-1); layer normalization and a parametric rectified linear unit (PReLU) activation function are applied before each convolution; finally, the feature dimension is restored through a 1×1 convolution and a mask over the time-series representation is output, and the input representation is multiplied point by point with the output mask to obtain the modeled time-series representation.
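One such time-domain convolutional layer can be sketched as follows; GroupNorm with a single group stands in for the layer normalization, and the residual connection between layers is an assumption about how the stack is chained.

```python
class TCNBlock(nn.Module):
    """One time-domain convolutional layer of the TCN: 1x1 conv for channel
    transformation, dilated conv with K=3 and S=1 along time, 1x1 conv to restore
    the feature dimension, with normalization and PReLU before each conv."""
    def __init__(self, channels: int = 256, hidden: int = 512, dilation: int = 1):
        super().__init__()
        self.net = nn.Sequential(
            nn.GroupNorm(1, channels), nn.PReLU(),
            nn.Conv1d(channels, hidden, kernel_size=1),
            nn.GroupNorm(1, hidden), nn.PReLU(),
            nn.Conv1d(hidden, hidden, kernel_size=3, stride=1,
                      dilation=dilation, padding=dilation),
            nn.GroupNorm(1, hidden), nn.PReLU(),
            nn.Conv1d(hidden, channels, kernel_size=1),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, channels, frames); output has the same shape
        return x + self.net(x)

# 8 stacked layers; the dilation of the x-th layer is 2**(x-1)
tcn_stack = nn.Sequential(*[TCNBlock(dilation=2 ** i) for i in range(8)])
```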
S1.4.3, repeating the above two steps (S1.4.1 and S1.4.2) over multiple stages: only the first stage takes the time-series representation of the aliasing voice as input, and thereafter the voice time-series representation required by the feature fusion of each stage is the voice time-series representation output by the previous stage after its modeling processing. That is, in the speaker extraction network, the feature fusion and the time dependency modeling are connected in series as one stage and the processing is repeated over a plurality of stages; only the feature fusion of the first stage takes the time-series representation of the aliasing voice as input, and the voice time-series representation required by the feature fusion of each subsequent stage is the output of the previous stage after its time dependency modeling.
specifically, feature fusion and time-dependent relationship modeling of M =4 stages were performed. Because the fused features need to be fully expressed to obtain accurate time series representation output, the step of not executing feature fusion can be selected in the last stage, and only the time dependency modeling is carried out, so that the expression capability of the modeling processing in the last two stages is enhanced, and the performance of the system is improved.
S1.4.4, converting the voice time-series representation output by the final stage into a mask, and multiplying the mask point by point with the time-series representation of the aliasing voice to extract the time-series representation of the target speaker's voice.
Specifically, the voice time-series representation output by the final stage can be converted into an estimate of the mask by a 1×1 convolution and a ReLU activation function.
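Putting S1.4.1 through S1.4.4 together, a sketch of the multi-stage extractor might look as follows; GatedFusion is the fusion module sketched after the fusion steps (S2.1 to S2.3) below, TCNBlock is the layer sketched above, M=4 follows the description, and the remaining details are assumptions.

```python
class SpeakerExtractionNetwork(nn.Module):
    """Multi-stage extractor sketch: M stages of (gated fusion -> TCN modeling),
    with the fusion step optionally omitted in the last stage; the final output
    is turned into a mask by a 1x1 conv + ReLU and multiplied with the
    aliased-speech representation."""
    def __init__(self, channels: int = 256, M: int = 4, fuse_last: bool = False):
        super().__init__()
        n_fusions = M if fuse_last else M - 1
        self.fusions = nn.ModuleList([GatedFusion(channels) for _ in range(n_fusions)])
        self.blocks = nn.ModuleList([
            nn.Sequential(*[TCNBlock(channels, dilation=2 ** d) for d in range(8)])
            for _ in range(M)])
        self.mask_head = nn.Sequential(nn.Conv1d(channels, channels, 1), nn.ReLU())

    def forward(self, mix_repr: torch.Tensor, spk_emb: torch.Tensor) -> torch.Tensor:
        # mix_repr: (batch, channels, frames); spk_emb: (batch, emb_dim)
        x = mix_repr
        for i, block in enumerate(self.blocks):
            if i < len(self.fusions):
                x = self.fusions[i](x, spk_emb)   # feature fusion
            x = block(x)                          # time-dependency modeling
        mask = self.mask_head(x)                  # mask estimate (1x1 conv + ReLU)
        return mask * mix_repr                    # target speaker's time-series repr.
```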
S1.5, inputting the extracted voice time-series representation of the target speaker into the voice decoding network in the trained preset neural network model, and restoring a time-domain voice signal of the target speaker.
Specifically, the voice decoding network may be implemented by a one-dimensional deconvolution layer or a fully connected linear layer. The one-dimensional deconvolution layer typically adopts a deconvolution operation whose input channel number is D and output size is L.
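A matching decoder sketch, mirroring the encoder above, is shown below (the values are again illustrative, and mapping the D-channel representation back to a single-channel waveform is an assumption about the output layout):

```python
class ConvDecoder(nn.Module):
    """1-D deconvolution decoder: maps the extracted (batch, D, frames)
    representation back to a (batch, samples) waveform; mirrors ConvEncoder."""
    def __init__(self, L: int = 32, D: int = 256):
        super().__init__()
        self.deconv = nn.ConvTranspose1d(D, 1, kernel_size=L, stride=L // 2,
                                         bias=False)

    def forward(self, repr_: torch.Tensor) -> torch.Tensor:
        # (batch, D, frames) -> (batch, samples)
        return self.deconv(repr_).squeeze(1)
```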
As shown in fig. 4 and fig. 5, the embodiment elaborates the gated convolution fusion method, which includes the following steps:
and S2.1, performing zero-offset one-dimensional convolution operation on the time sequence representation input and the convolution kernel of the information branch to obtain an output signal of the information branch.
Specifically, the size of a convolution kernel adopted in the one-dimensional convolution operation is 3, the filling length is 1, and an output signal of the information branch is obtained after the convolution operation is completed.
S2.2, convolving the time-series representation input with the convolution kernel of the gating branch, adding a bias term obtained by converting the voiceprint feature vector of the target speaker through a linear layer, and obtaining the output signal of the gating branch through normalization and an activation function;
Specifically, the gating branch adopts a one-dimensional convolution operation with a bias term, where the convolution configuration is the same as that of the information branch; the bias term is generated by mapping the voiceprint feature vector of the target speaker through a fully connected linear layer, and the result is then processed by layer normalization (LayerNorm) and a sigmoid activation function to obtain the output signal of the gating branch.
S2.3, multiplying the output signals of the gating branch and the information branch point by point, and then combining the product with the time-series representation input through a residual connection to obtain the time-series representation after feature fusion.
Specifically, the feature fusion of the two modalities is performed by letting the output signal of the gating branch control the transmission of the voice information stream in the information branch. On the one hand, the voice information stream in the information branch retains the complete content information of the target speaker's voice; on the other hand, the output signal of the gating branch fuses the voiceprint feature cue into the control signal and intervenes in the extraction of the target speaker information in the information branch through gating, rather than through a direct means based on simple operations such as concatenation. Finally, a residual connection is adopted to improve the convergence of the fusion module within the deep neural network.
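A sketch of the complete gated convolution fusion module, combining S2.1 to S2.3, is given below; GroupNorm with a single group stands in for the layer normalization, and the channel and embedding sizes are illustrative assumptions.

```python
class GatedFusion(nn.Module):
    """Gated convolution fusion: a bias-free information-branch conv, a
    gating-branch conv whose bias term is a linear mapping of the speaker
    embedding, normalization + sigmoid on the gate, point-wise gating, and a
    residual connection back to the input representation."""
    def __init__(self, channels: int = 256, emb_dim: int = 256,
                 kernel_size: int = 3):
        super().__init__()
        pad = kernel_size // 2                    # padding 1 for kernel size 3
        self.info = nn.Conv1d(channels, channels, kernel_size,
                              padding=pad, bias=False)
        self.gate = nn.Conv1d(channels, channels, kernel_size,
                              padding=pad, bias=False)
        self.spk_bias = nn.Linear(emb_dim, channels)   # speaker-conditioned bias
        self.norm = nn.GroupNorm(1, channels)

    def forward(self, x: torch.Tensor, spk_emb: torch.Tensor) -> torch.Tensor:
        # x: (batch, channels, frames); spk_emb: (batch, emb_dim)
        info = self.info(x)                                         # information branch
        gate = self.gate(x) + self.spk_bias(spk_emb).unsqueeze(-1)  # conv + bias term
        gate = torch.sigmoid(self.norm(gate))                       # gating branch
        return x + info * gate                    # point-wise gating + residual
```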
Example 2
As shown in fig. 6, the neural network model training method provided in this embodiment includes the following steps:
and S3.1, obtaining aliasing voice training sample data of multiple speakers and voiceprint registration voice training sample data of the target speaker. The multi-speaker aliasing voice training sample data is obtained by mixing target speaker voice serving as a real label with randomly selected non-target speaker voice.
Specifically, all voice training sample data are resampled at a sampling rate of Fs = 16 kHz. Each piece of target speaker voice training sample data serving as a real label and each piece of non-target speaker voice training sample data are first divided into voice segments of 4 s duration. For each 4 s target speaker voice segment, one 4 s voice segment is randomly selected from one or more non-target speakers to pair with it, and the segments are mixed by amplitude scaling and superposition according to a randomly assigned signal-to-noise ratio within a -5 dB to 5 dB range, generating a 4 s multi-speaker aliasing voice segment. Finally, the corresponding multi-speaker aliasing voice segments and target speaker voice segments are used as the input and the real label, respectively, and the training, validation and test sets are divided according to a common ratio. It should be noted that, in this embodiment, both the target speaker voice used to form the aliasing voice and the voiceprint registration voice of the target speaker come from the target speaker voice training sample data serving as the real label, but the voiceprint registration voice used in training must be a voice segment different from the one used in the aliasing voice.
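A simplified NumPy sketch of the mixing step is shown below; the helper name and the interpretation of the SNR range as roughly -5 dB to 5 dB are assumptions.

```python
import numpy as np

def mix_at_snr(target: np.ndarray, interference: np.ndarray,
               snr_db: float) -> np.ndarray:
    """Mix a 4 s target segment with a 4 s interfering segment at a given SNR
    (dB) by scaling the interference amplitude."""
    target_power = np.mean(target ** 2) + 1e-8
    interf_power = np.mean(interference ** 2) + 1e-8
    scale = np.sqrt(target_power / (interf_power * 10.0 ** (snr_db / 10.0)))
    return target + scale * interference

# e.g. a randomly drawn SNR in the -5 dB to 5 dB range:
# mixture = mix_at_snr(target_4s, interferer_4s, np.random.uniform(-5.0, 5.0))
```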
S3.2, inputting aliasing voice training sample data of multiple speakers into a voice coding network in a preset neural network model, and acquiring time sequence representation of aliasing voice;
S3.3, inputting the voiceprint registration voice training sample data of the target speaker into the speaker coding network in the preset neural network model to obtain the voiceprint features of the target speaker;
S3.4, simultaneously inputting the time-series representation of the aliasing voice and the voiceprint features of the target speaker into the speaker extraction network in the preset neural network model, and extracting the voice time-series representation belonging to the target speaker from the multi-speaker aliasing voice data;
S3.5, inputting the extracted voice time-series representation of the target speaker into the voice decoding network in the preset neural network model, and restoring a time-domain voice signal of the target speaker;
S3.6, calculating a loss function between the target speaker voice extracted by the preset neural network model and the target speaker voice serving as the real label, updating the parameters of the preset neural network model through training based on gradient back propagation of the loss function, and ending the training process after the loss function is fully converged. The preset neural network model obtained after training is determined as the trained preset neural network model.
Specifically, the preset neural network model may be trained using SI-SDR (scale-invariant signal-to-distortion ratio) as the loss function. The training strategy uses the Adam optimizer with an initial learning rate of 1e-3, and the maximum number of training iterations is 100; when the validation loss does not decrease (below the minimum loss obtained before) for 3 consecutive epochs, the learning rate is halved, and when it does not decrease for 10 consecutive epochs, training is ended early. The model parameters are saved after training is finished.
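A standard formulation of the SI-SDR loss is sketched below for illustration; the commented-out optimizer and scheduler lines reflect the training strategy described above under the assumption of a PyTorch implementation.

```python
def si_sdr_loss(est: torch.Tensor, ref: torch.Tensor,
                eps: float = 1e-8) -> torch.Tensor:
    """Negative SI-SDR (scale-invariant signal-to-distortion ratio) loss.
    est, ref: (batch, samples) tensors."""
    est = est - est.mean(dim=-1, keepdim=True)
    ref = ref - ref.mean(dim=-1, keepdim=True)
    # project the estimate onto the reference signal
    proj = (torch.sum(est * ref, dim=-1, keepdim=True)
            / (torch.sum(ref ** 2, dim=-1, keepdim=True) + eps)) * ref
    noise = est - proj
    si_sdr = 10.0 * torch.log10(torch.sum(proj ** 2, dim=-1)
                                / (torch.sum(noise ** 2, dim=-1) + eps) + eps)
    return -si_sdr.mean()

# training setup as described (illustrative):
# optim = torch.optim.Adam(model.parameters(), lr=1e-3)
# sched = torch.optim.lr_scheduler.ReduceLROnPlateau(optim, factor=0.5, patience=3)
```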
Example 3
Referring to fig. 7, the present embodiment provides a speech extraction apparatus 100, including:
the acquisition module 110 is configured to acquire aliased voice data of multiple speakers to be extracted and voiceprint registration voice data of a target speaker; the aliasing voice data of the multiple speakers comprises the voice of the target speaker;
the voice coding network module 120 is configured to input the multi-speaker aliasing voice data to be extracted into the voice coding network in the trained preset neural network model, and obtain a time-series representation of the aliasing voice;
a speaker coding network module 130, configured to input voiceprint registration voice data of the target speaker into a speaker coding network in the trained preset neural network model, and obtain a voiceprint feature of the target speaker;
the speaker extraction network module 140 is configured to simultaneously input the time-series representation of the aliasing voice and the voiceprint characteristics of the target speaker into the speaker extraction network in the trained preset neural network model, and extract the voice time-series representation belonging to the target speaker from the multi-speaker aliasing voice data;
and the speech decoding network module 150 is configured to input the extracted representation of the time sequence of the speech of the target speaker to a speech decoding network in a trained preset neural network model, and restore a time-domain speech signal of the target speaker.
Example 4
Referring to fig. 8, the present embodiment provides a neural network model training apparatus 200, including:
the acquisition module 210 is configured to acquire aliased voice training sample data of multiple speakers and voiceprint registration voice training sample data of a target speaker; the multi-speaker aliasing voice training sample data is obtained by mixing target speaker voice serving as a real label with randomly selected non-target speaker voice;
the speech coding network module 220 is configured to input aliased speech training sample data of multiple speakers to a speech coding network in a preset neural network model, and obtain a time sequence representation of the aliased speech;
a speaker encoder network module 230, configured to input voiceprint registration voice training sample data of a target speaker into a speaker encoder network in a preset neural network model, and obtain a voiceprint feature of the target speaker;
the speaker extraction network module 240 is configured to simultaneously input the time-series representation of the aliasing voice and the voiceprint features of the target speaker into the speaker extraction network in the preset neural network model, and extract the voice time-series representation belonging to the target speaker from the multi-speaker aliasing voice data;
the voice decoding network module 250 is configured to input the extracted voice time-series representation of the target speaker into the voice decoding network in the preset neural network model, and restore a time-domain voice signal of the target speaker;
and the loss function calculation module 260 is configured to calculate a loss function between the target speaker voice extracted by the preset neural network model and the target speaker voice serving as the real label, update the parameters of the preset neural network model through training based on gradient back propagation of the loss function, end the training process after the loss function is fully converged, and determine the preset neural network model obtained after training as the trained preset neural network model.
Example 5
The present embodiment provides a computer-readable storage medium, in which a computer program is stored, and a processor executes the computer program to implement the speech extraction method according to embodiment 1.
The above embodiments are only intended to illustrate the technical solutions of the present invention, not to limit them. Although the invention has been described in detail with reference to the foregoing embodiments, it will be understood by those skilled in the art that the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced, and such modifications or substitutions do not depart from the spirit and scope of the corresponding technical solutions of the embodiments of the present invention.

Claims (12)

1. A method of speech extraction, comprising:
acquiring aliasing voice data of multiple speakers to be extracted and voiceprint registration voice data of a target speaker; the aliasing voice data of the multiple speakers comprises the voice of the target speaker;
inputting the aliasing voice data of a plurality of speakers to be extracted into a voice coding network in a trained preset neural network model, and acquiring the time sequence representation of the aliasing voice;
inputting the voiceprint registration voice data of the target speaker into a speaker coding network in a trained preset neural network model to obtain the voiceprint characteristics of the target speaker;
simultaneously inputting the time-series representation of the aliasing voice and the voiceprint characteristics of the target speaker into a speaker extraction network in a trained preset neural network model, and extracting the voice time-series representation belonging to the target speaker from the aliasing voice data of multiple speakers;
and inputting the extracted representation of the target speaker voice time sequence into a voice decoding network in a trained preset neural network model, and restoring a time domain voice signal of the target speaker.
2. The speech extraction method of claim 1, wherein the method of constructing the speech coding network comprises: and extracting the time sequence representation by adopting a one-dimensional convolution coder or a self-supervision pre-training model.
3. The method of claim 1, wherein the method of constructing the speaker coding network comprises:
acquiring time series representation of voiceprint registration voice data of a target speaker by adopting the voice coding network;
modeling the time dependence of the time-series representation by adopting a convolutional or recurrent neural network;
and extracting the voiceprint feature vector of the target speaker from the modeled time-series representation by adopting a pooling layer based on a self-attention mechanism.
4. The speech extraction method of claim 1, wherein the method of constructing the speaker extraction network comprises:
performing feature fusion on the voiceprint feature vector of the target speaker and the corresponding voice time sequence representation input by adopting a gating convolution fusion method;
modeling the time dependence relationship of the time series representation obtained after feature fusion, and outputting the time series representation after modeling;
in the speaker extraction network, the feature fusion and the time dependency modeling are connected in series as one stage and the processing is repeated over a plurality of stages; only the feature fusion of the first stage takes the time-series representation of the aliasing voice as input, and the voice time-series representation required by the feature fusion of each subsequent stage is the voice time-series representation output by the previous stage after its time dependency modeling;
and converting the voice time series representation output of the final stage into a mask, and multiplying the mask and the time series representation of the aliasing voice point by point to extract the time series representation of the voice of the target speaker.
5. The speech extraction method according to claim 4, wherein the feature fusion step is omitted in the last stage and only the time dependency modeling process is performed.
6. The method of speech extraction according to claim 4, wherein said gated convolution fusion method comprises:
performing a zero-bias one-dimensional convolution operation between the time-series representation input and the convolution kernel of the information branch to obtain an output signal of the information branch;
convolving the time-series representation input with the convolution kernel of the gating branch, adding a bias term obtained by converting the voiceprint feature vector of the target speaker through a linear layer, and then performing normalization and activation function processing to obtain an output signal of the gating branch;
and multiplying the output signals of the gating branch and the information branch point by point, and then combining the product with the time-series representation input through a residual connection to obtain the time-series representation after feature fusion.
7. The method of speech extraction according to claim 4, wherein said sequence modeling of the time-series representation is processed through a time convolutional network, a dual-path recurrent neural network, or Transformers.
8. The speech extraction method of claim 1 wherein the speech decoding network is implemented by a one-dimensional deconvolution layer or a fully-connected linear layer.
9. A neural network model training method is characterized by comprising the following steps:
acquiring aliasing voice training sample data of multiple speakers and voiceprint registration voice training sample data of a target speaker; the multi-speaker aliasing voice training sample data is obtained by mixing target speaker voice serving as a real label with randomly selected non-target speaker voice;
inputting multi-speaker aliasing voice training sample data into a voice coding network in a preset neural network model, and acquiring time sequence representation of aliasing voice;
inputting voiceprint registration voice training sample data of a target speaker into a speaker encoder network in a preset neural network model to obtain voiceprint characteristics of the target speaker;
simultaneously inputting the time-series representation of the aliasing voice and the voiceprint characteristics of the target speaker into a speaker extraction network in a preset neural network model, and extracting the voice time-series representation belonging to the target speaker from the aliasing voice data of multiple speakers;
inputting the extracted voice time sequence representation of the target speaker into a voice decoding network in a preset neural network model, and restoring a time domain voice signal of the target speaker;
calculating a loss function between the target speaker voice extracted by the preset neural network model and the target speaker voice serving as a real label, updating the parameters of the preset neural network model through training based on gradient back propagation of the loss function, ending the training process after the loss function is fully converged, and determining the preset neural network model obtained after training as the trained preset neural network model.
10. A speech extraction device, comprising:
the acquisition module is used for acquiring the aliasing voice data of multiple speakers to be extracted and the voiceprint registration voice data of a target speaker; the aliasing voice data of the multiple speakers comprises the voice of the target speaker;
the voice coding network module is used for inputting the aliasing voice data of the multiple speakers to be extracted into a voice coding network in a trained preset neural network model and acquiring the time sequence representation of the aliasing voice;
the speaker coding network module is used for inputting the voiceprint registration voice data of the target speaker into the speaker coding network in the trained preset neural network model to acquire the voiceprint characteristics of the target speaker;
the speaker extraction network module is used for simultaneously inputting the time-series representation of the aliasing voice and the voiceprint characteristics of the target speaker into a speaker extraction network in a trained preset neural network model and extracting the voice time-series representation belonging to the target speaker from the multi-speaker aliasing voice data;
and the voice decoding network module is used for inputting the extracted voice time sequence representation of the target speaker into a voice decoding network in a trained preset neural network model and restoring a time domain voice signal of the target speaker.
11. A neural network model training device, comprising:
the acquisition module is used for acquiring aliasing voice training sample data of multiple speakers and voiceprint registration voice training sample data of a target speaker; the multi-speaker aliasing voice training sample data is obtained by mixing target speaker voice serving as a real label with randomly selected non-target speaker voice;
the voice coding network module is used for inputting aliased voice training sample data of multiple speakers into a voice coding network in a preset neural network model and acquiring a time sequence representation of the aliased voice;
the speaker encoder network module is used for inputting voiceprint registration voice training sample data of a target speaker into a speaker encoder network in a preset neural network model to acquire voiceprint characteristics of the target speaker;
the speaker extraction network module is used for simultaneously inputting the time-series representation of the aliasing voice and the voiceprint characteristics of the target speaker into a speaker extraction network in a preset neural network model and extracting the voice time-series representation belonging to the target speaker from the aliasing voice data of multiple speakers;
the voice decoding network module is used for inputting the extracted voice time sequence representation of the target speaker into a voice decoding network in a preset neural network model and restoring a time domain voice signal of the target speaker;
and the loss function calculation module is used for calculating a loss function between the target speaker voice extracted by the preset neural network model and the target speaker voice serving as a real label, updating the parameters of the preset neural network model through training based on gradient back propagation of the loss function, ending the training process after the loss function is fully converged, and determining the preset neural network model obtained after training as the trained preset neural network model.
12. A computer-readable storage medium, having a computer program stored thereon, wherein a processor executes the computer program to implement the speech extraction method according to any one of claims 1-8.
CN202211037918.4A 2022-08-29 2022-08-29 Voice extraction method, neural network model training method, device and storage medium Active CN115116448B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211037918.4A CN115116448B (en) 2022-08-29 2022-08-29 Voice extraction method, neural network model training method, device and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211037918.4A CN115116448B (en) 2022-08-29 2022-08-29 Voice extraction method, neural network model training method, device and storage medium

Publications (2)

Publication Number Publication Date
CN115116448A true CN115116448A (en) 2022-09-27
CN115116448B CN115116448B (en) 2022-11-15

Family

ID=83336384

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211037918.4A Active CN115116448B (en) 2022-08-29 2022-08-29 Voice extraction method, neural network model training method, device and storage medium

Country Status (1)

Country Link
CN (1) CN115116448B (en)


Patent Citations (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JPH11242499A (en) * 1997-08-29 1999-09-07 Toshiba Corp Voice encoding and decoding method and component separating method for voice signal
WO2019198265A1 (en) * 2018-04-13 2019-10-17 Mitsubishi Electric Corporation Speech recognition system and method using speech recognition system
CN110287320A (en) * 2019-06-25 2019-09-27 北京工业大学 A kind of deep learning of combination attention mechanism is classified sentiment analysis model more
CN111243579A (en) * 2020-01-19 2020-06-05 清华大学 Time domain single-channel multi-speaker voice recognition method and system
US20210272573A1 (en) * 2020-02-29 2021-09-02 Robert Bosch Gmbh System for end-to-end speech separation using squeeze and excitation dilated convolutional neural networks
CN111653288A (en) * 2020-06-18 2020-09-11 南京大学 Target person voice enhancement method based on conditional variation self-encoder
CN112071325A (en) * 2020-09-04 2020-12-11 中山大学 Many-to-many voice conversion method based on double-voiceprint feature vector and sequence-to-sequence modeling
WO2022062800A1 (en) * 2020-09-25 2022-03-31 华为技术有限公司 Speech separation method, electronic device, chip and computer-readable storage medium
CN113053407A (en) * 2021-02-06 2021-06-29 南京蕴智科技有限公司 Single-channel voice separation method and system for multiple speakers
CN113571074A (en) * 2021-08-09 2021-10-29 四川启睿克科技有限公司 Voice enhancement method and device based on multi-band structure time domain audio separation network
CN114495973A (en) * 2022-01-25 2022-05-13 中山大学 Special person voice separation method based on double-path self-attention mechanism
CN114329036A (en) * 2022-03-16 2022-04-12 中山大学 Cross-modal characteristic fusion system based on attention mechanism

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
CHENGLIN XU et al.: "SpEx: Multi-Scale Time Domain Speaker Extraction Network", IEEE/ACM Transactions on Audio, Speech, and Language Processing *
丁辉: "Time-Domain Speech Separation Algorithm Based on Deep Neural Networks", China Master's Theses Full-text Database *

Also Published As

Publication number Publication date
CN115116448B (en) 2022-11-15

Similar Documents

Publication Publication Date Title
JP7337953B2 (en) Speech recognition method and device, neural network training method and device, and computer program
CN111243620B (en) Voice separation model training method and device, storage medium and computer equipment
WO2021143327A1 (en) Voice recognition method, device, and computer-readable storage medium
CN112071329B (en) Multi-person voice separation method and device, electronic equipment and storage medium
Feng et al. Speech feature denoising and dereverberation via deep autoencoders for noisy reverberant speech recognition
KR101807948B1 (en) Ensemble of Jointly Trained Deep Neural Network-based Acoustic Models for Reverberant Speech Recognition and Method for Recognizing Speech using the same
EP0623914B1 (en) Speaker independent isolated word recognition system using neural networks
CN110767244B (en) Speech enhancement method
KR20160032536A (en) Signal process algorithm integrated deep neural network based speech recognition apparatus and optimization learning method thereof
JP2008152262A (en) Method and apparatus for transforming speech feature vector
KR101807961B1 (en) Method and apparatus for processing speech signal based on lstm and dnn
US5734793A (en) System for recognizing spoken sounds from continuous speech and method of using same
JPH06324698A (en) Method for reduction of noise for speech recognition
CN111899757A (en) Single-channel voice separation method and system for target speaker extraction
CN112071330A (en) Audio data processing method and device and computer readable storage medium
CN112037809A (en) Residual echo suppression method based on multi-feature flow structure deep neural network
Kothapally et al. Skipconvgan: Monaural speech dereverberation using generative adversarial networks via complex time-frequency masking
Jin et al. Speech separation and emotion recognition for multi-speaker scenarios
CN113823273A (en) Audio signal processing method, audio signal processing device, electronic equipment and storage medium
Tamura et al. Improvements to the noise reduction neural network
CN113571095A (en) Speech emotion recognition method and system based on nested deep neural network
CN115116448B (en) Voice extraction method, neural network model training method, device and storage medium
CN116013339A (en) Single-channel voice enhancement method based on improved CRN
Li et al. Speaker and direction inferred dual-channel speech separation
CN112489678B (en) Scene recognition method and device based on channel characteristics

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant