CN110176243B - Speech enhancement method, model training method, device and computer equipment


Info

Publication number
CN110176243B
Authority
CN
China
Prior art keywords
voice
training
identity
speech
features
Prior art date
Legal status
Active
Application number
CN201810911283.3A
Other languages
Chinese (zh)
Other versions
CN110176243A (en)
Inventor
王燕南
甄广启
Current Assignee
Tencent Technology Shenzhen Co Ltd
Original Assignee
Tencent Technology Shenzhen Co Ltd
Priority date
Filing date
Publication date
Application filed by Tencent Technology Shenzhen Co Ltd
Priority to CN201810911283.3A
Publication of CN110176243A
Application granted
Publication of CN110176243B


Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L17/00 Speaker identification or verification
    • G10L17/04 Training, enrolment or model building
    • G10L21/00 Processing of the speech or voice signal to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/02 Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L21/0316 Speech enhancement, e.g. noise reduction or echo cancellation, by changing the amplitude
    • G10L21/0364 Speech enhancement, e.g. noise reduction or echo cancellation, by changing the amplitude for improving intelligibility

Abstract

The application relates to a speech enhancement method, a model training method, an apparatus and computer equipment. The method comprises: acquiring speech; extracting speech features from the speech; determining, according to the speech, an identity feature for identifying the acoustic identity of the speaker; splicing the speech features and the identity feature to obtain a spliced feature; and processing the spliced feature through a speaker-independent speech enhancement model to obtain speech-enhanced target speech. The scheme provided by the application avoids the problem that the obtained speech is of poor quality because the SI model is not trained on the speaker's own speech, thereby improving the quality of the target speech obtained after speech enhancement.

Description

Speech enhancement method, model training method, device and computer equipment
Technical Field
The present application relates to the field of speech processing technologies, and in particular to a speech enhancement method, a model training method, an apparatus and computer equipment.
Background
The essence of speech enhancement is speech noise reduction: it effectively suppresses various interfering noises in speech, thereby improving the quality and intelligibility of the speech. Speech collected by a microphone usually carries some noise, and speech enhancement processes the noisy speech into noise-free speech.
There are various schemes for implementing speech enhancement. A common practice is to collect the noise-carrying speech and input it into an SI (speaker independent) model to obtain speech-enhanced speech. However, because the SI model is not trained on the speaker's own speech, performing speech enhancement on the speaker's speech with the SI model results in poor-quality processed speech.
Disclosure of Invention
Based on this, it is necessary to provide a speech enhancement method, a model training method, an apparatus and computer equipment that address the technical problem that the quality of the processed speech is poor when a speaker's speech is enhanced using an SI model.
A method of speech enhancement, comprising:
acquiring voice;
extracting speech features from the speech;
determining an identity feature for identifying the acoustic identity of the speaker according to the voice;
splicing the voice features and the identity features to obtain spliced features;
and processing the spliced characteristic through a speaker-independent voice enhancement model to obtain the target voice subjected to voice enhancement.
A speech enhancement apparatus comprising:
The voice acquisition module is used for acquiring voice;
a voice feature extraction module for extracting voice features from the voice;
the identity feature determining module is used for determining identity features for identifying the acoustic identity of the speaker according to the voice;
the characteristic splicing module is used for splicing the voice characteristic and the identity characteristic to obtain a spliced characteristic;
and the processing module is used for processing the splicing characteristics through a speaker-independent voice enhancement model to obtain voice enhanced target voice.
A storage medium storing a computer program which, when executed by a processor, causes the processor to perform the steps of the above-described speech enhancement method.
A computer device comprising a memory and a processor, the memory storing a computer program which, when executed by the processor, causes the processor to perform the steps of the above-described speech enhancement method.
In the above speech enhancement method, apparatus, storage medium and computer equipment, speech features and the identity feature of the speaker's acoustic identity are extracted from the collected speech and spliced, yielding a spliced feature that carries both the speech features and the speaker's acoustic identity. Because the spliced feature carries the identity feature of the speaker's acoustic identity, the speaker-independent speech enhancement model can, when processing the spliced feature, predict from that identity feature and thereby eliminate the noise in the speech, obtaining the speech-enhanced target speech. This avoids the problem in conventional schemes that the resulting speech quality is poor because the SI model is not trained on the speaker's own speech, and thus improves the quality of the target speech obtained after speech enhancement.
A model training method, comprising:
acquiring a noise voice sample and a noiseless voice sample; the noise voice sample and the noiseless voice sample correspond to the same speaker;
extracting training voice characteristics from the noise voice samples;
extracting training reference voice characteristics and training identity characteristics for identifying the acoustic identity of a speaker from the noiseless voice sample;
splicing the training voice characteristics and the training identity characteristics to obtain training splicing characteristics;
and training a speaker-independent voice enhancement model by taking the training splicing characteristics as training input and the training reference voice characteristics as training output.
A model training apparatus comprising:
the sample acquisition module is used for acquiring noise voice samples and noiseless voice samples; the noise voice sample and the noiseless voice sample correspond to the same speaker;
the training feature extraction module is used for extracting training voice features from the noise voice samples; extracting training reference voice characteristics and training identity characteristics for identifying the acoustic identity of a speaker from the noiseless voice sample;
the training characteristic splicing module is used for splicing the training voice characteristics and the training identity characteristics to obtain training splicing characteristics;
And the training module is used for taking the training splicing characteristics as training input, taking the training reference voice characteristics as training output and training a speaker-independent voice enhancement model.
A storage medium storing a computer program which, when executed by a processor, causes the processor to perform the steps of the model training method described above.
A computer device comprising a memory and a processor, the memory storing a computer program which, when executed by the processor, causes the processor to perform the steps of the model training method described above.
In the above model training method, apparatus, storage medium and computer equipment, training speech features are extracted from the noisy speech samples, and training reference speech features and training identity features for identifying the acoustic identity of the speaker are extracted from the noiseless speech samples; the training speech features and the training identity features are then spliced into training splice features. With the training splice features as the training input and the training reference speech features as the training output, the speaker-independent speech enhancement model is trained so that it can predict from the identity feature carried in subsequently input spliced features. This avoids the problem in conventional schemes that the resulting speech quality is poor because the SI model is not trained on the speaker's own speech, and thus improves the quality of the target speech obtained after speech enhancement.
Drawings
FIG. 1 is a flow diagram of a method of speech enhancement in one embodiment;
FIG. 2 is a schematic diagram of converting time domain speech into a frequency spectrum in one embodiment;
FIG. 3 is a flow diagram of acquiring speech features in one embodiment;
FIG. 4 is a schematic diagram of processing splice features in one embodiment;
FIG. 5 is a flowchart illustrating steps for processing speech to obtain speech features in one embodiment;
FIG. 6 is a flow chart illustrating steps for training a speaker independent speech enhancement model in one embodiment;
FIG. 7 is a flow chart illustrating steps for determining identity characteristics of a speaker's acoustic identity in one embodiment;
FIG. 8 is a flowchart illustrating steps performed in one embodiment to process the splice feature to obtain a speech-enhanced target speech;
FIG. 9 is a schematic diagram of a process of processing a noisy speech signal with a bi-directional LSTM model to obtain a speech-enhanced clean speech signal in one embodiment;
FIG. 10 is a flow diagram of a model training method in one embodiment;
FIG. 11 is a flowchart of a method for speech enhancement according to another embodiment;
FIG. 12 is a block diagram of a speech enhancement apparatus in one embodiment;
FIG. 13 is a block diagram of a speech enhancement apparatus according to another embodiment;
FIG. 14 is a block diagram of a model training device in one embodiment;
FIG. 15 is a block diagram of a computer device in one embodiment;
fig. 16 is a block diagram of a computer device in another embodiment.
Detailed Description
The present application will be described in further detail with reference to the drawings and examples, in order to make the objects, technical solutions and advantages of the present application more apparent. It should be understood that the specific embodiments described herein are for purposes of illustration only and are not intended to limit the scope of the application.
As shown in fig. 1, in one embodiment, a speech enhancement method is provided. This embodiment is mainly illustrated by applying the method to a terminal, where the terminal may be a mobile phone, a computer or an intelligent robot. Referring to fig. 1, the speech enhancement method specifically includes the following steps:
S102, voice is acquired.
Wherein the acquired speech may carry noise. The speech may be continuous speech in the time domain or may be discretized speech after sampling.
The manner of obtaining the voice differs with the scene to which the voice enhancement method is applied. Thus, S102 may be analyzed for two scenarios:
Scene 1, voice call scene.
In one embodiment, after the terminal establishes a communication connection with the remote device, the terminal receives the voice transmitted from the remote device, and then performs S104. The received voice may carry noise, which may be background noise, electronic circuit noise, power supply noise, etc. in the environment.
For example, a local user conducts a voice call with a remote user; after the local user's terminal establishes a communication connection with the remote user's remote device, the terminal may receive the voice sent by the remote user. The speech may carry background noise, electronic circuit noise, power supply noise, etc.
Scene 2, a scene of human-machine interaction or speech through a speaker.
In order to prevent the background noise from affecting the recognition of the voice by the intelligent robot, the terminal performs voice enhancement processing on the collected voice before recognizing the voice. Or, in order not to affect the audience's audio-visual effect, the terminal performs a voice enhancement process on the collected voice before playing the voice through the speaker.
In one embodiment, the terminal collects the voice in the environment through the microphone, the voice carries the background noise, and S104 is performed after the voice is collected.
S104, extracting voice characteristics from the voice.
The speech feature may be a logarithmic power spectrum or mel-frequency cepstral coefficients of the speech.
In one embodiment, the terminal performs a Fourier transform on the collected speech, converting the speech in the time domain into a spectrum in the frequency domain. The terminal obtains the amplitude corresponding to the spectrum and calculates the power spectrum according to the power spectral density function.
For example, assuming that the signal expression of the speech is f(t), a Fourier transform of f(t) gives the spectrum, denoted F_T(w), as shown in fig. 2, from which the power spectral density function of f(t) is obtained.
Substituting the amplitude corresponding to the spectrum into the power spectral density function gives the power spectrum (also called the power spectral density) of the speech.
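The figure carrying the power spectral density formula is not reproduced in this text; a minimal sketch of the conventional definition, which this passage appears to rely on, is the following (the per-frame estimate used in practice simply drops the limit and expectation and takes the periodogram |F_T(w)|^2 / T):

$$ S_f(\omega) \;=\; \lim_{T \to \infty} \frac{1}{T}\, \mathbb{E}\!\left[ \left| F_T(\omega) \right|^2 \right] $$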
S106, determining the identity characteristic for identifying the acoustic identity of the speaker according to the voice.
The identity feature of the speaker's acoustic identity is referred to as the i-vector. The speaker information and channel information are concentrated in a global transformation matrix T; the GMM (Gaussian mixture model) mean supervector M, which contains the speaker information and channel information, is projected onto the low-rank global transformation matrix T, thereby obtaining a low-dimensional vector w containing only the speaker information, which is called the i-vector (identity vector).
The i-vector is extracted by means of a GMM-UBM (Gaussian Mixture Model-Universal Background Model), as shown in FIG. 3. The extraction of the i-vector follows the formula:
M = m + Tw
where M is the Gaussian mixture model mean supervector, m is a supervector independent of speaker information and channel information, the overall transformation matrix T is a mapping from a high-dimensional space to a low-dimensional space, and w is the extracted i-vector.
The t-th frame of the i-th speech signal is denoted x_t^(i). Conditioned on the k-th Gaussian component, x_t^(i) is assumed to obey the distribution:
x_t^(i) | k ~ N(u_k + T_k w^(i), Σ_k)
where T_k is the block of the overall transformation matrix for the k-th component, u_k and Σ_k are the mean and variance of the k-th Gaussian, and w^(i) is the i-vector of the i-th speech segment. The posterior probability of x_t^(i) on the k-th state is computed as:
γ_kt^(i) = p(k | x_t^(i))
The zero-order N_k^(i), first-order F_k^(i) and second-order S_k^(i) statistics of the speech signal on the k Gaussian components, relative to the universal background model UBM, are updated by the maximum a posteriori probability and the expectation-maximization (EM) algorithm.
The zero-order N_k^(i), first-order F_k^(i) and second-order S_k^(i) statistics are used to train the overall transformation matrix T, and the prior distribution of x_t^(i) is repeatedly updated with these statistics during subspace training. After the overall transformation matrix T is obtained, the i-vector can be obtained from M = m + Tw.
S108, splicing the voice features and the identity features to obtain spliced features.
In one embodiment, the terminal splices the voice features of the set number of frames with the identity features to obtain spliced features.
For example, the terminal extracts M frames of speaker speech and obtains speech features from each frame, where the dimension of each frame's speech features may be 257 or 512 and the dimension of the i-vector may be 100. The terminal can splice one i-vector onto every M frames of speech features, where M is a positive integer greater than or equal to 1.
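As an illustration of the splicing step, the sketch below appends the utterance-level i-vector to every frame-level feature vector (the simplest M = 1 case of the example above); the function name and the use of numpy are illustrative assumptions, not details given in the patent.

```python
import numpy as np

def splice_features(frame_feats, ivector):
    """Append the utterance-level i-vector to every frame-level feature vector.

    frame_feats: (num_frames, 257) per-frame speech features (e.g. log power spectrum)
    ivector:     (100,) identity feature of the speaker
    returns:     (num_frames, 357) spliced features
    """
    tiled = np.tile(ivector, (frame_feats.shape[0], 1))    # repeat the i-vector per frame
    return np.concatenate([frame_feats, tiled], axis=1)

feats = np.random.randn(100, 257)     # 100 frames of 257-dim features
ivec = np.random.randn(100)           # one 100-dim i-vector for the whole utterance
print(splice_features(feats, ivec).shape)   # (100, 357)
```

For M greater than 1, the M frame features would first be stacked into one vector before the i-vector is appended.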
S110, processing the splicing characteristics through a speaker-independent voice enhancement model to obtain the target voice after voice enhancement.
In one embodiment, the speech enhancement model is trained based on training splice features; training splice characteristics are formed by splicing training voice characteristics extracted from noise voice samples and training identity characteristics extracted from noise-free voice samples; the noisy speech samples and the noiseless speech samples correspond to the same speaker.
The speech enhancement model may be a unidirectional or bidirectional LSTM (Long Short-Term Memory) network model, or may be another neural network model, such as a recurrent neural network model, a convolutional neural network model, and the like.
As an example, as shown in fig. 4, M-frame 257-dimensional speech features are stitched with 100-dimensional i-vectors, and the stitched features are input into a neural network model. Wherein M can be any positive integer from 1 to 10.
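As a hedged illustration of what such a speaker-independent enhancement network could look like, the following PyTorch sketch builds a bidirectional LSTM that maps the spliced feature of the previous sketch (257 + 100 = 357 dimensions per frame in the M = 1 case) to an enhanced 257-dimensional feature; the hidden size, number of layers and class name are illustrative assumptions, not values given in the patent.

```python
import torch
import torch.nn as nn

class BiLSTMEnhancer(nn.Module):
    """Sketch of a speaker-independent enhancement model over spliced features."""

    def __init__(self, input_dim=257 + 100, hidden_dim=512, output_dim=257, num_layers=2):
        super().__init__()
        self.lstm = nn.LSTM(input_dim, hidden_dim, num_layers=num_layers,
                            batch_first=True, bidirectional=True)
        self.proj = nn.Linear(2 * hidden_dim, output_dim)   # 2x for the two directions

    def forward(self, spliced):              # spliced: (batch, time, input_dim)
        hidden, _ = self.lstm(spliced)       # (batch, time, 2 * hidden_dim)
        return self.proj(hidden)             # enhanced features: (batch, time, output_dim)

# One noisy utterance: 100 frames, each 257-dim feature spliced with a 100-dim i-vector
model = BiLSTMEnhancer()
enhanced = model(torch.randn(1, 100, 357))   # (1, 100, 257)
```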
In the above embodiment, speech features and the identity feature of the speaker's acoustic identity are extracted from the collected speech and spliced, yielding a spliced feature that carries both. Because the spliced feature carries the identity feature of the speaker's acoustic identity, the speaker-independent speech enhancement model can, when processing the spliced feature, predict from that identity feature and thereby eliminate the noise in the speech, obtaining the speech-enhanced target speech. This avoids the problem in conventional schemes that the resulting speech quality is poor because the SI model is not trained on the speaker's own speech, and thus improves the quality of the target speech obtained after speech enhancement.
In one embodiment, as shown in fig. 5, S104 may specifically include:
S502, framing and windowing are carried out on the voice.
N samples are combined into one observation unit, called a frame; the value of N may be 257, and the frame length is about 20 ms to 30 ms.
The speech of each frame is multiplied by a window function to increase the continuity between the current frame and its neighboring frames. The window function may be a Hamming window. Assuming that each frame of speech obtained after framing is x(n), where n is the sample index within the frame (an integer greater than or equal to 0), the windowed frame is x'(n) = x(n) × h(n), where the Hamming window h(n) is:
h(n) = (1 - a) - a · cos(2πn / (N - 1)), 0 ≤ n ≤ N - 1
Different values of a produce different Hamming windows. Besides the Hamming window, the window function may also be a Hanning window, a Gaussian window, a Blackman window, or the like.
S504, converting each frame of voice obtained after processing to obtain the frequency spectrum of each frame of voice.
In one embodiment, the terminal transforms each processed frame of speech to obtain its spectrum in the frequency domain. The transform may be a Fourier transform, a fast Fourier transform, a discrete Fourier transform, or the like; the embodiments of the present invention do not specifically limit the transform, and an appropriate transform may be selected according to the actual situation.
S506, determining the voice characteristics according to the frequency spectrum of each frame of voice.
The speech feature may be a logarithmic power spectrum or mel-frequency cepstral coefficients of the speech, and thus S506 may be discussed for the following two scenarios:
Scenario 1, speech features a logarithmic power spectrum of speech.
In one embodiment, S506 may specifically include: the terminal determines a power spectrum according to the spectrum of each frame of speech; obtains the logarithmic power spectrum corresponding to the power spectrum; and determines the logarithmic power spectrum as the speech feature.
For example, assuming that the signal expression of the collected speech is x(n) and the speech after framing and windowing is x'(n) = x(n) × h(n), a discrete Fourier transform is performed on the windowed speech x'(n) to obtain the corresponding spectrum:
X(k) = Σ_{n=0}^{N-1} x'(n) e^(-j2πkn/N), 0 ≤ k ≤ N - 1
where N represents the number of points of the discrete Fourier transform.
When the frequency spectrum of each frame of voice is obtained, the terminal calculates a corresponding power spectrum, and obtains a logarithmic value of the power spectrum, thereby obtaining a corresponding voice characteristic.
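A minimal numpy sketch of the pipeline of S502 to S506 for scenario 1 (framing, Hamming windowing, DFT, power spectrum, logarithm) is given below; the frame length, hop size and FFT size are illustrative assumptions rather than values fixed by the patent.

```python
import numpy as np

def log_power_spectrum(signal, frame_len=512, hop=256, eps=1e-10):
    """Frame, window and transform a time-domain signal into per-frame log power spectra."""
    window = np.hamming(frame_len)                       # h(n), the Hamming window
    num_frames = 1 + (len(signal) - frame_len) // hop
    feats = []
    for i in range(num_frames):
        frame = signal[i * hop:i * hop + frame_len]      # framing
        spectrum = np.fft.rfft(frame * window)           # windowing + DFT
        power = np.abs(spectrum) ** 2                    # power spectrum
        feats.append(np.log(power + eps))                # log power spectrum (257 bins for N = 512)
    return np.stack(feats)

speech = np.random.randn(16000)                          # e.g. one second of audio at 16 kHz
print(log_power_spectrum(speech).shape)                  # (num_frames, 257)
```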
Scene 2, speech features are mel-frequency cepstral coefficients for speech.
In one embodiment, S506 may specifically include: the terminal determines a power spectrum according to the spectrum of each frame of speech; obtains the logarithmic power spectrum corresponding to the power spectrum; and determines the result of a discrete cosine transform of the logarithmic power spectrum as the speech feature.
After the logarithmic power spectrum is obtained as in scenario 1, the terminal passes it through a bank of Mel-scale triangular filters and applies a discrete cosine transform to obtain the Mel-frequency cepstral coefficients:
C(l) = Σ_{m=1}^{M} log E(m) · cos(πl(m - 0.5) / M), l = 1, 2, ..., L
Bringing the logarithmic energies E(m) into the discrete cosine transform yields the L-order Mel-frequency cepstral parameters, where L is the order of the Mel-frequency cepstral coefficients and may take a value of 12 to 16, and M is the number of triangular filters.
In the above embodiment, the collected voice is processed by framing and windowing so that the voice signal becomes smooth and continuous. Each frame of voice obtained after framing and windowing is converted into a frequency spectrum, and the voice characteristics are determined according to the frequency spectrum of each frame of voice, so that the obtained voice characteristics are more accurate.
In one embodiment, as shown in fig. 6, the method may further comprise:
s602, acquiring a noise voice sample and a noiseless voice sample; the noisy speech samples and the noiseless speech samples correspond to the same speaker.
Wherein, noise speech samples refer to noise-carrying speech for training. Noiseless speech samples refer to speech that is not carrying noise for training. Both the noisy speech samples and the corresponding noiseless speech samples are taken from the same speaker. For example, the noise speech sample a1 and the corresponding noise-free speech sample b1 are both taken from the same person. The noise speech sample a2 and the corresponding noise-free speech sample b2 are both taken from the same person. While noise speech sample a1 and noise speech sample a2 are from different speakers.
S604, extracting training voice characteristics from the noise voice sample.
The training speech features may be a logarithmic power spectrum or mel-frequency cepstral coefficients of the speech.
In one embodiment, the terminal fourier transforms the collected noise speech samples to convert speech in the time domain to a frequency spectrum in the frequency domain. The terminal obtains the amplitude corresponding to the frequency spectrum, and calculates the power spectrum according to the power density function.
For example, assuming that the signal expression of the noise speech sample is f(t), a Fourier transform of f(t) gives the spectrum F_T(w), as shown in fig. 2, from which the power spectral density function of f(t) is obtained.
Substituting the amplitude corresponding to the spectrum F_T(w) into the power spectral density function gives the power spectrum of the noise speech sample.
S606, extracting training reference voice characteristics and training identity characteristics for identifying the acoustic identity of the speaker from the noiseless voice sample.
Wherein the training identity characteristic of the acoustic identity of the speaker is referred to as i-vector.
For the manner of extracting the training reference speech features from the noiseless speech sample, reference may be made to S604, which is not repeated here. For the manner of extracting the training identity feature for identifying the acoustic identity of the speaker from the noiseless speech sample, reference may be made to S106, which is not repeated here.
And S608, splicing the training voice features and the training identity features to obtain training splicing features.
In one embodiment, the terminal splices training speech features of a set number of frames with training identity features to obtain training splice features.
For example, the terminal extracts 100 frames of noise speech samples, and obtains training speech features from each frame of noise speech samples. Where the dimensions of the training speech features for each frame may be 257 or 512. The dimension of the i-vector may be 100. The terminal can splice an i-vector for each M frames of training speech features, M being a positive integer greater than or equal to 1.
S610, training the speech enhancement model by taking training splice features as training input and training reference speech features as training output.
In one embodiment, the terminal trains the speech enhancement model with training splice features as training inputs to obtain the output features after training. And the terminal compares the difference between the characteristics output after training and the training reference voice characteristics and adjusts the parameters in the voice enhancement model. And stopping training the voice enhancement model when the difference between the characteristics output after training and the training reference voice characteristics meets the preset condition, so as to obtain a trained voice enhancement model.
As an example, as shown in fig. 4, the terminal concatenates training speech features in M frames 257 dimensions with i-vector in 100 dimensions, and inputs the concatenated features into a neural network model. The terminal takes the training reference voice characteristics as training output to train the voice enhancement model. Wherein M can be any positive integer from 1 to 10.
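A hedged sketch of this training procedure (training splice features as input, training reference features as target, parameters adjusted until the difference meets a preset condition) could look as follows in PyTorch, reusing the BiLSTMEnhancer sketched earlier; the optimizer, MSE loss and stopping threshold are illustrative assumptions.

```python
import torch
import torch.nn as nn

def train_enhancer(model, train_pairs, epochs=10, lr=1e-3, stop_loss=1e-3):
    """train_pairs: list of (spliced, reference) tensor pairs, where
    spliced   has shape (batch, time, 357)  -- splice features from the noisy samples
    reference has shape (batch, time, 257)  -- reference features from the clean samples
    """
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    criterion = nn.MSELoss()                 # difference between model output and reference
    for _ in range(epochs):
        total = 0.0
        for spliced, reference in train_pairs:
            optimizer.zero_grad()
            loss = criterion(model(spliced), reference)
            loss.backward()                  # adjust the parameters of the enhancement model
            optimizer.step()
            total += loss.item()
        if total / max(1, len(train_pairs)) < stop_loss:   # preset stopping condition
            break
    return model
```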
In the above embodiment, training speech features are extracted from the noisy speech samples, and training reference speech features and training identity features for identifying the acoustic identity of the speaker are extracted from the noiseless speech samples; the training speech features and the training identity features are then spliced into training splice features. With the training splice features as the training input and the training reference speech features as the training output, the speaker-independent speech enhancement model is trained so that it can predict from the identity feature carried in subsequently input spliced features. This avoids the problem in conventional schemes that the resulting speech quality is poor because the SI model is not trained on the speaker's own speech, and thus improves the quality of the target speech obtained after speech enhancement.
In one embodiment, as shown in fig. 7, S106 may specifically include:
S702, processing the extracted voice features through an identity feature extraction model to obtain an overall transformation matrix corresponding to the identity features of the speaker acoustic identity.
The identity feature extraction model may be a GMM-UBM model, among others.
The expression for obtaining the i-vector is M = m + Tw, where M is the Gaussian mixture model mean supervector, m is a supervector independent of the speaker and the channel, the overall transformation matrix T is a mapping from a high-dimensional space to a low-dimensional space, and w is the extracted i-vector.
In one embodiment, the terminal extracts a third training speech feature from the acquired noisy speech samples; inputs the extracted third training speech feature into the identity feature extraction model to obtain a training overall transformation matrix corresponding to the identity feature of the speaker's acoustic identity; extracts training identity feature parameters from the speech according to the training overall transformation matrix; and adjusts the identity feature extraction model according to the difference between the training identity feature parameters and preset target identity feature parameters until the training stop condition is met.
S704, extracting identity characteristic parameters from the voice according to the overall transformation matrix.
In one embodiment, after training of the identity feature extraction model is completed, a trained identity feature extraction model and the trained overall transformation matrix are obtained. The terminal then obtains the i-vector parameter w through M = m + Tw.
Since the i-vector satisfies M = m + Tw, once the overall transformation matrix T is obtained, the terminal extracts the phoneme-related i-vector parameters from the noise-carrying speech through the overall transformation matrix. Because the dimension of these i-vector parameters is high, dimension-reduction processing is required when they are applied.
S706, reducing the dimension of the extracted identity characteristic parameters to obtain the identity characteristics for identifying the acoustic identity of the speaker.
In the above embodiment, the extracted speech features are processed through the identity feature extraction model to obtain the overall transformation matrix corresponding to the identity feature of the speaker's acoustic identity, and the identity feature for identifying the speaker's acoustic identity is obtained through the overall transformation matrix. The speech features and the identity feature are then spliced and input into the speech enhancement model; because the speech enhancement model can recognize and predict the identity feature, the noise in the speech can be eliminated, yielding the speech-enhanced target speech.
In one embodiment, as shown in fig. 8, S108 may specifically include:
S802, normalizing the spliced feature.
In one embodiment, the terminal calculates the mean and variance of the spliced feature and normalizes the spliced feature according to the calculated mean and variance. The spliced feature may be expressed as a vector or a matrix.
For example, assuming that the spliced feature is L and the mean and variance calculated by the terminal are u and δ respectively, the normalized result is L' = (L - u)/δ.
S804, processing the spliced characteristic obtained by normalization processing through a speaker-independent voice enhancement model.
In one embodiment, the terminal inputs the spliced feature obtained by normalization into the speaker-independent speech enhancement model, which processes it to obtain the enhanced feature. The enhanced feature may be expressed as L'_enhance = s(L').
S806, performing inverse normalization processing on the output processed by the voice enhancement model.
In one embodiment, the terminal obtains the mean and variance of the spliced feature used in the normalization process and performs inverse normalization on the output of the speech enhancement model using them. The inverse normalization may be computed as L'' = L'_enhance × δ + u.
S808, converting the result after the inverse normalization processing to obtain the time domain target voice after voice enhancement.
In one embodiment, before the step of converting the result after the inverse normalization process, the terminal performs an exponential operation on the result after the inverse normalization process to obtain an amplitude spectrum in the frequency domain.
In one embodiment, S808 may specifically include: the terminal performs an exponential operation on the inverse-normalized result, reconstructs the complex spectrum from the resulting amplitude spectrum and the corresponding phase information, and then performs an inverse Fourier transform on the complex spectrum to obtain the speech-enhanced time-domain target speech.
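A minimal sketch of the inference-side steps of S802 to S808 (normalization, enhancement, inverse normalization, exponentiation back to an amplitude spectrum and inverse transformation to the time domain) is given below, assuming log-power-spectrum features and reuse of the noisy phase; the scale is taken here as the standard deviation, and all names are illustrative.

```python
import numpy as np

def enhance_utterance(log_power, phase, model_fn, hop=256):
    """log_power: (num_frames, 257) log power spectrum of the noisy speech
    phase:     (num_frames, 257) phase of the noisy spectrum, reused for reconstruction
    model_fn:  callable wrapping the speaker-independent enhancement model
    """
    u, d = log_power.mean(), log_power.std()
    normalized = (log_power - u) / d                  # S802: normalization
    enhanced = model_fn(normalized)                   # S804: speech enhancement model
    denorm = enhanced * d + u                         # S806: inverse normalization
    amplitude = np.sqrt(np.exp(denorm))               # exponential: log power -> amplitude
    spectrum = amplitude * np.exp(1j * phase)         # combine with the phase information
    frames = np.fft.irfft(spectrum, axis=1)           # inverse Fourier transform per frame
    out = np.zeros(hop * (len(frames) - 1) + frames.shape[1])
    for i, frame in enumerate(frames):                # overlap-add back to a waveform
        out[i * hop:i * hop + frames.shape[1]] += frame
    return out                                        # S808: time-domain target speech
```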
As an example, as shown in fig. 9, when the speech enhancement model is a bi-directional LSTM model, a noisy speech signal of a speaker is collected from a target environment, a speech feature and an i-vector are extracted from the noisy speech signal, and the extracted speech feature and the i-vector are spliced and normalized. And inputting the spliced feature vectors into a bidirectional LSTM model for voice enhancement processing. And carrying out inverse normalization processing and characteristic inverse transformation on the output of the bidirectional LSTM model, thereby obtaining a clean voice signal without noise.
In the above embodiment, before the splicing feature is input into the speaker-independent speech enhancement model, normalization processing is performed on the splicing feature, so that the convergence speed of the speech enhancement model is increased when the splicing feature is processed, the time for obtaining the time domain target speech is shortened, and the speech enhancement efficiency is improved.
FIG. 1 is a flowchart of a speech enhancement method in one embodiment. It should be understood that, although the steps in the flowchart of fig. 1 are shown in sequence as indicated by the arrows, they are not necessarily performed in that sequence. Unless explicitly stated herein, the execution of these steps is not strictly limited in order, and the steps may be executed in other orders. Moreover, at least some of the steps in fig. 1 may include multiple sub-steps or stages, which are not necessarily performed at the same time but may be performed at different times, and which are not necessarily performed in sequence but may be performed in turn or alternately with at least a portion of other steps or of the sub-steps or stages of other steps.
As shown in FIG. 10, in one embodiment, a model training method is provided. This embodiment is mainly illustrated by applying the method to a terminal, where the terminal may be a mobile phone, a computer or an intelligent robot. Referring to fig. 10, the model training method specifically includes the following steps:
S1002, acquiring a noise voice sample and a noiseless voice sample; the noisy speech samples and the noiseless speech samples correspond to the same speaker.
Wherein, noise speech samples refer to noise-carrying speech for training. Noiseless speech samples refer to speech that is not carrying noise for training. Both the noisy speech samples and the corresponding noiseless speech samples are taken from the same speaker. For example, the noise speech sample a1 and the corresponding noise-free speech sample b1 are both taken from the same person. The noise speech sample a2 and the corresponding noise-free speech sample b2 are both taken from the same person. While noise speech sample a1 and noise speech sample a2 are from different speakers.
S1004, extracting training voice characteristics from the noise voice samples.
The training speech features may be a logarithmic power spectrum or mel-frequency cepstral coefficients of the speech.
In one embodiment, the terminal fourier transforms the collected noise speech samples to convert speech in the time domain to a frequency spectrum in the frequency domain. The terminal obtains the amplitude corresponding to the frequency spectrum, and calculates the power spectrum according to the power density function.
For example, assuming that the signal expression of the noise speech sample is f(t), a Fourier transform of f(t) gives the spectrum F_T(w), as shown in fig. 2, from which the power spectral density function of f(t) is obtained.
Substituting the amplitude corresponding to the spectrum F_T(w) into the power spectral density function gives the power spectrum of the noise speech sample.
S1006, training reference voice characteristics and training identity characteristics for identifying the acoustic identity of the speaker are extracted from the noiseless voice sample.
The training identity feature of the speaker's acoustic identity is referred to as the i-vector.
For the way to extract the training reference speech features from the noiseless speech samples, reference may be made to S1004, which is not described here again.
For the manner of extracting the training identity feature for identifying the acoustic identity of the speaker from the noiseless speech sample, the i-vector is extracted by means of a GMM-UBM (Gaussian Mixture Model-Universal Background Model), as shown in fig. 3. The extraction of the i-vector follows the formula:
M = m + Tw
where M is the Gaussian mixture model mean supervector, m is a supervector independent of the speaker and the channel, the overall transformation matrix T is a mapping from a high-dimensional space to a low-dimensional space, and w is the extracted i-vector.
The t-th frame of the i-th speech signal is denoted x_t^(i). Conditioned on the k-th Gaussian component, x_t^(i) is assumed to obey the distribution:
x_t^(i) | k ~ N(u_k + T_k w^(i), Σ_k)
where T_k is the block of the overall transformation matrix for the k-th component, u_k and Σ_k are the mean and variance of the k-th Gaussian, and w^(i) is the i-vector of the i-th speech segment. The posterior probability of x_t^(i) on the k-th state is computed as:
γ_kt^(i) = p(k | x_t^(i))
The zero-order N_k^(i), first-order F_k^(i) and second-order S_k^(i) statistics of the speech signal on the k Gaussian components, relative to the universal background model UBM, are updated by the maximum a posteriori probability and the expectation-maximization (EM) algorithm.
The zero-order N_k^(i), first-order F_k^(i) and second-order S_k^(i) statistics are used to train the overall transformation matrix T, and the prior distribution of x_t^(i) is repeatedly updated with these statistics during subspace training. After the overall transformation matrix T is obtained, the i-vector can be obtained from M = m + Tw.
S1008, splicing the training voice features and the training identity features to obtain training splicing features.
In one embodiment, the terminal splices training speech features of a set number of frames with training identity features to obtain training splice features.
For example, the terminal extracts 100 frames of noise speech samples, and obtains training speech features from each frame of noise speech samples. Where the dimension of each frame training speech feature may be 257. The dimension of the i-vector may be 100. The terminal can splice an i-vector for each M frames of training speech features, M being a positive integer greater than or equal to 1.
S1010, training the speech enhancement model by taking training splice features as training input and training reference speech features as training output.
In one embodiment, the terminal trains the speech enhancement model with training splice features as training inputs to obtain the output features after training. And the terminal compares the difference between the characteristics output after training and the training reference voice characteristics and adjusts the parameters in the voice enhancement model. And stopping training the voice enhancement model when the difference between the characteristics output after training and the training reference voice characteristics meets the preset condition, so as to obtain a trained voice enhancement model.
As an example, as shown in fig. 4, the terminal concatenates training speech features in M frames 257 dimensions with i-vector in 100 dimensions, and inputs the concatenated features into a neural network model. The terminal takes the training reference voice characteristics as training output to train the voice enhancement model. Wherein M can be any positive integer from 1 to 10.
In the above embodiment, training speech features are extracted from the noisy speech samples, and training reference speech features and training identity features for identifying the acoustic identity of the speaker are extracted from the noiseless speech samples; the training speech features and the training identity features are then spliced into training splice features. With the training splice features as the training input and the training reference speech features as the training output, the speaker-independent speech enhancement model is trained so that it can predict from the identity feature carried in subsequently input spliced features. This avoids the problem in conventional schemes that the resulting speech quality is poor because the SI model is not trained on the speaker's own speech, and thus improves the quality of the target speech obtained after speech enhancement.
In one embodiment, S1006 may specifically include: extracting training reference voice characteristics from a noise-free voice sample; and processing training reference voice characteristics through the identity characteristic extraction model, and reducing the dimension of the processed result to obtain training identity characteristics for identifying the acoustic identity of the speaker.
In one embodiment, the method further comprises: acquiring voice; extracting voice features from voice; determining an identity feature for identifying the acoustic identity of the speaker according to the voice; splicing the voice features and the identity features to obtain spliced features; and processing the spliced characteristic through a speaker-independent voice enhancement model to obtain the target voice subjected to voice enhancement.
In one embodiment, the step of extracting speech features from speech comprises: framing and windowing the voice; converting each frame of voice obtained after processing to obtain the frequency spectrum of each frame of voice; the speech characteristics are determined from the spectrum of each frame of speech.
In one embodiment, the step of determining speech features from the spectrum of each frame of speech comprises: determining a power spectrum according to the frequency spectrum of each frame of voice; obtaining a logarithmic power spectrum corresponding to the power spectrum; the log power spectrum is determined as a speech feature, or the result of discrete cosine transform of the log power spectrum is determined as a speech feature.
In one embodiment, the step of determining an identity feature for identifying the acoustic identity of the speaker from the speech may comprise: processing the extracted voice features through an identity feature extraction model to obtain an overall transformation matrix corresponding to the identity features of the speaker acoustic identity; extracting identity characteristic parameters from the voice according to the overall transformation matrix; and reducing the dimension of the extracted identity characteristic parameters to obtain the identity characteristics for identifying the acoustic identity of the speaker.
In one embodiment, the method may further comprise: extracting a third training voice feature from the acquired noise voice sample; inputting the extracted third training voice features into an identity feature extraction model to obtain a training overall transformation matrix corresponding to the identity features of the acoustic identity of the speaker; extracting training identity characteristic parameters from the voice according to the training overall transformation matrix; and adjusting the identity characteristic extraction model according to the difference between the training identity characteristic parameter and the preset target identity characteristic parameter until the training stopping condition is met.
In one embodiment, the step of processing the splice features through a speaker independent speech enhancement model to obtain speech enhanced target speech may include: normalizing the splicing characteristics; processing the splicing characteristics obtained by normalization processing through a speaker-independent voice enhancement model; performing inverse normalization processing on the output processed by the voice enhancement model; and converting the result after the inverse normalization processing to obtain the time domain target voice after voice enhancement.
In the above embodiment, the training identity feature for identifying the acoustic identity of the speaker and the training reference speech features are obtained; the training identity feature is spliced with the training speech features and input into the speech enhancement model, with the training reference speech features as the training target, so that training yields a speech enhancement model capable of predicting the identity feature.
FIG. 10 is a flowchart of a model training method in one embodiment. It should be understood that, although the steps in the flowchart of fig. 10 are shown in sequence as indicated by the arrows, they are not necessarily performed in that sequence. Unless explicitly stated herein, the execution of these steps is not strictly limited in order, and the steps may be executed in other orders. Moreover, at least some of the steps in fig. 10 may include multiple sub-steps or stages, which are not necessarily performed at the same time but may be performed at different times, and which are not necessarily performed in sequence but may be performed in turn or alternately with at least a portion of other steps or of the sub-steps or stages of other steps.
Because speech characteristics differ from person to person, the voices of many different people are used as training data when training a speech recognition system; this is known as speaker-independent (SI) training. In the conventional approach, for an SI-trained speech enhancement model, the Gaussian mixture model for each phone covers the feature vectors of all speakers, in the expectation that it can also cover the feature vectors of a new speaker at test time. Because an SI-trained speech enhancement model is not trained on the speaker's own speech, using it for speech enhancement gives poor results.
In addition, in traditional SD (speaker dependent) training, a Gaussian mixture model is trained on each individual's voice to fit that person's data. At test time, each speaker's own speech enhancement model is used. Because the Gaussian mixture model is trained on the speaker's own data, an SD-trained speech enhancement model gives much better enhancement than an SI-trained one. However, SD training requires collecting a large amount of data, and speech enhancement can only be realized with a model trained on the speaker's own speech.
In order to solve the above problems, an embodiment of the present invention proposes a speech enhancement method, as shown in fig. 11, which may include:
S1102, collecting speaker speech carrying noise, framing and windowing the collected speaker speech, and extracting speech features of the speaker speech.
The extracted speech features may be a logarithmic power spectrum or MFCCs (Mel-Frequency Cepstral Coefficients) of the speaker's speech.
For example, as shown in fig. 2, the collected speech is framed and windowed, an FFT (Fast Fourier Transform) is performed on each frame of speech, the power spectrum is computed from the FFT result, and its logarithm is taken to obtain the logarithmic power spectrum, which serves as the speech feature of the speaker.
S1104, extracting the i-vector from the collected speaker voice carrying noise.
S1106, splicing the extracted speech features with the i-vector to obtain spliced features.
For example, as shown in fig. 4, M frames of speaker speech are extracted and speech features are obtained from each frame; each frame's speech feature has a dimension of 257, and the dimension of the i-vector is 100. M is a positive integer greater than or equal to 1.
S1108, inputting the spliced features into the trained bidirectional LSTM model.
In one embodiment, the terminal normalizes the splice features, and inputs the splice features after normalization into a bi-directional LSTM model for speech enhancement.
S1110, processing the splicing characteristic through a bidirectional LSTM model to obtain a voice signal without noise.
The trained bidirectional LSTM model performs speech enhancement on the input spliced features; the output of the bidirectional LSTM model is then inverse-normalized and inverse-transformed, converting it from the frequency domain back to the time domain to obtain the enhanced time-domain speech.
The training method of the bidirectional LSTM model may include: obtaining a noisy speech sample and a noiseless speech sample, extracting training speech features from the noisy speech sample, and extracting training reference speech features and the i-vector from the noiseless speech sample. The i-vector is combined with the training speech features to obtain a combined feature vector. The bidirectional LSTM model is trained with the combined feature vector as the training input and the training reference speech features as the training output, and the parameters in the bidirectional LSTM model are continuously adjusted to obtain the finally required bidirectional LSTM model.
As an example, as shown in fig. 9, when the speech enhancement model is a bi-directional LSTM model, a noisy speech signal of a speaker is collected from a target environment, a speech feature and an i-vector are extracted from the noisy speech signal, and the extracted speech feature and the i-vector are spliced and normalized. And inputting the spliced feature vectors into a bidirectional LSTM model for voice enhancement processing. And carrying out inverse normalization processing and characteristic inverse transformation on the output of the bidirectional LSTM model, thereby obtaining a clean voice signal without noise.
The implementation of the embodiment of the invention has the following beneficial effects:
In the training stage, there is no need to collect speech from every speaker, which reduces the workload: speaker-independent speech is used as the training samples, and the bidirectional LSTM model is trained using what the voices of all speakers have in common. At test time, there is no need to know who the speaker is or to select a specific bidirectional LSTM model; the bidirectional LSTM model is matched directly against the collected speaker speech. When performing speech enhancement, the test speaker's speech can be enhanced without using a bidirectional LSTM model trained on that speaker's own speech.
In addition, as the i-vector is added as input when the bidirectional LSTM model is trained, the bidirectional LSTM model obtained after training has the function of predicting the voice characteristics of other speakers, thereby improving the voice enhancement effect of the SI model.
As shown in fig. 12, in one embodiment, there is provided a voice enhancement apparatus, which specifically includes: a voice acquisition module 1202, a voice feature extraction module 1204, an identity feature determination module 1206, a feature stitching module 1208, and a processing module 1210; wherein:
a voice acquisition module 1202 for acquiring voice;
a voice feature extraction module 1204, configured to extract voice features from the voice;
an identity determination module 1206 for determining an identity for identifying the acoustic identity of the speaker from the speech;
a feature stitching module 1208, configured to stitch the voice feature and the identity feature to obtain a stitched feature;
the processing module 1210 is configured to process the concatenation feature through a speaker independent speech enhancement model to obtain a speech enhanced target speech.
The voice enhancement model is trained according to training splicing characteristics; the training splice features are formed by splicing training voice features extracted from noise voice samples and training identity features extracted from noise-free voice samples; the noise speech samples and the noiseless speech samples correspond to the same speaker.
In the above embodiment, speech features and the identity feature of the speaker's acoustic identity are extracted from the collected speech and spliced, yielding a spliced feature that carries both. Because the spliced feature carries the identity feature of the speaker's acoustic identity, the speaker-independent speech enhancement model can, when processing the spliced feature, predict from that identity feature and thereby eliminate the noise in the speech, obtaining the speech-enhanced target speech. This avoids the problem in conventional schemes that the resulting speech quality is poor because the SI model is not trained on the speaker's own speech, and thus improves the quality of the target speech obtained after speech enhancement.
In one embodiment, the voice feature extraction module 1204 is further configured to frame and window the voice; converting each frame of voice obtained after processing to obtain the frequency spectrum of each frame of voice; and determining the voice characteristics according to the frequency spectrums of the voice frames.
In one embodiment, the voice feature extraction module 1204 is further configured to determine a power spectrum from the spectrum of each frame of voice; obtaining a logarithmic power spectrum corresponding to the power spectrum; and determining the logarithmic power spectrum as a voice characteristic or determining a result obtained by discrete cosine transform of the logarithmic power spectrum as a voice characteristic.
In the above embodiment, the collected voice is processed by framing and windowing so that the voice signal becomes smooth and continuous. Each frame of voice obtained after framing and windowing is converted into a frequency spectrum, and the voice characteristics are determined according to the frequency spectrum of each frame of voice, so that the obtained voice characteristics are more accurate.
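The framing, windowing, and spectral steps above can be illustrated with a short sketch. The frame length, hop size, and choice of a Hamming window are assumed values (25 ms frames with a 10 ms hop at 16 kHz are common defaults); the disclosure itself does not fix them.

```python
import numpy as np
from scipy.fft import dct

def extract_features(signal, frame_len=400, hop=160, use_dct=False):
    """Framing, Hamming windowing, FFT, power spectrum, log power spectrum,
    and optionally a DCT of the log power spectrum, as described above."""
    window = np.hamming(frame_len)
    n_frames = 1 + (len(signal) - frame_len) // hop
    frames = np.stack([signal[i * hop:i * hop + frame_len] * window
                       for i in range(n_frames)])
    spectrum = np.fft.rfft(frames, axis=1)               # spectrum of each frame
    log_power = np.log(np.abs(spectrum) ** 2 + 1e-10)    # log power spectrum
    return dct(log_power, axis=1, norm='ortho') if use_dct else log_power
```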
In one embodiment, as shown in fig. 13, the apparatus further comprises: a sample acquisition module 1212, a training feature extraction module 1214, a training feature stitching module 1216, and a first training module 1218; wherein:
a sample acquisition module 1212 for acquiring noise speech samples and noise-free speech samples; the noise voice sample and the noiseless voice sample correspond to the same speaker;
a training feature extraction module 1214, configured to extract training speech features from the noise speech samples; extracting training reference voice characteristics and training identity characteristics for identifying the acoustic identity of a speaker from the noiseless voice sample;
the training feature stitching module 1216 is configured to stitch the training speech feature and the training identity feature to obtain a training stitching feature;
the first training module 1218 is configured to train the speech enhancement model with the training concatenation feature as a training input and the training reference speech feature as a training output.
In the above embodiment, the training speech features are extracted from the noise speech samples, and the training reference speech features and the training identity features for identifying the acoustic identity of the speaker are extracted from the noise-free speech samples, so as to obtain training splice features formed by splicing the training speech features and the training identity features. The training splice features are used as the training input and the training reference speech features as the training output to train the speaker-independent speech enhancement model, so that the trained speech enhancement model can predict the identity feature of the speaker's acoustic identity in subsequently input splice features. This avoids the problem in the traditional scheme that the obtained speech is of poor quality because the SI model is not trained on the speaker's own speech, and improves the quality of the target speech obtained after speech enhancement.
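A hedged sketch of such a training setup is given below, using a bidirectional LSTM as in the earlier embodiment. The layer sizes, feature dimensions, and the use of PyTorch with a mean-squared-error loss are illustrative assumptions, not requirements of the scheme.

```python
import torch
import torch.nn as nn

class BiLSTMEnhancer(nn.Module):
    """Illustrative speaker-independent enhancer: the input dimension is the
    spliced (speech feature + identity feature) size, the output dimension is
    the clean reference feature size. The sizes here are assumptions."""
    def __init__(self, spliced_dim=357, ref_dim=257, hidden=256, layers=2):
        super().__init__()
        self.lstm = nn.LSTM(spliced_dim, hidden, layers,
                            batch_first=True, bidirectional=True)
        self.out = nn.Linear(2 * hidden, ref_dim)

    def forward(self, x):                      # x: (batch, frames, spliced_dim)
        h, _ = self.lstm(x)
        return self.out(h)

def train_step(model, optimizer, spliced_batch, reference_batch):
    """One training step: spliced training features in, clean reference features out."""
    optimizer.zero_grad()
    loss = nn.functional.mse_loss(model(spliced_batch), reference_batch)
    loss.backward()
    optimizer.step()
    return loss.item()
```

Because the identity feature is part of every input, a single such model can be applied at test time to any speaker's spliced features, which is what makes it speaker-independent.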
In one embodiment, the identity feature determination module 1206 is further configured to process the extracted speech features through an identity feature extraction model to obtain an overall transformation matrix corresponding to the identity features of the speaker's acoustic identity; extract identity characteristic parameters from the voice according to the overall transformation matrix; and reduce the dimension of the extracted identity characteristic parameters to obtain the identity features for identifying the acoustic identity of the speaker.
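A simplified sketch of this extraction step follows, treating the overall transformation matrix as what the i-vector literature calls the total variability matrix. It assumes that zeroth-order and centered first-order Baum-Welch statistics against a universal background model have already been computed, and that a linear projection (for example PCA or LDA) performs the final dimension reduction; all names and shapes below are illustrative assumptions rather than details taken from the disclosure.

```python
import numpy as np

def extract_identity_feature(stats_N, stats_F, T, sigma, reduction_matrix):
    """stats_N: zeroth-order statistics expanded to the supervector dimension,
    stats_F: centered first-order statistics (supervector dimension),
    T: overall (total variability) transformation matrix, shape (CF, R),
    sigma: diagonal UBM covariance flattened to length CF,
    reduction_matrix: linear dimension-reduction projection, shape (d, R)."""
    T_sig = T.T / sigma                                     # T' * Sigma^{-1}
    precision = np.eye(T.shape[1]) + T_sig @ (stats_N[:, None] * T)
    ivector = np.linalg.solve(precision, T_sig @ stats_F)   # identity characteristic parameters
    return reduction_matrix @ ivector                       # reduced-dimension identity feature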
In one embodiment, the apparatus further comprises a second training module 1220; wherein: the second training module 1220 is configured to extract a third training speech feature from the obtained noise speech sample; inputting the extracted third training voice features into an identity feature extraction model to obtain a training overall transformation matrix corresponding to the identity features of the acoustic identity of the speaker; extracting training identity characteristic parameters from the voice according to the training overall transformation matrix; and adjusting the identity feature extraction model according to the difference between the training identity feature parameters and the preset target identity feature parameters until the training stopping condition is met.
In one embodiment, the processing module 1210 is further configured to normalize the stitching feature; processing the splicing characteristics obtained by normalization processing through a speaker-independent voice enhancement model; performing inverse normalization processing on the output processed by the voice enhancement model; and converting the result after the inverse normalization processing to obtain the time domain target voice after voice enhancement.
In the above embodiment, before the splicing feature is input into the speaker-independent speech enhancement model, normalization processing is performed on the splicing feature, so that the convergence speed of the speech enhancement model is increased when the splicing feature is processed, the time for obtaining the time domain target speech is shortened, and the speech enhancement efficiency is improved.
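The final conversion back to a time-domain signal can be sketched as follows. Reusing the phase of the noisy input and reconstructing by overlap-add is a common choice that is assumed here rather than mandated by the disclosure, and the frame sizes are the same assumed values as in the earlier feature-extraction sketch.

```python
import numpy as np

def reconstruct_waveform(enhanced_log_power, noisy_phase, frame_len=400, hop=160):
    """Inverse feature transform sketch: enhanced log power spectrum plus the
    noisy phase -> per-frame inverse FFT -> overlap-add time-domain signal."""
    magnitude = np.sqrt(np.exp(enhanced_log_power))
    frames = np.fft.irfft(magnitude * np.exp(1j * noisy_phase), n=frame_len, axis=1)
    out = np.zeros(hop * (len(frames) - 1) + frame_len)
    for i, frame in enumerate(frames):                 # overlap-add synthesis
        out[i * hop:i * hop + frame_len] += frame
    return out
```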
As shown in fig. 14, in one embodiment, there is provided a model training apparatus, which specifically includes: a sample acquisition module 1402, a training feature extraction module 1404, a training feature stitching module 1406, and a training module 1408; wherein:
a sample acquisition module 1402 for acquiring noise speech samples and noise-free speech samples; the noise voice sample and the noiseless voice sample correspond to the same speaker;
a training feature extraction module 1404, configured to extract training speech features from the noise speech samples; extracting training reference voice characteristics and training identity characteristics for identifying the acoustic identity of a speaker from the noiseless voice sample;
a training feature stitching module 1406, configured to stitch the training speech feature and the training identity feature to obtain a training stitching feature;
training module 1408 is configured to train the speaker independent speech enhancement model with the training concatenation feature as a training input and the training reference speech feature as a training output.
In the above embodiment, the training speech features are extracted from the noise speech samples, and the training reference speech features and the training identity features for identifying the acoustic identity of the speaker are extracted from the noise-free speech samples, so as to obtain training splice features formed by splicing the training speech features and the training identity features. The training splice features are used as the training input and the training reference speech features as the training output to train the speaker-independent speech enhancement model, so that the trained speech enhancement model can predict the identity feature of the speaker's acoustic identity in subsequently input splice features. This avoids the problem in the traditional scheme that the obtained speech is of poor quality because the SI model is not trained on the speaker's own speech, and improves the quality of the target speech obtained after speech enhancement.
In one embodiment, training feature extraction module 1404 is also used to extract training reference speech features in the noiseless speech samples; and processing the training reference voice features through an identity feature extraction model, and performing dimension reduction on the processed result to obtain training identity features for identifying the acoustic identity of the speaker.
In the above embodiment, the training identity feature for identifying the acoustic identity of the speaker and the training reference voice feature are obtained, so that the training identity feature can be spliced with the training voice feature and input into the voice enhancement model, with the training reference voice feature serving as the training target, thereby obtaining a voice enhancement model capable of predicting identity features.
FIG. 15 illustrates an internal block diagram of a computer device in one embodiment. As shown in fig. 15, the computer device includes a processor, a memory, a network interface, an input device, and a display screen connected by a system bus. The memory includes a nonvolatile storage medium and an internal memory. The nonvolatile storage medium of the computer device stores an operating system and may also store a computer program which, when executed by the processor, causes the processor to implement the speech enhancement method. The internal memory may also store a computer program which, when executed by the processor, causes the processor to perform the speech enhancement method. The display screen of the computer device may be a liquid crystal display or an electronic ink display. The input device of the computer device may be a touch layer covering the display screen, keys, a trackball or a touch pad arranged on the housing of the computer device, or an external keyboard, touch pad, mouse, or the like.
It will be appreciated by those skilled in the art that the structure shown in fig. 15 is merely a block diagram of part of the structure relevant to the solution of the present application and does not limit the computer device to which the solution is applied; a particular computer device may include more or fewer components than shown, combine certain components, or have a different arrangement of components.
In one embodiment, the speech enhancement apparatus provided by the present application may be implemented in the form of a computer program that is executable on a computer device as shown in fig. 15. The memory of the computer device may store the various program modules that make up the speech enhancement apparatus, such as the speech acquisition module 1202, the speech feature extraction module 1204, the identity feature determination module 1206, the feature stitching module 1208, and the processing module 1210. The computer program of each program module causes a processor to execute the steps of the speech enhancement method of each embodiment of the present application described in the present specification.
For example, the computer apparatus shown in fig. 15 may perform S102 by the voice acquisition module 1202 in the voice enhancement device as shown in fig. 12. The computer device may perform S104 through the speech feature extraction module 1204. The computer device may perform S106 through the identity determination module 1206. The computer device may perform S108 through the feature stitching module 1208. The computer device may perform S110 through the processing module 1210.
In one embodiment, a computer device is provided comprising a memory and a processor, the memory storing a computer program that, when executed by the processor, causes the processor to perform the steps of: acquiring voice; extracting voice features from voice; determining an identity feature for identifying the acoustic identity of the speaker according to the voice; splicing the voice features and the identity features to obtain spliced features; and processing the spliced characteristic through a speaker-independent voice enhancement model to obtain the target voice subjected to voice enhancement.
In one embodiment, the speech enhancement model is trained based on training splice features; training splice characteristics are formed by splicing training voice characteristics extracted from noise voice samples and training identity characteristics extracted from noise-free voice samples; the noisy speech samples and the noiseless speech samples correspond to the same speaker.
In one embodiment, the computer program, when executed by the processor, causes the processor to perform the steps of: framing and windowing the voice; converting each frame of voice obtained after processing to obtain the frequency spectrum of each frame of voice; the speech characteristics are determined from the spectrum of each frame of speech.
In one embodiment, the computer program, when executed by the processor, causes the processor to perform the step of determining speech features from the spectrum of each frame of speech, specifically performing the steps of: determining a power spectrum according to the frequency spectrum of each frame of voice; obtaining a logarithmic power spectrum corresponding to the power spectrum; the log power spectrum is determined as a speech feature, or the result of discrete cosine transform of the log power spectrum is determined as a speech feature.
In one embodiment, the computer program, when executed by the processor, causes the processor to further perform the steps of: acquiring a noise voice sample and a noiseless voice sample; the noise voice sample and the noiseless voice sample correspond to the same speaker; extracting training voice characteristics from noise voice samples; extracting training reference voice characteristics and training identity characteristics for identifying the acoustic identity of a speaker from a noise-free voice sample; splicing training voice characteristics and training identity characteristics to obtain training splicing characteristics; training splice features are used as training inputs, training reference voice features are used as training outputs, and a voice enhancement model is trained.
In one embodiment, the computer program, when executed by the processor, causes the processor to perform the step of determining an identity feature for identifying the acoustic identity of the speaker from the speech, specifically performing the steps of: processing the extracted voice features through an identity feature extraction model to obtain an overall transformation matrix corresponding to the identity features of the speaker acoustic identity; extracting identity characteristic parameters from the voice according to the overall transformation matrix; and reducing the dimension of the extracted identity characteristic parameters to obtain the identity characteristics for identifying the acoustic identity of the speaker.
In one embodiment, the computer program, when executed by the processor, causes the processor to further perform the steps of: extracting a third training voice feature from the acquired noise voice sample; inputting the extracted third training voice features into an identity feature extraction model to obtain a training overall transformation matrix corresponding to the identity features of the acoustic identity of the speaker; extracting training identity characteristic parameters from the voice according to the training overall transformation matrix; and adjusting the identity characteristic extraction model according to the difference between the training identity characteristic parameter and the preset target identity characteristic parameter until the training stopping condition is met.
In one embodiment, the computer program, when executed by the processor, causes the processor to perform the step of obtaining speech-enhanced target speech by processing the splice features through a speaker-independent speech enhancement model, to specifically perform the steps of: normalizing the splicing characteristics; processing the splicing characteristics obtained by normalization processing through a speaker-independent voice enhancement model; performing inverse normalization processing on the output processed by the voice enhancement model; and converting the result after the inverse normalization processing to obtain the time domain target voice after voice enhancement.
In one embodiment, a computer readable storage medium is provided, storing a computer program which, when executed by a processor, causes the processor to perform the steps of: acquiring voice; extracting voice features from voice; determining an identity feature for identifying the acoustic identity of the speaker according to the voice; splicing the voice features and the identity features to obtain spliced features; and processing the spliced characteristic through a speaker-independent voice enhancement model to obtain the target voice subjected to voice enhancement.
In one embodiment, the speech enhancement model is trained based on training splice features; training splice characteristics are formed by splicing training voice characteristics extracted from noise voice samples and training identity characteristics extracted from noise-free voice samples; the noisy speech samples and the noiseless speech samples correspond to the same speaker.
In one embodiment, the computer program, when executed by the processor, causes the processor to perform the steps of: framing and windowing the voice; converting each frame of voice obtained after processing to obtain the frequency spectrum of each frame of voice; the speech characteristics are determined from the spectrum of each frame of speech.
In one embodiment, the computer program, when executed by the processor, causes the processor to perform the step of determining speech features from the spectrum of each frame of speech, specifically performing the steps of: determining a power spectrum according to the frequency spectrum of each frame of voice; obtaining a logarithmic power spectrum corresponding to the power spectrum; the log power spectrum is determined as a speech feature, or the result of discrete cosine transform of the log power spectrum is determined as a speech feature.
In one embodiment, the computer program, when executed by the processor, causes the processor to further perform the steps of: acquiring a noise voice sample and a noiseless voice sample; the noise voice sample and the noiseless voice sample correspond to the same speaker; extracting training voice characteristics from noise voice samples; extracting training reference voice characteristics and training identity characteristics for identifying the acoustic identity of a speaker from a noise-free voice sample; splicing training voice characteristics and training identity characteristics to obtain training splicing characteristics; training splice features are used as training inputs, training reference voice features are used as training outputs, and a voice enhancement model is trained.
In one embodiment, the computer program, when executed by the processor, causes the processor to perform the step of determining an identity feature for identifying the acoustic identity of the speaker from the speech, specifically performing the steps of: processing the extracted voice features through an identity feature extraction model to obtain an overall transformation matrix corresponding to the identity features of the speaker acoustic identity; extracting identity characteristic parameters from the voice according to the overall transformation matrix; and reducing the dimension of the extracted identity characteristic parameters to obtain the identity characteristics for identifying the acoustic identity of the speaker.
In one embodiment, the computer program, when executed by the processor, causes the processor to further perform the steps of: extracting a third training voice feature from the acquired noise voice sample; inputting the extracted third training voice features into an identity feature extraction model to obtain a training overall transformation matrix corresponding to the identity features of the acoustic identity of the speaker; extracting training identity characteristic parameters from the voice according to the training overall transformation matrix; and adjusting the identity characteristic extraction model according to the difference between the training identity characteristic parameter and the preset target identity characteristic parameter until the training stopping condition is met.
In one embodiment, the computer program, when executed by the processor, causes the processor to perform the step of obtaining speech-enhanced target speech by processing the splice features through a speaker-independent speech enhancement model, to specifically perform the steps of: normalizing the splicing characteristics; processing the splicing characteristics obtained by normalization processing through a speaker-independent voice enhancement model; performing inverse normalization processing on the output processed by the voice enhancement model; and converting the result after the inverse normalization processing to obtain the time domain target voice after voice enhancement.
FIG. 16 illustrates an internal block diagram of a computer device in one embodiment. As shown in fig. 16, the computer device includes a processor, a memory, a network interface, an input device, and a display screen connected by a system bus. The memory includes a nonvolatile storage medium and an internal memory. The nonvolatile storage medium of the computer device stores an operating system and may also store a computer program which, when executed by the processor, causes the processor to implement the model training method. The internal memory may also store a computer program which, when executed by the processor, causes the processor to perform the model training method. The display screen of the computer device may be a liquid crystal display or an electronic ink display. The input device of the computer device may be a touch layer covering the display screen, keys, a trackball or a touch pad arranged on the housing of the computer device, or an external keyboard, touch pad, mouse, or the like.
It will be appreciated by those skilled in the art that the structure shown in fig. 16 is merely a block diagram of part of the structure relevant to the solution of the present application and does not limit the computer device to which the solution is applied; a particular computer device may include more or fewer components than shown, combine certain components, or have a different arrangement of components.
In one embodiment, the model training apparatus provided by the present application may be implemented in the form of a computer program that is executable on a computer device as shown in fig. 16. The memory of the computer device may store the various program modules that make up the model training apparatus, such as the sample acquisition module 1402, training feature extraction module 1404, training feature stitching module 1406, and training module 1408 shown in FIG. 14. The computer program of each program module causes the processor to carry out the steps in the model training method of each embodiment of the present application described in the present specification.
For example, the computer apparatus shown in fig. 16 may perform S1002 by the sample acquisition module 1402 in the model training apparatus as shown in fig. 14. The computer device may perform S1004 and S1006 by training feature extraction module 1404. The computer device may perform steps S1008 and S1010 through the training feature stitching module 1406. The computer device may perform S1012 through training module 1408.
In one embodiment, a computer device is provided comprising a memory and a processor, the memory storing a computer program that, when executed by the processor, causes the processor to perform the steps of: acquiring a noise voice sample and a noiseless voice sample; the noise voice sample and the noiseless voice sample correspond to the same speaker; extracting training voice characteristics from noise voice samples; extracting training reference voice characteristics and training identity characteristics for identifying the acoustic identity of a speaker from a noise-free voice sample; splicing training voice characteristics and training identity characteristics to obtain training splicing characteristics; training splice features are used as training inputs, training reference speech features are used as training outputs, and a speaker-independent speech enhancement model is trained.
In one embodiment, the computer program, when executed by the processor, causes the processor to perform the steps of extracting training reference speech features and training identity features for identifying the acoustic identity of a speaker in a noise-free speech sample, specifically performing the steps of: extracting training reference voice characteristics from a noise-free voice sample; and processing training reference voice characteristics through the identity characteristic extraction model, and reducing the dimension of the processed result to obtain training identity characteristics for identifying the acoustic identity of the speaker.
In one embodiment, a computer readable storage medium is provided, storing a computer program which, when executed by a processor, causes the processor to perform the steps of: acquiring a noise voice sample and a noiseless voice sample; the noise voice sample and the noiseless voice sample correspond to the same speaker; extracting training voice characteristics from noise voice samples; extracting training reference voice characteristics and training identity characteristics for identifying the acoustic identity of a speaker from a noise-free voice sample; splicing training voice characteristics and training identity characteristics to obtain training splicing characteristics; training splice features are used as training inputs, training reference speech features are used as training outputs, and a speaker-independent speech enhancement model is trained.
In one embodiment, the computer program, when executed by the processor, causes the processor to perform the steps of extracting training reference speech features and training identity features for identifying the acoustic identity of a speaker in a noise-free speech sample, specifically performing the steps of: extracting training reference voice characteristics from a noise-free voice sample; and processing training reference voice characteristics through the identity characteristic extraction model, and reducing the dimension of the processed result to obtain training identity characteristics for identifying the acoustic identity of the speaker.
Those skilled in the art will appreciate that all or part of the processes in the methods of the above embodiments may be implemented by a computer program instructing relevant hardware, where the program may be stored in a non-volatile computer-readable storage medium and, when executed, may include the processes of the embodiments of the methods described above. Any reference to memory, storage, database, or other medium used in the embodiments provided herein may include non-volatile and/or volatile memory. The non-volatile memory can include Read-Only Memory (ROM), Programmable ROM (PROM), Electrically Programmable ROM (EPROM), Electrically Erasable Programmable ROM (EEPROM), or flash memory. Volatile memory can include Random Access Memory (RAM) or external cache memory. By way of illustration and not limitation, RAM is available in a variety of forms such as Static RAM (SRAM), Dynamic RAM (DRAM), Synchronous DRAM (SDRAM), Double Data Rate SDRAM (DDR SDRAM), Enhanced SDRAM (ESDRAM), Synchronous Link DRAM (SLDRAM), Rambus Direct RAM (RDRAM), Direct Rambus Dynamic RAM (DRDRAM), and Rambus Dynamic RAM (RDRAM), among others.
The technical features of the above embodiments may be combined arbitrarily. For brevity, not all possible combinations of the technical features in the above embodiments are described; however, as long as there is no contradiction in a combination of these technical features, it should be considered to be within the scope of this specification.
The foregoing examples illustrate only a few embodiments of the application and are described in detail, but they are not to be construed as limiting the scope of the application. It should be noted that several variations and modifications can be made by those skilled in the art without departing from the spirit of the application, all of which fall within the scope of the application. Accordingly, the scope of protection of the present application shall be determined by the appended claims.

Claims (22)

1. A method of speech enhancement, comprising:
acquiring voice;
extracting speech features from the speech;
determining an identity feature for identifying the acoustic identity of the speaker according to the voice;
splicing the voice features and the identity features to obtain spliced features;
processing the splicing characteristics through a speaker-independent voice enhancement model to obtain voice enhanced target voice; the voice enhancement model is trained according to training splicing characteristics; the training splice features are formed by splicing training voice features extracted from noise voice samples and training identity features extracted from noise-free voice samples; the noise speech samples and the noiseless speech samples correspond to the same speaker.
2. The method of claim 1, wherein the obtaining speech comprises:
receiving voice carrying noise sent by a remote device; or,
picking up, by a microphone, speech carrying background noise in the environment.
3. The method of claim 1, wherein the extracting speech features from the speech comprises:
framing and windowing the voice;
converting each frame of voice obtained after processing to obtain the frequency spectrum of each frame of voice;
and determining the voice characteristics according to the frequency spectrums of the voice frames.
4. A method according to claim 3, wherein said determining speech features from the spectrum of each of said frames of speech comprises:
determining a power spectrum according to the frequency spectrum of each frame of voice;
obtaining a logarithmic power spectrum corresponding to the power spectrum;
and determining the logarithmic power spectrum as a voice characteristic or determining a result obtained by discrete cosine transform of the logarithmic power spectrum as a voice characteristic.
5. The method according to claim 1, wherein the method further comprises:
acquiring a noise voice sample and a noiseless voice sample; the noise voice sample and the noiseless voice sample correspond to the same speaker;
extracting training voice characteristics from the noise voice samples;
extracting training reference voice characteristics and training identity characteristics for identifying the acoustic identity of a speaker from the noiseless voice sample;
splicing the training voice characteristics and the training identity characteristics to obtain training splicing characteristics;
and training the voice enhancement model by taking the training splicing characteristics as training input and the training reference voice characteristics as training output.
6. The method of any of claims 1 to 5, wherein said determining an identity feature for identifying an acoustic identity of a speaker from said speech comprises:
processing the extracted voice features through an identity feature extraction model to obtain an overall transformation matrix corresponding to the identity features of the speaker acoustic identity;
extracting identity characteristic parameters from the voice according to the overall transformation matrix;
and reducing the dimension of the extracted identity characteristic parameters to obtain the identity characteristics for identifying the acoustic identity of the speaker.
7. The method of claim 6, wherein the method further comprises:
extracting a third training voice feature from the acquired noise voice sample;
inputting the extracted third training voice features into an identity feature extraction model to obtain a training overall transformation matrix corresponding to the identity features of the acoustic identity of the speaker;
extracting training identity characteristic parameters from the voice according to the training overall transformation matrix;
and adjusting the identity feature extraction model according to the difference between the training identity feature parameters and the preset target identity feature parameters until the training stopping condition is met.
8. The method of any one of claims 1 to 5, wherein the processing the splice features through a speaker independent speech enhancement model to obtain speech enhanced target speech comprises:
normalizing the spliced characteristics;
processing the splicing characteristics obtained by normalization processing through a speaker-independent voice enhancement model;
performing inverse normalization processing on the output processed by the voice enhancement model;
and converting the result after the inverse normalization processing to obtain the time domain target voice after voice enhancement.
9. A model training method, comprising:
acquiring a noise voice sample and a noiseless voice sample; the noise voice sample and the noiseless voice sample correspond to the same speaker;
extracting training voice characteristics from the noise voice samples;
extracting training reference voice characteristics and training identity characteristics for identifying the acoustic identity of a speaker from the noiseless voice sample;
splicing the training voice characteristics and the training identity characteristics to obtain training splicing characteristics;
and training a speaker-independent voice enhancement model by taking the training splicing characteristics as training input and the training reference voice characteristics as training output.
10. The method of claim 9, wherein the extracting training reference speech features and training identity features for identifying a speaker's acoustic identity in the noiseless speech samples comprises:
extracting training reference voice characteristics from the noiseless voice sample;
and processing the training reference voice features through an identity feature extraction model, and performing dimension reduction on the processed result to obtain training identity features for identifying the acoustic identity of the speaker.
11. A speech enhancement apparatus comprising:
the voice acquisition module is used for acquiring voice;
a voice feature extraction module for extracting voice features from the voice;
the identity feature determining module is used for determining identity features for identifying the acoustic identity of the speaker according to the voice;
the characteristic splicing module is used for splicing the voice characteristic and the identity characteristic to obtain a spliced characteristic;
the processing module is used for processing the splicing characteristics through a speaker-independent voice enhancement model to obtain voice enhanced target voice; the voice enhancement model is trained according to training splicing characteristics; the training splice features are formed by splicing training voice features extracted from noise voice samples and training identity features extracted from noise-free voice samples; the noise speech samples and the noiseless speech samples correspond to the same speaker.
12. The apparatus of claim 11, wherein the speech feature extraction module is further configured to frame and window the speech; converting each frame of voice obtained after processing to obtain the frequency spectrum of each frame of voice; and determining the voice characteristics according to the frequency spectrums of the voice frames.
13. The apparatus of claim 12, wherein the speech feature extraction module is further configured to determine a power spectrum from the spectrum of each frame of speech; obtaining a logarithmic power spectrum corresponding to the power spectrum; and determining the logarithmic power spectrum as a voice characteristic or determining a result obtained by discrete cosine transform of the logarithmic power spectrum as a voice characteristic.
14. The apparatus of claim 11, wherein the voice acquisition module is further configured to receive noise-carrying voice from a remote device; alternatively, speech in the environment carrying background noise is picked up by a microphone.
15. The apparatus of claim 11, wherein the apparatus further comprises:
the sample acquisition module is used for acquiring noise voice samples and noiseless voice samples; the noise voice sample and the noiseless voice sample correspond to the same speaker;
the training feature extraction module is used for extracting training voice features from the noise voice samples; extracting training reference voice characteristics and training identity characteristics for identifying the acoustic identity of a speaker from the noiseless voice sample;
the training characteristic splicing module is used for splicing the training voice characteristics and the training identity characteristics to obtain training splicing characteristics;
the first training module is used for taking the training splicing characteristics as training input, taking the training reference voice characteristics as training output and training the voice enhancement model.
16. The apparatus according to any one of claims 11 to 15, wherein the identity feature determination module is further configured to process the extracted speech features by means of an identity feature extraction model to obtain an overall transformation matrix corresponding to the identity features of the speaker's acoustic identity; extracting identity characteristic parameters from the voice according to the overall transformation matrix; and reducing the dimension of the extracted identity characteristic parameters to obtain the identity characteristics for identifying the acoustic identity of the speaker.
17. The apparatus of claim 16, wherein the apparatus further comprises:
the second training module is used for extracting third training voice characteristics from the acquired noise voice samples; inputting the extracted third training voice features into an identity feature extraction model to obtain a training overall transformation matrix corresponding to the identity features of the acoustic identity of the speaker; extracting training identity characteristic parameters from the voice according to the training overall transformation matrix; and adjusting the identity feature extraction model according to the difference between the training identity feature parameters and the preset target identity feature parameters until the training stopping condition is met.
18. The apparatus according to any one of claims 11 to 15, wherein the processing module is further configured to normalize the stitching feature; processing the splicing characteristics obtained by normalization processing through a speaker-independent voice enhancement model; performing inverse normalization processing on the output processed by the voice enhancement model; and converting the result after the inverse normalization processing to obtain the time domain target voice after voice enhancement.
19. A model training apparatus comprising:
the sample acquisition module is used for acquiring noise voice samples and noiseless voice samples; the noise voice sample and the noiseless voice sample correspond to the same speaker;
the training feature extraction module is used for extracting training voice features from the noise voice samples; extracting training reference voice characteristics and training identity characteristics for identifying the acoustic identity of a speaker from the noiseless voice sample;
the training characteristic splicing module is used for splicing the training voice characteristics and the training identity characteristics to obtain training splicing characteristics;
and the training module is used for taking the training splicing characteristics as training input, taking the training reference voice characteristics as training output and training a speaker-independent voice enhancement model.
20. The apparatus of claim 19, wherein the training feature extraction module is further configured to extract training reference speech features in the noiseless speech samples; and processing the training reference voice features through an identity feature extraction model, and performing dimension reduction on the processed result to obtain training identity features for identifying the acoustic identity of the speaker.
21. A computer device comprising a memory and a processor, the memory storing a computer program that, when executed by the processor, causes the processor to perform the steps of the method of any of claims 1 to 10.
22. A storage medium storing a computer program which, when executed by a processor, causes the processor to perform the steps of the method of any one of claims 1 to 10.
CN201810911283.3A 2018-08-10 2018-08-10 Speech enhancement method, model training method, device and computer equipment Active CN110176243B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810911283.3A CN110176243B (en) 2018-08-10 2018-08-10 Speech enhancement method, model training method, device and computer equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201810911283.3A CN110176243B (en) 2018-08-10 2018-08-10 Speech enhancement method, model training method, device and computer equipment

Publications (2)

Publication Number Publication Date
CN110176243A CN110176243A (en) 2019-08-27
CN110176243B true CN110176243B (en) 2023-10-31

Family

ID=67689079

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810911283.3A Active CN110176243B (en) 2018-08-10 2018-08-10 Speech enhancement method, model training method, device and computer equipment

Country Status (1)

Country Link
CN (1) CN110176243B (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110875037A (en) * 2019-11-19 2020-03-10 腾讯科技(深圳)有限公司 Voice data processing method and device and electronic equipment
CN117059068A (en) * 2022-05-07 2023-11-14 腾讯科技(深圳)有限公司 Speech processing method, device, storage medium and computer equipment
CN116741193B (en) * 2023-08-09 2023-11-14 腾讯科技(深圳)有限公司 Training method and device for voice enhancement network, storage medium and computer equipment

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2015017060A1 (en) * 2013-07-31 2015-02-05 Google Inc. Speech recognition using neural networks
CN105139857A (en) * 2015-09-02 2015-12-09 广东顺德中山大学卡内基梅隆大学国际联合研究院 Countercheck method for automatically identifying speaker aiming to voice deception
CN105590625A (en) * 2016-03-18 2016-05-18 上海语知义信息技术有限公司 Acoustic model self-adaptive method and system
CN105933323A (en) * 2016-06-01 2016-09-07 百度在线网络技术(北京)有限公司 Voiceprint register and authentication method and device
CN107564513A (en) * 2016-06-30 2018-01-09 阿里巴巴集团控股有限公司 Audio recognition method and device
CN107785015A (en) * 2016-08-26 2018-03-09 阿里巴巴集团控股有限公司 A kind of audio recognition method and device
CN108109613A (en) * 2017-12-12 2018-06-01 苏州思必驰信息科技有限公司 For the audio training of Intelligent dialogue voice platform and recognition methods and electronic equipment

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
DK2713367T3 (en) * 2012-09-28 2017-02-20 Agnitio S L Speech Recognition
US9665823B2 (en) * 2013-12-06 2017-05-30 International Business Machines Corporation Method and system for joint training of hybrid neural networks for acoustic modeling in automatic speech recognition
US10410638B2 (en) * 2015-02-27 2019-09-10 Samsung Electronics Co., Ltd. Method and device for transforming feature vector for user recognition

Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2015017060A1 (en) * 2013-07-31 2015-02-05 Google Inc. Speech recognition using neural networks
CN105139857A (en) * 2015-09-02 2015-12-09 广东顺德中山大学卡内基梅隆大学国际联合研究院 Countercheck method for automatically identifying speaker aiming to voice deception
CN105590625A (en) * 2016-03-18 2016-05-18 上海语知义信息技术有限公司 Acoustic model self-adaptive method and system
CN105933323A (en) * 2016-06-01 2016-09-07 百度在线网络技术(北京)有限公司 Voiceprint register and authentication method and device
CN107564513A (en) * 2016-06-30 2018-01-09 阿里巴巴集团控股有限公司 Audio recognition method and device
CN107785015A (en) * 2016-08-26 2018-03-09 阿里巴巴集团控股有限公司 A kind of audio recognition method and device
CN108109613A (en) * 2017-12-12 2018-06-01 苏州思必驰信息科技有限公司 For the audio training of Intelligent dialogue voice platform and recognition methods and electronic equipment

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Md Jahangir Alam. Speech recognition in reverberant and noisy environments employing multiple feature extractors and i-vector speaker adaptation. EURASIP Journal on Advances in Signal Processing, 2015, full text. *
彭亘昌; 姚干; 周凯波; 何顶新. Isolated-word speech recognition based on a one-dimensional convolutional neural network and i-vector. Information & Computer (Theoretical Edition), 2018, Issue 04, full text. *

Also Published As

Publication number Publication date
CN110176243A (en) 2019-08-27

Similar Documents

Publication Publication Date Title
CN108198547B (en) Voice endpoint detection method and device, computer equipment and storage medium
Al-Ali et al. Enhanced forensic speaker verification using a combination of DWT and MFCC feature warping in the presence of noise and reverberation conditions
WO2021196905A1 (en) Voice signal dereverberation processing method and apparatus, computer device and storage medium
El-Moneim et al. Text-independent speaker recognition using LSTM-RNN and speech enhancement
US7133826B2 (en) Method and apparatus using spectral addition for speaker recognition
US7197456B2 (en) On-line parametric histogram normalization for noise robust speech recognition
JP3584458B2 (en) Pattern recognition device and pattern recognition method
CN108108357B (en) Accent conversion method and device and electronic equipment
CN108922544B (en) Universal vector training method, voice clustering method, device, equipment and medium
CN110176243B (en) Speech enhancement method, model training method, device and computer equipment
CN110853664B (en) Method and device for evaluating performance of speech enhancement algorithm and electronic equipment
CN110930978A (en) Language identification method and device and language identification device
CN111785288A (en) Voice enhancement method, device, equipment and storage medium
CN112053702B (en) Voice processing method and device and electronic equipment
CN114333865A (en) Model training and tone conversion method, device, equipment and medium
JP2002268698A (en) Voice recognition device, device and method for standard pattern generation, and program
Patel et al. Optimize approach to voice recognition using iot
KR20170088165A (en) Method and apparatus for speech recognition using deep neural network
CN110875037A (en) Voice data processing method and device and electronic equipment
Han et al. Reverberation and noise robust feature compensation based on IMM
CN114827363A (en) Method, device and readable storage medium for eliminating echo in call process
Bhukya et al. End point detection using speech-specific knowledge for text-dependent speaker verification
CN112002307B (en) Voice recognition method and device
Abka et al. Speech recognition features: Comparison studies on robustness against environmental distortions
Abdulqader et al. Hybrid feature extraction MFCC and feature selection CNN for speaker identification using CNN: a comparative study

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant