CN117877507A - Speech signal enhancement method, device, electronic equipment and storage medium

Speech signal enhancement method, device, electronic equipment and storage medium

Info

Publication number
CN117877507A
CN117877507A
Authority
CN
China
Prior art keywords
voice
target
voice signal
information
spectrum
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202410005673.XA
Other languages
Chinese (zh)
Inventor
韩润强
赵昊然
吕新亮
张晨
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Dajia Internet Information Technology Co Ltd
Original Assignee
Beijing Dajia Internet Information Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Dajia Internet Information Technology Co Ltd filed Critical Beijing Dajia Internet Information Technology Co Ltd
Priority to CN202410005673.XA priority Critical patent/CN117877507A/en
Publication of CN117877507A publication Critical patent/CN117877507A/en
Pending legal-status Critical Current

Landscapes

  • Cable Transmission Systems, Equalization Of Radio And Reduction Of Echo (AREA)

Abstract

The present disclosure relates to a speech signal enhancement method, apparatus, electronic device, storage medium and computer program product. The method includes: acquiring a voice signal set, a reference signal corresponding to the voice signal set and an initial enhanced voice signal; inputting the spectrum information of each voice signal in the voice signal set, the spectrum information of the reference signal and the spectrum information of the initial enhanced voice signal into a trained first voice enhancement model to obtain target spectrum information, where the number of pieces of target spectrum information is smaller than the number of pieces of spectrum information input to the first voice enhancement model; inputting a target amplitude spectrum in the target spectrum information into a trained second voice enhancement model to obtain voice masking information; and transforming the spectrum information of a target voice signal in the voice signal set according to the voice masking information to obtain a target enhanced voice signal corresponding to the voice signal set. With this method, the computational complexity of voice signal enhancement can be reduced.

Description

Speech signal enhancement method, device, electronic equipment and storage medium
Technical Field
The present disclosure relates to the field of speech processing technology, and in particular, to a speech signal enhancement method, apparatus, electronic device, storage medium, and computer program product.
Background
With the development of speech processing technology, speech signals are generally collected by a microphone array in a conference room. In order to improve the quality of the speech signal, enhancement processing is required for the speech signal.
In the related art, speech signal enhancement methods mainly pass the spectrum information of each speech signal collected by the microphone array through a fully deep-learning network, which performs a series of processing steps to obtain complex masking information (such as complex masking values) for each speech signal, and then combine the spectrum information of each speech signal to obtain an enhanced speech signal. However, the network needs to process the spectrum information of every voice signal and output complex masking information for each of them, and obtaining the final enhanced voice signal additionally requires combining the spectrum information of all the voice signals, so the computational complexity of voice signal enhancement is high.
Disclosure of Invention
The present disclosure provides a method, an apparatus, an electronic device, a storage medium, and a computer program product for enhancing a speech signal, so as to at least solve the problem of high computational complexity during speech signal enhancement in the related art. The technical scheme of the present disclosure is as follows:
According to a first aspect of embodiments of the present disclosure, there is provided a speech signal enhancement method, including:
acquiring a voice signal set, a reference signal corresponding to the voice signal set and an initial enhanced voice signal corresponding to the voice signal set;
inputting the spectrum information of each voice signal in the voice signal set, the spectrum information of the reference signal and the spectrum information of the initial enhanced voice signal into a trained first voice enhancement model to obtain target spectrum information; the number of pieces of target spectrum information is smaller than the number of pieces of spectrum information input to the first voice enhancement model;
inputting a target amplitude spectrum in the target frequency spectrum information into a trained second voice enhancement model to obtain voice masking information;
and according to the voice masking information, carrying out conversion processing on the spectrum information of the target voice signals in the voice signal set to obtain target enhanced voice signals corresponding to the voice signal set.
In an exemplary embodiment, the inputting the target magnitude spectrum in the target spectrum information into the trained second speech enhancement model to obtain speech masking information includes:
inputting the target amplitude spectrum into a trained second voice enhancement model for first feature extraction processing to obtain initial audio features of the target amplitude spectrum;
performing second feature extraction processing on the initial audio features to obtain target audio features of the target amplitude spectrum;
and classifying the target audio features to obtain the voice masking information.
In an exemplary embodiment, the trained second speech enhancement model includes a speech branching network and an interference branching network;
the step of inputting the target amplitude spectrum into a trained second voice enhancement model for first feature extraction processing to obtain initial audio features of the target amplitude spectrum comprises the following steps:
inputting the target amplitude spectrum into the voice branch network for feature extraction processing to obtain a first audio feature of the target amplitude spectrum, and inputting the target amplitude spectrum into the interference branch network for feature extraction processing to obtain a second audio feature of the target amplitude spectrum;
performing fusion processing on the first audio feature and the second audio feature to obtain a first fusion audio feature;
and inputting the first fusion audio features into the voice branch network to perform feature extraction processing to obtain initial audio features of the target amplitude spectrum.
In an exemplary embodiment, the performing a second feature extraction process on the initial audio feature to obtain a target audio feature of the target amplitude spectrum includes:
inputting the initial audio features into the voice branch network for feature extraction processing to obtain a third audio feature of the target amplitude spectrum, and inputting the first fused audio features into the interference branch network for feature extraction processing to obtain a fourth audio feature of the target amplitude spectrum;
performing fusion processing on the third audio feature and the fourth audio feature to obtain a second fusion audio feature;
and inputting the second fusion audio features into the voice branch network for feature extraction processing to obtain target audio features of the target amplitude spectrum.
In an exemplary embodiment, the transforming, according to the speech masking information, the spectrum information of the target speech signal in the speech signal set to obtain the target enhanced speech signal corresponding to the speech signal set includes:
performing fusion processing on the spectrum information of the target voice signal in the voice signal set and the voice masking information to obtain fused spectrum information of the target voice signal;
and carrying out transformation processing on the fused spectrum information to obtain a target enhanced voice signal corresponding to the voice signal set.
In an exemplary embodiment, the initial enhanced speech signal is obtained by:
inputting each voice signal and the reference signal in the voice signal set into a trained third voice enhancement model to obtain the initial enhanced voice signal;
the target amplitude spectrum in the target spectrum information is obtained by the following steps:
extracting an initial amplitude spectrum in the target spectrum information;
and converting the initial amplitude spectrum to obtain the target amplitude spectrum.
In an exemplary embodiment, the trained first speech enhancement model and the trained second speech enhancement model are trained by:
acquiring a sample voice signal set, a sample reference signal corresponding to the sample voice signal set and a sample initial enhanced voice signal corresponding to the sample voice signal set;
inputting the spectrum information of each sample voice signal in the sample voice signal set, the spectrum information of the sample reference signal and the spectrum information of the sample initial enhanced voice signal into a first voice enhancement model to be trained to obtain sample target spectrum information;
inputting a sample target amplitude spectrum in the sample target frequency spectrum information into a second voice enhancement model to be trained to obtain predicted voice masking information and predicted interference masking information;
according to the predicted voice masking information, carrying out conversion processing on the spectrum information of the sample target voice signal in the sample voice signal set to obtain a predicted enhanced voice signal corresponding to the sample voice signal set, and according to the predicted interference masking information, carrying out conversion processing on the spectrum information of the sample target voice signal to obtain a predicted interference voice signal corresponding to the sample voice signal set;
and performing joint training on the first voice enhancement model to be trained and the second voice enhancement model to be trained according to the difference between the predicted enhanced voice signal and the clean voice signal corresponding to the sample voice signal set and the difference between the predicted interference voice signal and the interference voice signal corresponding to the sample voice signal set to obtain a first voice enhancement model after training and a second voice enhancement model after training.
According to a second aspect of embodiments of the present disclosure, there is provided a speech signal enhancement apparatus comprising:
a signal acquisition unit configured to perform acquisition of a set of speech signals, a reference signal corresponding to the set of speech signals, and an initial enhanced speech signal corresponding to the set of speech signals;
a first enhancement unit configured to perform inputting the spectral information of each speech signal in the speech signal set, the spectral information of the reference signal, and the spectral information of the initial enhanced speech signal into a trained first speech enhancement model to obtain target spectral information; the number of the target spectrum information is smaller than the number of spectrum information input to the first speech enhancement model;
a second enhancement unit configured to perform inputting a target amplitude spectrum in the target spectrum information into a trained second speech enhancement model to obtain speech masking information;
and the transformation processing unit is configured to perform transformation processing on the spectrum information of the target voice signals in the voice signal set according to the voice masking information to obtain target enhanced voice signals corresponding to the voice signal set.
In an exemplary embodiment, the second enhancement unit is further configured to perform a first feature extraction process of inputting the target amplitude spectrum into a trained second speech enhancement model, so as to obtain an initial audio feature of the target amplitude spectrum; performing second feature extraction processing on the initial audio features to obtain target audio features of the target amplitude spectrum; and classifying the target audio features to obtain the voice masking information.
In an exemplary embodiment, the trained second speech enhancement model includes a speech branching network and an interference branching network;
the second enhancement unit is further configured to perform feature extraction processing on the target amplitude spectrum input into the voice branch network to obtain a first audio feature of the target amplitude spectrum, and input the target amplitude spectrum into the interference branch network to perform feature extraction processing to obtain a second audio feature of the target amplitude spectrum; performing fusion processing on the first audio feature and the second audio feature to obtain a first fusion audio feature; and inputting the first fusion audio features into the voice branch network to perform feature extraction processing to obtain initial audio features of the target amplitude spectrum.
In an exemplary embodiment, the second enhancement unit is further configured to perform a feature extraction process for inputting the initial audio feature into the voice branch network to obtain a third audio feature of the target amplitude spectrum, and input the first fused audio feature into the interfering branch network to perform a feature extraction process to obtain a fourth audio feature of the target amplitude spectrum; performing fusion processing on the third audio feature and the fourth audio feature to obtain a second fusion audio feature; and inputting the second fusion audio features into the voice branch network for feature extraction processing to obtain target audio features of the target amplitude spectrum.
In an exemplary embodiment, the transformation processing unit is further configured to perform fusion processing on the spectrum information of the target voice signal in the voice signal set and the voice masking information, so as to obtain fused spectrum information of the target voice signal; and carrying out transformation processing on the fused spectrum information to obtain a target enhanced voice signal corresponding to the voice signal set.
In an exemplary embodiment, the apparatus further includes an initial enhancement unit configured to perform inputting each of the speech signals in the speech signal set and the reference signal into a trained third speech enhancement model to obtain the initial enhanced speech signal;
The apparatus further includes a conversion processing unit configured to perform extraction of an initial magnitude spectrum in the target spectrum information; and converting the initial amplitude spectrum to obtain the target amplitude spectrum.
In an exemplary embodiment, the apparatus further comprises a model training unit configured to perform obtaining a set of sample speech signals, a sample reference signal corresponding to the set of sample speech signals, and a sample initial enhanced speech signal corresponding to the set of sample speech signals; the spectrum information of each sample voice signal in the sample voice signal set, the spectrum information of the sample reference signal and the spectrum information of the sample initial enhancement voice signal are input into a first voice enhancement model to be trained, and sample target spectrum information is obtained; inputting a sample target amplitude spectrum in the sample target frequency spectrum information into a second voice enhancement model to be trained to obtain predicted voice masking information and predicted interference masking information; according to the predicted voice masking information, carrying out conversion processing on the spectrum information of the sample target voice signal in the sample voice signal set to obtain a predicted enhanced voice signal corresponding to the sample voice signal set, and according to the predicted interference masking information, carrying out conversion processing on the spectrum information of the sample target voice signal to obtain a predicted interference voice signal corresponding to the sample voice signal set; and performing joint training on the first voice enhancement model to be trained and the second voice enhancement model to be trained according to the difference between the predicted enhanced voice signal and the clean voice signal corresponding to the sample voice signal set and the difference between the predicted interference voice signal and the interference voice signal corresponding to the sample voice signal set to obtain a first voice enhancement model after training and a second voice enhancement model after training.
According to a third aspect of embodiments of the present disclosure, there is provided an electronic device, comprising:
a processor;
a memory for storing the processor-executable instructions;
wherein the processor is configured to execute the instructions to implement the speech signal enhancement method according to any of the preceding claims.
According to a fourth aspect of embodiments of the present disclosure, there is provided a computer-readable storage medium storing instructions which, when executed by a processor of an electronic device, cause the electronic device to perform the speech signal enhancement method as set forth in any one of the above.
According to a fifth aspect of embodiments of the present disclosure, there is provided a computer program product comprising instructions therein, which when executed by a processor of an electronic device, enable the electronic device to perform the speech signal enhancement method as set forth in any one of the preceding claims.
The technical scheme provided by the embodiment of the disclosure at least brings the following beneficial effects:
firstly, a voice signal set, a reference signal corresponding to the voice signal set and an initial enhanced voice signal corresponding to the voice signal set are acquired; then the spectrum information of each voice signal in the voice signal set, the spectrum information of the reference signal and the spectrum information of the initial enhanced voice signal are input into a trained first voice enhancement model to obtain target spectrum information, where the number of pieces of target spectrum information is smaller than the number of pieces of spectrum information input to the first voice enhancement model; a target amplitude spectrum in the target spectrum information is then input into a trained second voice enhancement model to obtain voice masking information; and finally, the spectrum information of the target voice signal in the voice signal set is transformed according to the voice masking information to obtain a target enhanced voice signal corresponding to the voice signal set. In this way, during voice signal enhancement, the first voice enhancement model outputs a reduced number of pieces of target spectrum information, and the second voice enhancement model then processes only the target amplitude spectrum in that target spectrum information to obtain the voice masking information. Because the reduced set of target spectrum information is produced first and only its amplitude spectrum is processed afterwards, there is no need to run a series of processing steps on the spectrum information of every voice signal, nor to output complex masking information for each voice signal; this simplifies the voice signal enhancement process and reduces its computational complexity. Moreover, obtaining the target enhanced voice signal requires only the output voice masking information and the spectrum information of the target voice signal in the voice signal set, without considering complex masking information and spectrum information of every voice signal, which further reduces the computational complexity of voice signal enhancement.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the disclosure.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the disclosure and together with the description, serve to explain the principles of the disclosure and do not constitute an undue limitation on the disclosure.
Fig. 1 is a flow chart illustrating a method of speech signal enhancement according to an exemplary embodiment.
Fig. 2 is a block diagram of a microphone array shown in accordance with an exemplary embodiment.
Fig. 3 is a flowchart illustrating the obtaining of a speech mask and an interference mask according to an exemplary embodiment.
Fig. 4 is a flowchart illustrating steps for obtaining speech masking information, according to an exemplary embodiment.
FIG. 5 is a flowchart illustrating training steps for a first speech enhancement model and a second speech enhancement model, according to an exemplary embodiment.
Fig. 6 is a flowchart illustrating another speech signal enhancement method according to an exemplary embodiment.
Fig. 7 is a block diagram illustrating a speech signal enhancement apparatus according to an exemplary embodiment.
Fig. 8 is a block diagram of an electronic device, according to an example embodiment.
Detailed Description
In order to enable those skilled in the art to better understand the technical solutions of the present disclosure, the technical solutions of the embodiments of the present disclosure will be clearly and completely described below with reference to the accompanying drawings.
It should be noted that the terms "first," "second," and the like in the description and claims of the present disclosure and in the foregoing figures are used for distinguishing between similar objects and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used may be interchanged where appropriate such that the embodiments of the disclosure described herein may be capable of operation in sequences other than those illustrated or described herein. The implementations described in the following exemplary examples are not representative of all implementations consistent with the present disclosure. Rather, they are merely examples of apparatus and methods consistent with some aspects of the present disclosure as detailed in the accompanying claims.
It should be further noted that, the user information (including, but not limited to, user equipment information, user personal information, etc.) and the data (including, but not limited to, data for presentation, analyzed data, etc.) related to the present disclosure are information and data authorized by the user or sufficiently authorized by each party.
Fig. 1 is a flowchart illustrating a voice signal enhancement method according to an exemplary embodiment. As shown in fig. 1, the method is used in a terminal; it will be appreciated that the method may also be applied to a server, or to a system comprising a terminal and a server and implemented through interaction between the terminal and the server. The terminal may be, but is not limited to, a personal computer, a notebook computer, a smartphone or a tablet computer, and the server may be implemented as a standalone server or as a server cluster formed by a plurality of servers. In the present exemplary embodiment, the method includes the following steps:
in step S110, a set of speech signals, a reference signal corresponding to the set of speech signals, and an initial enhanced speech signal corresponding to the set of speech signals are acquired.
The voice signal set includes a plurality of voice signals, and specifically includes a plurality of near-end microphone signals, such as microphone signals collected by each microphone in the microphone array. In a meeting room scenario, the microphone array is typically placed at the forefront of the meeting room for improving the signal-to-noise ratio, and includes a number of microphones greater than or equal to 2, such as the linear microphone array and the annular microphone array including 6 microphones shown in fig. 2.
The reference signal refers to a far-end reference signal, i.e., a voice signal received from the far end, such as a far-end microphone signal. The reference signal is played through the near-end loudspeaker and is then picked up by the near-end microphones, forming an echo signal. It should be noted that the reference signal may also be referred to as the near-end loudspeaker signal.
It should be noted that, when the near end (also referred to as the local end) and the far end (i.e., other ends) conduct voice communication, the near-end microphones collect the near-end voice signal, which is sent to the far end and played through the far-end loudspeaker; meanwhile, the near-end loudspeaker plays the voice signal sent from the far end, such as the far-end voice signal collected by the far-end microphone.
The initial enhanced speech signal refers to a voice signal from which the linear echo has been cancelled, specifically a voice signal obtained by processing each voice signal in the voice signal set and the reference signal through linear AEC (Acoustic Echo Cancellation).
It should be noted that the number of reference signals corresponding to the voice signal set and the number of initial enhanced voice signals corresponding to the voice signal set are both 1. Assuming that the voice signal set includes 6 voice signals, a total of 8 channels of signals are therefore obtained (6 voice signals, 1 reference signal and 1 initial enhanced voice signal).
The terminal responds to the signal enhancement request to obtain a voice signal set to be enhanced and reference signals corresponding to the voice signal set, and performs linear echo cancellation processing on each voice signal and the reference signals in the voice signal set to obtain a voice signal with linear echo eliminated, and the voice signal is used as an initial enhanced voice signal corresponding to the voice signal set.
For example, referring to fig. 3, in a conference room scenario, a terminal acquires each microphone signal acquired by a microphone array and reference signals corresponding to the microphone signals, and then inputs the microphone signals and the reference signals into a linear acoustic echo cancellation model to perform linear echo cancellation processing, so as to obtain a voice signal for canceling linear echo, which is used as an initial enhanced voice signal corresponding to the microphone signals.
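The patent treats the linear acoustic echo cancellation stage as a given component and does not specify its algorithm. As a rough, hypothetical illustration of what such a stage computes, the sketch below uses a time-domain NLMS adaptive filter, a common linear AEC baseline; the function name, filter length and step size are assumptions for illustration, not details from the patent.

```python
import numpy as np

def nlms_aec(mic: np.ndarray, ref: np.ndarray, taps: int = 256,
             mu: float = 0.1, eps: float = 1e-8) -> np.ndarray:
    """Subtract a linear echo estimate of `ref` from `mic` (NLMS sketch)."""
    w = np.zeros(taps)                       # adaptive filter coefficients
    out = np.zeros_like(mic, dtype=float)
    ref_pad = np.concatenate([np.zeros(taps - 1), ref.astype(float)])
    for n in range(len(mic)):
        x = ref_pad[n:n + taps][::-1]        # latest `taps` reference samples
        e = mic[n] - w @ x                   # error = echo-suppressed sample
        w += mu * e * x / (x @ x + eps)      # normalized LMS weight update
        out[n] = e
    return out                               # initial enhanced voice signal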
In step S120, the spectrum information of each speech signal in the speech signal set, the spectrum information of the reference signal and the spectrum information of the initial enhanced speech signal are input into the trained first speech enhancement model to obtain target spectrum information; the amount of target spectral information is smaller than the amount of spectral information input to the first speech enhancement model.
The spectrum information of the voice signal refers to a complex spectrum of the voice signal, specifically including a magnitude spectrum and a phase spectrum, and may be obtained by performing STFT (Short-Time Fourier Transform) processing on the voice signal.
The spectrum information of the reference signal refers to a complex spectrum of the reference signal, specifically including a magnitude spectrum and a phase spectrum, and may be obtained by performing STFT processing on the reference signal.
The spectrum information of the initial enhancement voice signal refers to a complex spectrum of the initial enhancement voice signal, specifically including an amplitude spectrum and a phase spectrum, and can be obtained by performing STFT processing on the initial enhancement voice signal.
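A minimal sketch of how these three kinds of spectrum information could be computed and stacked into the model input, assuming PyTorch, a 512-point FFT, a 256-sample hop and a 6-microphone array; all of these parameter choices are assumptions, since the patent does not fix them.

```python
import torch

def complex_spectra(signals: torch.Tensor, n_fft: int = 512,
                    hop: int = 256) -> torch.Tensor:
    """signals: (channels, samples) -> complex STFT of shape
    (channels, n_fft // 2 + 1, frames)."""
    window = torch.hann_window(n_fft)
    return torch.stft(signals, n_fft=n_fft, hop_length=hop,
                      window=window, return_complex=True)

# 6 near-end microphone signals + 1 far-end reference + 1 linear-AEC
# output = 8 input channels for the first speech enhancement model.
mics = torch.randn(6, 16000)
ref = torch.randn(1, 16000)
aec_out = torch.randn(1, 16000)
spec = complex_spectra(torch.cat([mics, ref, aec_out], dim=0))
print(spec.shape)  # torch.Size([8, 257, 63])
```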
The first speech enhancement model refers to a network model for performing beam forming operations, such as the complex CNN (convolutional neural network) shown in fig. 3. The first voice enhancement model can accurately locate the voice source, avoiding the problem of a beam pointing in the direction of a noise source, and generates a specified number of pieces of target spectrum information; the specified number is smaller than the number of pieces of spectrum information input to the first voice enhancement model, which reduces the amount of spectrum information to be processed subsequently and thus the computational complexity. For example, the number of pieces of spectrum information input to the first speech enhancement model is 8, and the number of pieces of target spectrum information output is 4. In a practical scenario, referring to fig. 3, the number of output channels of the complex convolutional neural network that performs the beam forming operation and outputs the complex spectrum (i.e., the target spectrum information) can be regarded as the number of beam directions, such as 4 channels.
The target spectrum information refers to enhanced spectrum information regenerated through the first voice enhancement model, such as a target complex spectrum. The number of target spectrum information may be referred to as the number of channels of target spectrum information, such as 4, indicating that the number of channels of target spectrum information is 4.
The terminal performs STFT processing on each voice signal, the reference signal and the initial enhancement voice signal in the voice signal set respectively to obtain frequency spectrum information of each voice signal, frequency spectrum information of the reference signal and frequency spectrum information of the initial enhancement voice signal in the voice signal set; and then, the spectrum information of each voice signal in the voice signal set, the spectrum information of the reference signal and the spectrum information of the initial enhanced voice signal are input into a first voice enhancement model which is trained to carry out beam forming operation, so that enhanced spectrum information is obtained and is used as target spectrum information.
For example, referring to fig. 3, the terminal performs short-time fourier transform processing on each microphone signal, the reference signal, and the voice signal for removing the linear echo acquired by the microphone array, to obtain spectrum information of each microphone signal, spectrum information of the reference signal, and spectrum information of the voice signal for removing the linear echo; and then, the spectrum information of each microphone signal, the spectrum information of the reference signal and the spectrum information of the voice signal for eliminating the linear echo are input into a complex convolution neural network to carry out beam forming operation, so as to obtain target spectrum information.
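The patent identifies the first model only as a complex convolutional neural network with more input channels (e.g., 8) than output channels (e.g., 4). The sketch below shows one standard way to realize a complex convolution from two real-valued convolutions; treating the beamformer as a single such layer, and the kernel size, are illustrative assumptions.

```python
import torch
import torch.nn as nn

class ComplexConv2d(nn.Module):
    """Complex convolution built from two real convolutions:
    (Wr + jWi)(Xr + jXi) = (WrXr - WiXi) + j(WrXi + WiXr)."""
    def __init__(self, in_ch: int, out_ch: int):
        super().__init__()
        self.conv_r = nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1)
        self.conv_i = nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        xr, xi = x.real, x.imag
        yr = self.conv_r(xr) - self.conv_i(xi)
        yi = self.conv_r(xi) + self.conv_i(xr)
        return torch.complex(yr, yi)

# 8 input spectra in, 4 target spectra out, as in the example above.
beamformer = ComplexConv2d(in_ch=8, out_ch=4)
x = torch.randn(1, 8, 257, 63, dtype=torch.complex64)  # (batch, ch, freq, frames)
target_spec = beamformer(x)
print(target_spec.shape)  # torch.Size([1, 4, 257, 63])
```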
In step S130, the target magnitude spectrum in the target spectrum information is input into the trained second speech enhancement model to obtain speech masking information.
The voice masking information refers to a voice masking matrix capable of realizing a clean voice extraction function, and is specifically used for extracting a clean voice signal in voice signals, and can be represented by a voice mask.
The target magnitude spectrum refers to a magnitude spectrum obtained after the logarithm operation is performed on an initial magnitude spectrum (i.e., an original magnitude spectrum) in the target spectrum information, and specifically refers to a logarithmic magnitude spectrum. By converting the initial amplitude spectrum into the target amplitude spectrum, the subsequent processing of the second speech enhancement model may be facilitated.
The second speech enhancement model refers to a network model for implementing noise reduction and echo cancellation, for example, a real-number-based fusion network for noise reduction and echo cancellation shown in fig. 3 may be implemented through a neural network or a deep learning network. The second speech enhancement model is specifically for outputting speech masking information and interference masking information. The interference masking information refers to an interference masking matrix capable of implementing an interference signal (such as a noise signal and an echo signal) extraction function, and is specifically used for extracting an interference signal in a voice signal, and can be represented by an interference mask.
In the prior art, speech enhancement processes the spectrum information of every signal and outputs complex masking information for multiple speech signals; in contrast, the second voice enhancement model of the present disclosure processes the target amplitude spectrum in the target spectrum information and outputs only one piece of voice masking information, which helps reduce the computational complexity.
The terminal converts the original amplitude spectrum in the target spectrum information to obtain a target amplitude spectrum, and then inputs the target amplitude spectrum into a trained second voice enhancement model to perform a series of processing to obtain voice masking information.
For example, referring to fig. 3, the terminal performs a logarithmic operation on an original amplitude spectrum in target spectrum information output by the complex convolutional neural network to obtain a logarithmic amplitude spectrum, and inputs the logarithmic amplitude spectrum to a noise reduction and echo cancellation fusion network to obtain a speech mask.
In step S140, according to the speech masking information, the spectral information of the target speech signal in the speech signal set is transformed, so as to obtain a target enhanced speech signal corresponding to the speech signal set.
The target voice signal refers to the voice signal with the best voice quality in the voice signal set, specifically the middle voice signal in the voice signal set. For example, for a linear microphone array, the target voice signal is the voice signal collected by the middle microphone: assuming the linear microphone array includes 6 microphones, the target voice signal is the voice signal collected by the 3rd or 4th microphone; assuming the linear microphone array includes 7 microphones, the target voice signal is the voice signal collected by the 4th microphone. For an annular microphone array, the target voice signal is the voice signal collected by any one of the microphones.
The transformation process is ISTFT (Inverse Short-Time Fourier Transform) processing.
The target enhanced speech signal refers to a final enhanced speech signal, specifically, a speech signal obtained by multiplying speech masking information with spectrum information of a target speech signal in a speech signal set and then performing ISTFT processing.
The terminal determines the spectrum information of the target voice signal from the spectrum information of each voice signal in the voice signal set, and multiplies the voice masking information with the spectrum information of the target voice signal to obtain processed spectrum information; and finally, carrying out ISTFT processing on the processed spectrum information to obtain target enhanced voice signals corresponding to the voice signal set.
For example, referring to fig. 3, the terminal multiplies the speech mask output by the noise reduction and echo cancellation fusion network by the spectrum information of the middle microphone signal in the microphone signals, and performs ISTFT processing to obtain the target enhanced microphone signal.
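A minimal sketch of this masking-and-reconstruction step, under the same assumed STFT parameters as above; `apply_mask_and_reconstruct` is a hypothetical helper name.

```python
import torch

def apply_mask_and_reconstruct(target_spec: torch.Tensor, mask: torch.Tensor,
                               n_fft: int = 512, hop: int = 256) -> torch.Tensor:
    """Multiply the middle microphone's complex spectrum by the real-valued
    speech mask, then invert with ISTFT to get the target enhanced signal."""
    masked = target_spec * mask               # element-wise per (freq, frame)
    window = torch.hann_window(n_fft)
    return torch.istft(masked, n_fft=n_fft, hop_length=hop, window=window)
```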
In the above voice signal enhancement method, a voice signal set, a reference signal corresponding to the voice signal set and an initial enhanced voice signal corresponding to the voice signal set are first acquired; then the spectrum information of each voice signal in the voice signal set, the spectrum information of the reference signal and the spectrum information of the initial enhanced voice signal are input into a trained first voice enhancement model to obtain target spectrum information, where the number of pieces of target spectrum information is smaller than the number of pieces of spectrum information input to the first voice enhancement model; a target amplitude spectrum in the target spectrum information is then input into a trained second voice enhancement model to obtain voice masking information; and finally, the spectrum information of the target voice signal in the voice signal set is transformed according to the voice masking information to obtain a target enhanced voice signal corresponding to the voice signal set. In this way, during voice signal enhancement, the first voice enhancement model outputs a reduced number of pieces of target spectrum information, and the second voice enhancement model then processes only the target amplitude spectrum in that target spectrum information to obtain the voice masking information. Because the reduced set of target spectrum information is produced first and only its amplitude spectrum is processed afterwards, there is no need to run a series of processing steps on the spectrum information of every voice signal, nor to output complex masking information for each voice signal; this simplifies the voice signal enhancement process and reduces its computational complexity. Moreover, obtaining the target enhanced voice signal requires only the output voice masking information and the spectrum information of the target voice signal in the voice signal set, without considering complex masking information and spectrum information of every voice signal, which further reduces the computational complexity of voice signal enhancement.
In an exemplary embodiment, as shown in fig. 4, in step S130, a target amplitude spectrum in the target spectrum information is input into a trained second speech enhancement model to obtain speech masking information, which may be specifically implemented by the following steps:
in step S410, the target amplitude spectrum is input into the trained second speech enhancement model for performing a first feature extraction process, so as to obtain an initial audio feature of the target amplitude spectrum.
In step S420, a second feature extraction process is performed on the initial audio feature to obtain a target audio feature of the target amplitude spectrum.
In step S430, the target audio feature is classified to obtain speech masking information.
The initial audio feature refers to a shallow audio feature obtained by performing convolution processing on the target amplitude spectrum, such as an output result of a fourth convolution layer in fig. 3. The first feature extraction process is used to obtain initial audio features of the target amplitude spectrum.
The target audio feature is a deep audio feature obtained by further refining the initial audio feature, such as the output of the third gated recurrent unit (GRU) in fig. 3; in addition to the output of the third gated recurrent unit, the target audio feature may also include the outputs of the first and second gated recurrent units in fig. 3. The second feature extraction process is used to obtain the target audio feature of the target amplitude spectrum.
The classification processing is used to produce the voice masking information, i.e., the result of the classification processing is the voice masking information. For example, referring to fig. 3, a voice mask may be output through a first fully connected (FC) layer and an activation function (e.g., a sigmoid function).
The terminal inputs the target amplitude spectrum into a second voice enhancement model after training, and performs first feature extraction processing on the target amplitude spectrum through the second voice enhancement model to obtain initial audio features of the target amplitude spectrum; then, carrying out second feature extraction processing on the initial audio features of the target amplitude spectrum to obtain target audio features of the target amplitude spectrum; and finally, classifying and activating the target audio features to obtain voice masking information. The activation process is mainly used for normalization.
For example, referring to fig. 3, the terminal inputs the log-magnitude spectrum into a noise reduction and echo cancellation fusion network, and convolves the log-magnitude spectrum through a convolution layer in the noise reduction and echo cancellation fusion network to obtain an audio feature output by a fourth convolution layer, which is used as an initial audio feature of the log-magnitude spectrum; then, performing feature optimization processing on the initial audio features of the logarithmic magnitude spectrum through a gating circulation unit in a noise reduction and echo cancellation fusion network to obtain audio features output by a first gating circulation unit, audio features output by a second gating circulation unit and audio features output by a third gating circulation unit, and performing splicing processing on the audio features output by the first gating circulation unit, the audio features output by the second gating circulation unit and the audio features output by the third gating circulation unit to obtain spliced audio features serving as target audio features of the logarithmic magnitude spectrum; and finally, carrying out classification processing and activation processing on the target audio features of the log-magnitude spectrum through a first full-connection layer and an activation function in the noise reduction and echo cancellation fusion network to obtain a voice mask.
According to the technical scheme provided by the embodiment of the disclosure, only the target amplitude spectrum in the target frequency spectrum information is input into the trained second voice enhancement model for feature extraction processing, and the frequency spectrum information of each voice signal is not required to be input into the model; and finally, only one voice masking information is output, and complex masking information of each voice signal is not required to be output, so that the calculation amount during voice signal enhancement is reduced greatly, and the calculation complexity during voice signal enhancement is reduced.
In an exemplary embodiment, the trained second speech enhancement model includes a speech branching network and an interference branching network; in step S410, the target amplitude spectrum is input into the trained second speech enhancement model to perform the first feature extraction process, so as to obtain the initial audio feature of the target amplitude spectrum, which may be specifically implemented by the following contents: inputting the target amplitude spectrum into a voice branch network for feature extraction processing to obtain a first audio feature of the target amplitude spectrum, and inputting the target amplitude spectrum into an interference branch network for feature extraction processing to obtain a second audio feature of the target amplitude spectrum; carrying out fusion processing on the first audio feature and the second audio feature to obtain a first fusion audio feature; and inputting the first fusion audio features into a voice branch network for feature extraction processing to obtain initial audio features of the target amplitude spectrum.
The voice branch network is mainly used for obtaining voice masking information (i.e. voice mask), and specifically comprises a convolution layer, a gating circulation unit, a full connection layer and an activation function, wherein the number of the convolution layer and the gating circulation unit can be multiple. For example, referring to fig. 3, the voice branch network includes a first convolution layer, a second convolution layer, a third convolution layer, a fourth convolution layer, a first gating loop unit, a second gating loop unit, a third gating loop unit, a first full connection layer, and an activation function.
The interference branch network is mainly used for obtaining interference masking information (i.e. interference mask), and specifically includes a convolution layer, a gating circulation unit, a full connection layer and an activation function, where the number of the convolution layer and the gating circulation unit may be multiple. For example, referring to fig. 3, the interference branch network includes a fifth convolution layer, a sixth convolution layer, a seventh convolution layer, an eighth convolution layer, a fourth gating loop unit, a fifth gating loop unit, a sixth gating loop unit, a second full connection layer, and an activation function.
The voice branch network and the interference branch network are symmetrical in structure.
Wherein the first audio feature of the target amplitude spectrum refers to the audio feature output by the third convolution layer in fig. 3, and the second audio feature of the target amplitude spectrum refers to the audio feature output by the seventh convolution layer in fig. 3.
The first fused audio feature is an audio feature obtained by performing fusion processing (such as splicing processing) on the first audio feature and the second audio feature, such as splicing audio feature.
For example, referring to fig. 3, the terminal inputs the log-magnitude spectrum into the voice branch network, and performs multiple convolution processing on the log-magnitude spectrum through the first convolution layer, the second convolution layer and the third convolution layer, so as to obtain an audio feature output by the third convolution layer, and the audio feature is used as a first audio feature of the log-magnitude spectrum. Meanwhile, the terminal inputs the log-amplitude spectrum into an interference branch network, and carries out convolution processing on the log-amplitude spectrum for a plurality of times through a fifth convolution layer, a sixth convolution layer and a seventh convolution layer to obtain an audio characteristic output by the seventh convolution layer as a second audio characteristic of the log-amplitude spectrum. Then, the terminal performs splicing processing on the first audio feature and the second audio feature to obtain a spliced audio feature serving as a first fusion audio feature; and inputting the first fusion audio features into a fourth convolution layer, and carrying out convolution processing on the first fusion audio features through the fourth convolution layer to obtain audio features output by the fourth convolution layer, wherein the audio features are used as initial audio features of the logarithmic magnitude spectrum.
According to the technical scheme provided by the embodiment of the disclosure, when the initial audio feature of the target amplitude spectrum is obtained, the first audio feature of the target amplitude spectrum output by the voice branch network and the second audio feature of the target amplitude spectrum output by the interference branch network are comprehensively considered, so that the determination accuracy of the initial audio feature is improved, the subsequently obtained voice masking information is more accurate, the noise reduction and echo elimination quality is improved, and the voice enhancement effect is further improved.
In an exemplary embodiment, in step S420, the second feature extraction process is performed on the initial audio feature to obtain the target audio feature of the target amplitude spectrum, which may be specifically implemented by the following contents: inputting the initial audio features into a voice branch network for feature extraction processing to obtain a third audio feature of a target amplitude spectrum, and inputting the first fused audio features into an interference branch network for feature extraction processing to obtain a fourth audio feature of the target amplitude spectrum; fusing the third audio feature and the fourth audio feature to obtain a second fused audio feature; and inputting the second fusion audio features into a voice branch network for feature extraction processing to obtain target audio features of a target amplitude spectrum.
Wherein, the third audio feature of the target amplitude spectrum refers to the audio feature output by the second gating and circulating unit in fig. 3, and the fourth audio feature of the target amplitude spectrum refers to the audio feature output by the fifth gating and circulating unit in fig. 3.
The second fused audio feature is an audio feature obtained by performing fusion processing (such as splicing processing) on the third audio feature and the fourth audio feature, such as a spliced audio feature.
For example, referring to fig. 3, the terminal inputs the initial audio feature into the voice branch network, and performs feature optimization processing on the initial audio feature of the log-magnitude spectrum through the first gating circulation unit and the second gating circulation unit, so as to obtain the audio feature output by the second gating circulation unit as the third audio feature of the log-magnitude spectrum. Meanwhile, the terminal inputs the initial audio features into an interference branch network, and carries out convolution processing and feature optimization processing on the first fusion audio features through an eighth convolution layer, a fourth gating circulation unit and a fifth gating circulation unit to obtain audio features output by the fifth gating circulation unit as fourth audio features of the logarithmic magnitude spectrum. Then, the terminal performs splicing processing on the third audio feature and the fourth audio feature to obtain a spliced audio feature serving as a second fusion audio feature; and inputting the second fused audio features into a third gating circulation unit, and performing feature optimization processing on the second fused audio features through the third gating circulation unit to obtain audio features output by the third gating circulation unit. And finally, the terminal performs splicing processing on the audio features output by the first gating circulating unit, the audio features output by the second gating circulating unit and the audio features output by the third gating circulating unit to obtain spliced audio features serving as target audio features of the log-magnitude spectrum.
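Combining the two passages above, the topology of the noise reduction and echo cancellation fusion network, as far as it is described here, can be sketched as follows. The channel widths, kernel sizes and ReLU activations are assumptions, and the symmetric interference-mask head (the sixth gated recurrent unit and second fully connected layer used during training) is omitted for brevity.

```python
import torch
import torch.nn as nn

def conv_block(in_ch: int, out_ch: int) -> nn.Sequential:
    return nn.Sequential(nn.Conv1d(in_ch, out_ch, kernel_size=3, padding=1),
                         nn.ReLU())

class DualBranchMaskNet(nn.Module):
    """Dual-branch denoise/AEC fusion network following the topology above.
    Input: log-magnitude features of shape (batch, in_feats, frames), where
    in_feats stacks the beamformer channels over frequency bins (assumption)."""
    def __init__(self, in_feats: int = 4 * 257, freq_bins: int = 257,
                 hidden: int = 64):
        super().__init__()
        # speech branch: conv layers 1-3, conv layer 4 after the first fusion
        self.s_conv123 = nn.Sequential(conv_block(in_feats, hidden),
                                       conv_block(hidden, hidden),
                                       conv_block(hidden, hidden))
        self.s_conv4 = conv_block(2 * hidden, hidden)
        self.s_gru1 = nn.GRU(hidden, hidden, batch_first=True)
        self.s_gru2 = nn.GRU(hidden, hidden, batch_first=True)
        self.s_gru3 = nn.GRU(2 * hidden, hidden, batch_first=True)
        self.s_fc = nn.Linear(3 * hidden, freq_bins)
        # interference branch: conv layers 5-7, conv layer 8, GRUs 4-5
        self.i_conv567 = nn.Sequential(conv_block(in_feats, hidden),
                                       conv_block(hidden, hidden),
                                       conv_block(hidden, hidden))
        self.i_conv8 = conv_block(2 * hidden, hidden)
        self.i_gru4 = nn.GRU(hidden, hidden, batch_first=True)
        self.i_gru5 = nn.GRU(hidden, hidden, batch_first=True)

    def forward(self, log_mag: torch.Tensor) -> torch.Tensor:
        f1 = self.s_conv123(log_mag)              # first audio feature
        f2 = self.i_conv567(log_mag)              # second audio feature
        fused1 = torch.cat([f1, f2], dim=1)       # first fused audio feature
        init = self.s_conv4(fused1)               # initial audio feature
        g1, _ = self.s_gru1(init.transpose(1, 2))
        g2, _ = self.s_gru2(g1)                   # third audio feature
        h, _ = self.i_gru4(self.i_conv8(fused1).transpose(1, 2))
        g5, _ = self.i_gru5(h)                    # fourth audio feature
        fused2 = torch.cat([g2, g5], dim=-1)      # second fused audio feature
        g3, _ = self.s_gru3(fused2)
        target = torch.cat([g1, g2, g3], dim=-1)  # target audio feature
        mask = torch.sigmoid(self.s_fc(target))   # speech mask in [0, 1]
        return mask.transpose(1, 2)               # (batch, freq_bins, frames)
```

Multiplying this mask by the middle microphone's spectrum and applying ISTFT, as in the earlier snippet, then yields the target enhanced voice signal.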
According to the technical scheme provided by the embodiment of the disclosure, when the target audio feature of the target amplitude spectrum is obtained, the third audio feature of the target amplitude spectrum output by the voice branch network and the fourth audio feature of the target amplitude spectrum output by the interference branch network are comprehensively considered, so that the determination accuracy of the target audio feature is improved, the subsequently obtained voice masking information is more accurate, the noise reduction and echo elimination quality is further improved, and the voice enhancement effect is further improved.
In an exemplary embodiment, in step S140, according to the speech masking information, the spectrum information of the target speech signal in the speech signal set is transformed to obtain the target enhanced speech signal corresponding to the speech signal set, which may be specifically implemented by: the method comprises the steps of performing fusion processing on spectrum information and voice masking information of a target voice signal in a voice signal set to obtain fused spectrum information of the target voice signal; and carrying out transformation processing on the fused spectrum information to obtain a target enhanced voice signal corresponding to the voice signal set.
Wherein the fusion process refers to multiplication. The fused spectrum information of the target speech signal refers to the multiplied spectrum information of the target speech signal. The conversion process refers to an ISTFT process.
The terminal multiplies the spectrum information of the target voice signal in the voice signal set with the voice masking information to obtain multiplied spectrum information of the target voice signal; and carrying out ISTFT processing on the multiplied spectrum information of the target voice signals to obtain target enhanced voice signals corresponding to the voice signal set.
According to the technical scheme provided by the embodiment of the disclosure, when the target enhanced voice signal is obtained, only the spectrum information of the target voice signal in the voice signal set and the output voice masking information are considered, and complex masking information and spectrum information of each voice signal are not required to be considered, so that the computational complexity in the process of enhancing the voice signal is further reduced. Meanwhile, the target voice signal with the highest voice effect in the voice signal set and the voice masking information output by the model are utilized, so that the voice enhancement effect is further improved.
In an exemplary embodiment, the initial enhanced speech signal is obtained by: and inputting each voice signal and the reference signal in the voice signal set into a trained third voice enhancement model to obtain an initial enhanced voice signal.
The third speech enhancement model refers to a network model for performing linear echo cancellation, such as a linear AEC model, and may be implemented through a convolutional neural network or a deep learning network.
The terminal inputs each voice signal and the reference signal in the voice signal set into the trained third voice enhancement model for linear echo cancellation processing to obtain a voice signal for canceling the linear echo, and the voice signal is used as an initial enhanced voice signal.
For example, referring to fig. 3, the terminal inputs each microphone signal and reference signal into a linear AEC model to obtain a voice signal to cancel the linear echo.
Further, the target amplitude spectrum in the target spectrum information is obtained by: extracting an initial amplitude spectrum in target frequency spectrum information; and converting the initial amplitude spectrum to obtain a target amplitude spectrum.
The conversion process refers to a logarithmic process.
The terminal extracts the magnitude component of the target spectrum information, that is, the original magnitude spectrum in the target spectrum information, as the initial amplitude spectrum, and performs a logarithmic operation on it to obtain the target amplitude spectrum, i.e., target amplitude spectrum = log(initial amplitude spectrum).
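A one-line sketch of this conversion; the small epsilon guarding against log(0) is an implementation assumption, not part of the patent.

```python
import torch

def target_amplitude_spectrum(target_spec: torch.Tensor,
                              eps: float = 1e-7) -> torch.Tensor:
    """Initial (original) magnitude spectrum -> logarithmic magnitude
    spectrum: target amplitude spectrum = log(initial amplitude spectrum)."""
    return torch.log(target_spec.abs() + eps)
```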
According to the technical scheme provided by the embodiment of the disclosure, each voice signal and the reference signal in the voice signal set are input into a trained third voice enhancement model to obtain an initial enhanced voice signal; therefore, the voice signal is initially enhanced, so that the voice signal is enhanced for multiple times, and the voice enhancement effect can be further improved. In addition, the initial amplitude spectrum in the target frequency spectrum information is firstly extracted and then is converted into the target amplitude spectrum, so that the subsequent model only needs to process the target amplitude spectrum, the calculated amount of the subsequent model is reduced, and the calculation complexity in the process of enhancing the voice signal is further reduced.
In an exemplary embodiment, as shown in fig. 5, the method for enhancing a speech signal provided in the present disclosure further includes a training step of the first speech enhancement model and the second speech enhancement model, which may be specifically implemented by the following steps:
in step S510, a set of sample speech signals, a sample reference signal corresponding to the set of sample speech signals, and a sample initial enhanced speech signal corresponding to the set of sample speech signals are obtained.
In step S520, the spectrum information of each sample speech signal in the sample speech signal set, the spectrum information of the sample reference signal, and the spectrum information of the sample initial enhancement speech signal are input into the first speech enhancement model to be trained, so as to obtain sample target spectrum information.
In step S530, the sample target amplitude spectrum in the sample target spectrum information is input into the second speech enhancement model to be trained, so as to obtain predicted speech masking information and predicted interference masking information.
In step S540, according to the predicted speech masking information, the spectrum information of the sample target speech signal in the sample speech signal set is transformed to obtain a predicted enhanced speech signal corresponding to the sample speech signal set, and according to the predicted interference masking information, the spectrum information of the sample target speech signal is transformed to obtain a predicted interference speech signal corresponding to the sample speech signal set.
In step S550, the first speech enhancement model to be trained and the second speech enhancement model to be trained are jointly trained according to the difference between the predicted enhanced speech signal and the clean speech signal corresponding to the sample speech signal set and the difference between the predicted interfering speech signal and the interfering speech signal corresponding to the sample speech signal set, so as to obtain a trained first speech enhancement model and a trained second speech enhancement model.
Wherein, the sample voice signal set refers to a voice signal set participating in training. The sample reference signal refers to a reference signal corresponding to a set of speech signals involved in training.
The first voice enhancement model to be trained refers to a complex convolutional neural network to be trained; the second speech enhancement model to be trained refers to a noise reduction and echo cancellation fusion network to be trained.
Wherein the predicted speech masking information refers to a predicted speech mask; the predicted interference masking information refers to a predicted interference mask.
The interfering voice signal refers to a noise signal and an echo signal.
The terminal acquires a sample voice signal set and the corresponding sample reference signal from a local database, performs linear echo cancellation processing on each sample voice signal in the sample voice signal set together with the sample reference signal, and takes the resulting voice signal with the linear echo removed as the sample initial enhanced voice signal corresponding to the sample voice signal set.

Next, the terminal performs STFT processing on each sample voice signal in the sample voice signal set, the sample reference signal, and the sample initial enhanced voice signal to obtain their respective spectrum information, and inputs all of this spectrum information into the first voice enhancement model to be trained to obtain sample target spectrum information.

The terminal then extracts the original amplitude spectrum from the sample target spectrum information, performs a logarithmic operation on it to obtain the sample target amplitude spectrum, and inputs the sample target amplitude spectrum into the second voice enhancement model to be trained to obtain predicted voice masking information and predicted interference masking information.

The terminal multiplies the spectrum information of the sample target voice signal in the sample voice signal set by the predicted voice masking information to obtain a first multiplication result, and performs ISTFT processing on it to obtain the predicted enhanced voice signal corresponding to the sample voice signal set; meanwhile, the terminal multiplies the spectrum information of the sample target voice signal by the predicted interference masking information to obtain a second multiplication result, and performs ISTFT processing on it to obtain the predicted interference voice signal corresponding to the sample voice signal set.

Finally, the terminal obtains a first loss value from the difference between the predicted enhanced voice signal and the clean voice signal corresponding to the sample voice signal set, obtains a second loss value from the difference between the predicted interference voice signal and the interference voice signal corresponding to the sample voice signal set, and fuses the two loss values (for example, by weighted summation) into a target loss value. The first voice enhancement model to be trained and the second voice enhancement model to be trained are jointly trained according to the target loss value until a training end condition is reached, such as a preset number of training iterations or a target loss value below a preset threshold; the models that satisfy the training end condition are taken as the trained first voice enhancement model and the trained second voice enhancement model.
Further, when the target loss value is greater than or equal to the preset threshold, the terminal adjusts the model parameters of the first and second voice enhancement models to be trained according to the target loss value and retrains the parameter-adjusted models. This continues until the target loss value computed from the retrained first and second voice enhancement models falls below the preset threshold, at which point the retrained first voice enhancement model is taken as the trained first voice enhancement model and the retrained second voice enhancement model as the trained second voice enhancement model.
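A hedged sketch of this joint objective in PyTorch follows; the weighted summation mirrors the fusion described above, while the weights and the choice of L1 as the per-signal difference are illustrative assumptions:

    import torch
    import torch.nn.functional as F

    def target_loss(pred_enh, clean, pred_interf, interf, w1=1.0, w2=1.0):
        loss_speech = F.l1_loss(pred_enh, clean)      # first loss value
        loss_interf = F.l1_loss(pred_interf, interf)  # second loss value
        return w1 * loss_speech + w2 * loss_interf    # fused target loss value

    def joint_train_step(optimizer, pred_enh, clean, pred_interf, interf):
        # one update of an optimizer built over the parameters of both models,
        # so the gradient flows through the first and second models jointly
        loss = target_loss(pred_enh, clean, pred_interf, interf)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        return loss.item()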
According to the technical scheme provided by the embodiments of the present disclosure, the first and second voice enhancement models to be trained are trained iteratively with the sample voice signal set, the corresponding sample reference signal, and the corresponding sample initial enhanced voice signal. This improves the accuracy of the voice masking information output by the trained first and second voice enhancement models; in turn, the target enhanced voice signal obtained from this voice masking information has higher voice quality, further improving the voice enhancement effect.
Fig. 6 is a flowchart illustrating another voice signal enhancement method according to an exemplary embodiment. As shown in fig. 6, the method is used in a terminal and includes the following steps:
in step S610, a set of speech signals and reference signals corresponding to the set of speech signals are acquired.
In step S620, each speech signal in the speech signal set and the reference signal are input into the trained third speech enhancement model to obtain an initial enhanced speech signal corresponding to the speech signal set.
In step S630, the spectrum information of each speech signal in the speech signal set, the spectrum information of the reference signal and the spectrum information of the initial enhanced speech signal are input into the trained first speech enhancement model to obtain target spectrum information; the amount of target spectral information is smaller than the amount of spectral information input to the first speech enhancement model.
In step S640, an initial magnitude spectrum in the target spectrum information is extracted; and converting the initial amplitude spectrum to obtain a target amplitude spectrum.
In step S650, inputting the target amplitude spectrum into the trained second speech enhancement model for performing a first feature extraction process, so as to obtain an initial audio feature of the target amplitude spectrum; and carrying out second feature extraction processing on the initial audio features to obtain target audio features of a target amplitude spectrum.
In step S660, the target audio feature is classified to obtain speech masking information.
In step S670, the spectrum information of the target speech signal in the speech signal set and the speech masking information are fused to obtain the fused spectrum information of the target speech signal.
In step S680, the fused spectrum information is transformed to obtain a target enhanced speech signal corresponding to the speech signal set.
In the above voice signal enhancement method, the first voice enhancement model first outputs a reduced number of target spectrum information, and the second voice enhancement model then processes the target amplitude spectrum in that target spectrum information to obtain the voice masking information. In other words, the method first reduces the number of spectra and then processes only the target amplitude spectrum; it does not need to run a full processing chain on the spectrum information of every voice signal or to output complex masking information for every voice signal. This simplifies the voice signal enhancement process and reduces its computational complexity. Meanwhile, when obtaining the target enhanced voice signal, only the output voice masking information and the spectrum information of the target voice signal in the voice signal set are used, and the complex masking information and spectrum information of every voice signal need not be considered, which further reduces the computational complexity of voice signal enhancement.
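Under the assumption that the three trained models are available as callables (the names and parameters below are illustrative, not from the disclosure), the whole pipeline of steps S610 to S680 can be sketched as:

    import numpy as np
    from scipy.signal import stft, istft

    def enhance(mics, ref, aec_model, beam_model, mask_model,
                fs=16000, nperseg=512, target_idx=0):
        # S620: linear echo cancellation on the raw microphone signals
        init_enh = aec_model(mics, ref)
        # STFT of every input signal (spectra are what S630 consumes)
        specs = [stft(m, fs, nperseg=nperseg)[2] for m in mics]
        specs += [stft(ref, fs, nperseg=nperseg)[2],
                  stft(init_enh, fs, nperseg=nperseg)[2]]
        # S630: the first model outputs fewer spectra than it consumes
        target_spec = beam_model(np.stack(specs))
        # S640: initial magnitude spectrum and logarithmic conversion
        log_mag = np.log(np.abs(target_spec) + 1e-8)
        # S650/S660: the second model predicts the voice masking information
        speech_mask = mask_model(log_mag)
        # S670/S680: mask the target (e.g. centre) microphone spectrum, invert
        _, enhanced = istft(specs[target_idx] * speech_mask, fs, nperseg=nperseg)
        return enhanced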
To illustrate the voice signal enhancement method provided by the embodiments of the present disclosure more clearly, a specific embodiment is described below. In an exemplary embodiment, as shown in fig. 3, the present disclosure further provides a multi-microphone noise reduction and echo cancellation fusion method with low computational complexity: a complex convolution layer performs front-end beam processing and is connected in series with a real-valued noise reduction and echo cancellation fusion network, finally yielding a low-complexity deep-learning-based array algorithm. To combine multiple microphones with a deep learning algorithm at minimal computational cost, the complex spectra of the multiple microphones are fed directly into the network input layer, the front part of the network forms beams, and, by exploiting the differences among the noise, echo, and voice signals, the network directly outputs the masking values of the center microphone, thereby achieving better cancellation of noise and echo. The method specifically comprises the following steps:
referring to fig. 3, in a conference room scenario, the terminal acquires each microphone signal collected by a microphone array and the reference signal corresponding to the microphone signals. Each microphone signal and the reference signal are input into a linear AEC model for linear echo cancellation processing, yielding a microphone signal with the linear echo removed. STFT processing is then performed on each microphone signal, the reference signal, and the echo-cancelled microphone signal to obtain their respective spectrum information.

This spectrum information is input into the complex convolutional neural network, which performs the beamforming operation; the number of channels of the spectrum information it outputs corresponds to the number of beam directions. The original amplitude spectrum in the target spectrum information output by the complex convolutional neural network is subjected to a logarithmic operation to obtain a log amplitude spectrum, which is then input into a dual-branch noise reduction and echo cancellation fusion network composed of CNN and GRU modules.

The left and right CNN branches each have 4 layers, and the left and right GRU branches each have 3 layers. The left branch is defined as the voice branch: the outputs of its three GRUs are combined and enter a first fully connected layer. The right branch is defined as the interference branch (covering echo and noise): the output of its last GRU enters a second fully connected layer. Finally, a voice mask and an interference mask are obtained through the respective activation function layers.
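A minimal PyTorch sketch of the architecture just described may help; the complex convolution is realised as two real convolutions, and the dual-branch network has 4 CNN layers and 3 GRU layers per branch, with the three voice-branch GRU outputs concatenated into the first fully connected layer. All channel counts, hidden sizes, the single additive fusion point, and the sigmoid activations are illustrative assumptions rather than values from the disclosure:

    import torch
    import torch.nn as nn

    class ComplexConv2d(nn.Module):
        # (Wr + jWi)(xr + jxi) = (Wr xr - Wi xi) + j(Wr xi + Wi xr)
        def __init__(self, in_ch, out_ch, kernel, stride=1, padding=0):
            super().__init__()
            self.wr = nn.Conv2d(in_ch, out_ch, kernel, stride, padding)
            self.wi = nn.Conv2d(in_ch, out_ch, kernel, stride, padding)

        def forward(self, xr, xi):
            return self.wr(xr) - self.wi(xi), self.wr(xi) + self.wi(xr)

    class DualBranchFusionNet(nn.Module):
        def __init__(self, freq_bins=257, ch=16, hidden=128):
            super().__init__()
            def cnn_stack():
                layers, c = [], 1
                for _ in range(4):                      # 4 CNN layers per branch
                    layers += [nn.Conv2d(c, ch, 3, padding=1), nn.ReLU()]
                    c = ch
                return nn.Sequential(*layers)
            self.voice_cnn, self.interf_cnn = cnn_stack(), cnn_stack()
            feat = ch * freq_bins
            self.voice_gru = nn.ModuleList(
                nn.GRU(feat if i == 0 else hidden, hidden, batch_first=True)
                for i in range(3))                      # 3 GRU layers per branch
            self.interf_gru = nn.ModuleList(
                nn.GRU(feat if i == 0 else hidden, hidden, batch_first=True)
                for i in range(3))
            self.voice_fc = nn.Linear(3 * hidden, freq_bins)  # combines 3 GRU outputs
            self.interf_fc = nn.Linear(hidden, freq_bins)     # last GRU output only
            self.act = nn.Sigmoid()

        def forward(self, log_mag):                     # (batch, time, freq)
            x = log_mag.unsqueeze(1)                    # (batch, 1, time, freq)
            v = self.voice_cnn(x)
            i = self.interf_cnn(x)
            b, c, t, f = v.shape
            v = v.permute(0, 2, 1, 3).reshape(b, t, c * f)
            i = i.permute(0, 2, 1, 3).reshape(b, t, c * f)
            v = i = v + i                               # single fusion point (simplified)
            v_outs = []
            for v_gru, i_gru in zip(self.voice_gru, self.interf_gru):
                v, _ = v_gru(v)
                i, _ = i_gru(i)
                v_outs.append(v)
            voice_mask = self.act(self.voice_fc(torch.cat(v_outs, dim=-1)))
            interf_mask = self.act(self.interf_fc(i))
            return voice_mask, interf_mask

Note that the cross-branch fusion described in the embodiments is richer (fused features are re-injected into both branches at several stages); the single additive fusion above is a deliberate simplification of that ladder.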
According to the above multi-microphone noise reduction and echo cancellation fusion method, the multi-channel signals are exploited without a large increase in computational complexity, keeping the overall computational cost low. In addition, in a conference scenario, the cancellation quality for echo and noise is effectively improved, thereby effectively improving the conference experience.
It should be understood that, although the steps in the flowcharts of the above embodiments are shown sequentially as indicated by the arrows, they are not necessarily executed in that order. Unless explicitly stated herein, the execution order of the steps is not strictly limited, and the steps may be executed in other orders. Moreover, at least some of the steps in these flowcharts may include multiple sub-steps or stages, which are not necessarily performed at the same moment but may be performed at different moments; their execution order is not necessarily sequential, and they may be performed in turn or in alternation with at least part of the other steps or of the sub-steps or stages of other steps.
It should be understood that the same or similar parts of the method embodiments described above in this specification may be referred to one another; each embodiment focuses on its differences from the other embodiments, and for the common parts, reference may be made to the descriptions of the other method embodiments.
Based on the same inventive concept, the embodiments of the present disclosure also provide a voice signal enhancement apparatus for implementing the above-mentioned related voice signal enhancement method.
Fig. 7 is a block diagram illustrating a speech signal enhancement apparatus according to an exemplary embodiment. Referring to fig. 7, the apparatus includes a signal acquisition unit 710, a first enhancement unit 720, a second enhancement unit 730, and a transform processing unit 740.
The signal acquisition unit 710 is configured to perform acquisition of a set of speech signals, a reference signal corresponding to the set of speech signals, and an initial enhanced speech signal corresponding to the set of speech signals.
A first enhancement unit 720 configured to perform inputting the spectral information of each speech signal in the speech signal set, the spectral information of the reference signal, and the spectral information of the initial enhanced speech signal into the trained first speech enhancement model to obtain target spectral information; the amount of target spectral information is smaller than the amount of spectral information input to the first speech enhancement model.
And a second enhancement unit 730 configured to perform inputting the target amplitude spectrum in the target spectrum information into the trained second speech enhancement model to obtain speech masking information.
The transformation processing unit 740 is configured to perform transformation processing on the spectrum information of the target voice signal in the voice signal set according to the voice masking information, so as to obtain a target enhanced voice signal corresponding to the voice signal set.
In an exemplary embodiment, the second enhancement unit 730 is further configured to perform the first feature extraction process of inputting the target amplitude spectrum into the trained second speech enhancement model, so as to obtain the initial audio feature of the target amplitude spectrum; performing second feature extraction processing on the initial audio features to obtain target audio features of a target amplitude spectrum; and classifying the target audio features to obtain voice masking information.
In an exemplary embodiment, the trained second speech enhancement model includes a speech branching network and an interference branching network;
the second enhancement unit 730 is further configured to input the target amplitude spectrum into the voice branch network for feature extraction processing to obtain a first audio feature of the target amplitude spectrum, and to input the target amplitude spectrum into the interference branch network for feature extraction processing to obtain a second audio feature of the target amplitude spectrum; to fuse the first audio feature and the second audio feature to obtain a first fused audio feature; and to input the first fused audio feature into the voice branch network for feature extraction processing to obtain the initial audio feature of the target amplitude spectrum.
In an exemplary embodiment, the second enhancing unit 730 is further configured to perform a feature extraction process for inputting the initial audio feature into the voice branch network to obtain a third audio feature of the target amplitude spectrum, and input the first fused audio feature into the interfering branch network to perform a feature extraction process to obtain a fourth audio feature of the target amplitude spectrum; fusing the third audio feature and the fourth audio feature to obtain a second fused audio feature; and inputting the second fusion audio features into a voice branch network for feature extraction processing to obtain target audio features of a target amplitude spectrum.
In an exemplary embodiment, the transformation processing unit 740 is further configured to perform fusion processing on the spectrum information of the target voice signal and the voice masking information in the voice signal set, so as to obtain fused spectrum information of the target voice signal; and carrying out transformation processing on the fused spectrum information to obtain a target enhanced voice signal corresponding to the voice signal set.
In an exemplary embodiment, the speech signal enhancement apparatus further includes an initial enhancement unit configured to perform inputting each speech signal in the speech signal set and the reference signal into a trained third speech enhancement model to obtain an initial enhanced speech signal;
The voice signal enhancement device further includes a conversion processing unit configured to perform extraction of an initial magnitude spectrum in the target spectrum information; and converting the initial amplitude spectrum to obtain a target amplitude spectrum.
In an exemplary embodiment, the speech signal enhancement apparatus further comprises a model training unit configured to perform obtaining a set of sample speech signals, a sample reference signal corresponding to the set of sample speech signals, and a sample initial enhanced speech signal corresponding to the set of sample speech signals; the spectrum information of each sample voice signal in the sample voice signal set, the spectrum information of the sample reference signal and the spectrum information of the sample initial enhancement voice signal are input into a first voice enhancement model to be trained, and sample target spectrum information is obtained; inputting a sample target amplitude spectrum in sample target frequency spectrum information into a second voice enhancement model to be trained to obtain predicted voice masking information and predicted interference masking information; according to the predicted voice masking information, carrying out conversion processing on the spectrum information of the sample target voice signal in the sample voice signal set to obtain a predicted enhanced voice signal corresponding to the sample voice signal set, and carrying out conversion processing on the spectrum information of the sample target voice signal according to the predicted interference masking information to obtain a predicted interference voice signal corresponding to the sample voice signal set; and carrying out joint training on the first voice enhancement model to be trained and the second voice enhancement model to be trained according to the difference between the predicted enhanced voice signal and the clean voice signal corresponding to the sample voice signal set and the difference between the predicted interference voice signal and the interference voice signal corresponding to the sample voice signal set to obtain a first voice enhancement model after training and a second voice enhancement model after training.
The specific manner in which the various modules perform their operations in the apparatus of the above embodiments has been described in detail in the embodiments of the method and will not be elaborated here.
The various modules in the above voice signal enhancement apparatus may be implemented in whole or in part by software, hardware, or a combination thereof. Each of the above modules may be embedded in hardware form in, or be independent of, a processor in the computer device, or may be stored in software form in a memory of the computer device, so that the processor can invoke and execute the operations corresponding to each module.
Fig. 8 is a block diagram illustrating an electronic device 800 for implementing a speech signal enhancement method according to an exemplary embodiment. For example, the electronic device 800 may be a mobile phone, computer, digital broadcast terminal, messaging device, game console, tablet device, medical device, exercise device, personal digital assistant, or the like.
Referring to fig. 8, an electronic device 800 may include one or more of the following components: a processing component 802, a memory 804, a power component 806, a multimedia component 808, an audio component 810, an input/output (I/O) interface 812, a sensor component 814, and a communication component 816.
The processing component 802 generally controls overall operation of the electronic device 800, such as operations associated with display, telephone calls, data communications, camera operations, and recording operations. The processing component 802 may include one or more processors 820 to execute instructions to perform all or part of the steps of the methods described above. Further, the processing component 802 can include one or more modules that facilitate interactions between the processing component 802 and other components. For example, the processing component 802 can include a multimedia module to facilitate interaction between the multimedia component 808 and the processing component 802.
The memory 804 is configured to store various types of data to support operations at the electronic device 800. Examples of such data include instructions for any application or method operating on the electronic device 800, contact data, phonebook data, messages, pictures, video, and so forth. The memory 804 may be implemented by any type of volatile or non-volatile memory device or combination thereof, such as Static Random Access Memory (SRAM), electrically erasable programmable read-only memory (EEPROM), erasable programmable read-only memory (EPROM), programmable read-only memory (PROM), read-only memory (ROM), magnetic memory, flash memory, magnetic disk, optical disk, or graphene memory.
The power supply component 806 provides power to the various components of the electronic device 800. The power components 806 may include a power management system, one or more power sources, and other components associated with generating, managing, and distributing power for the electronic device 800.
The multimedia component 808 includes a screen that provides an output interface between the electronic device 800 and the user. In some embodiments, the screen may include a Liquid Crystal Display (LCD) and a Touch Panel (TP). If the screen includes a touch panel, the screen may be implemented as a touch screen to receive input signals from the user. The touch panel includes one or more touch sensors to sense touches, swipes, and gestures on the touch panel. The touch sensors may sense not only the boundary of a touch or swipe action, but also the duration and pressure associated with the touch or swipe operation. In some embodiments, the multimedia component 808 includes a front camera and/or a rear camera. When the electronic device 800 is in an operational mode, such as a shooting mode or a video mode, the front camera and/or the rear camera may receive external multimedia data. Each front and rear camera may be a fixed optical lens system or have focal length and optical zoom capability.
The audio component 810 is configured to output and/or input audio signals. For example, the audio component 810 includes a Microphone (MIC) configured to receive external audio signals when the electronic device 800 is in an operational mode, such as a call mode, a recording mode, and a voice recognition mode. The received audio signals may be further stored in the memory 804 or transmitted via the communication component 816. In some embodiments, the audio component 810 further includes a speaker for outputting audio signals.
The I/O interface 812 provides an interface between the processing component 802 and peripheral interface modules, which may be a keyboard, click wheel, buttons, etc. These buttons may include, but are not limited to: homepage button, volume button, start button, and lock button.
The sensor assembly 814 includes one or more sensors for providing status assessments of various aspects of the electronic device 800. For example, the sensor assembly 814 may detect an on/off state of the electronic device 800 and the relative positioning of components such as the display and keypad of the electronic device 800. The sensor assembly 814 may also detect a change in position of the electronic device 800 or of a component of the electronic device 800, the presence or absence of user contact with the electronic device 800, the orientation or acceleration/deceleration of the electronic device 800, and a change in temperature of the electronic device 800. The sensor assembly 814 may include a proximity sensor configured to detect the presence of nearby objects without any physical contact. The sensor assembly 814 may also include a light sensor, such as a CMOS or CCD image sensor, for use in imaging applications. In some embodiments, the sensor assembly 814 may also include an acceleration sensor, a gyroscope sensor, a magnetic sensor, a pressure sensor, or a temperature sensor.
The communication component 816 is configured to facilitate communication between the electronic device 800 and other devices, either wired or wireless. The electronic device 800 may access a wireless network based on a communication standard, such as WiFi, an operator network (e.g., 2G, 3G, 4G, or 5G), or a combination thereof. In one exemplary embodiment, the communication component 816 receives broadcast signals or broadcast related information from an external broadcast management system via a broadcast channel. In one exemplary embodiment, the communication component 816 further includes a Near Field Communication (NFC) module to facilitate short range communications. For example, the NFC module may be implemented based on Radio Frequency Identification (RFID) technology, infrared data association (IrDA) technology, ultra Wideband (UWB) technology, bluetooth (BT) technology, and other technologies.
In an exemplary embodiment, the electronic device 800 may be implemented by one or more Application Specific Integrated Circuits (ASICs), Digital Signal Processors (DSPs), Digital Signal Processing Devices (DSPDs), Programmable Logic Devices (PLDs), Field Programmable Gate Arrays (FPGAs), controllers, microcontrollers, microprocessors, or other electronic elements, for executing the methods described above.
In an exemplary embodiment, a computer-readable storage medium is also provided, such as the memory 804 including instructions executable by the processor 820 of the electronic device 800 to perform the above-described method. For example, the computer-readable storage medium may be a ROM, a Random Access Memory (RAM), a CD-ROM, a magnetic tape, a floppy disk, an optical data storage device, or the like.
In an exemplary embodiment, a computer program product is also provided, comprising instructions executable by the processor 820 of the electronic device 800 to perform the above-described method.
It should be noted that the descriptions of the foregoing apparatus, the electronic device, the computer readable storage medium, the computer program product, and the like according to the method embodiments may further include other implementations, and the specific implementation may refer to the descriptions of the related method embodiments and are not described herein in detail.
Other embodiments of the disclosure will be apparent to those skilled in the art from consideration of the specification and practice of the disclosure disclosed herein. This disclosure is intended to cover any variations, uses, or adaptations of the disclosure following its general principles and including such departures from the present disclosure as come within known or customary practice in the art to which the disclosure pertains. It is intended that the specification and examples be considered as exemplary only, with the true scope and spirit of the disclosure being indicated by the following claims.
It is to be understood that the present disclosure is not limited to the precise arrangements and instrumentalities shown in the drawings, and that various modifications and changes may be effected without departing from the scope thereof. The scope of the present disclosure is limited only by the appended claims.

Claims (10)

1. A method of speech signal enhancement, comprising:
acquiring a voice signal set, a reference signal corresponding to the voice signal set and an initial enhanced voice signal corresponding to the voice signal set;
inputting the spectrum information of each voice signal in the voice signal set, the spectrum information of the reference signal and the spectrum information of the initial enhanced voice signal into a first voice enhancement model after training to obtain target spectrum information; the number of the target spectrum information is smaller than the number of spectrum information input to the first speech enhancement model;
inputting a target amplitude spectrum in the target frequency spectrum information into a trained second voice enhancement model to obtain voice masking information;
and according to the voice masking information, carrying out conversion processing on the spectrum information of the target voice signals in the voice signal set to obtain target enhanced voice signals corresponding to the voice signal set.
2. The method according to claim 1, wherein said inputting the target magnitude spectrum in the target spectrum information into the trained second speech enhancement model to obtain the speech masking information comprises:
inputting the target amplitude spectrum into a trained second voice enhancement model for first feature extraction processing to obtain initial audio features of the target amplitude spectrum;
performing second feature extraction processing on the initial audio features to obtain target audio features of the target amplitude spectrum;
and classifying the target audio features to obtain the voice masking information.
3. The method of claim 2, wherein the trained second speech enhancement model comprises a speech branching network and an interfering branching network;
the step of inputting the target amplitude spectrum into a trained second voice enhancement model for first feature extraction processing to obtain initial audio features of the target amplitude spectrum comprises the following steps:
inputting the target amplitude spectrum into the voice branch network for feature extraction processing to obtain a first audio feature of the target amplitude spectrum, and inputting the target amplitude spectrum into the interference branch network for feature extraction processing to obtain a second audio feature of the target amplitude spectrum;
performing fusion processing on the first audio feature and the second audio feature to obtain a first fusion audio feature;
and inputting the first fusion audio features into the voice branch network to perform feature extraction processing to obtain initial audio features of the target amplitude spectrum.
4. A method according to claim 3, wherein said performing a second feature extraction process on said initial audio features to obtain target audio features of said target amplitude spectrum comprises:
inputting the initial audio features into the voice branch network for feature extraction processing to obtain a third audio feature of the target amplitude spectrum, and inputting the first fused audio features into the interference branch network for feature extraction processing to obtain a fourth audio feature of the target amplitude spectrum;
performing fusion processing on the third audio feature and the fourth audio feature to obtain a second fusion audio feature;
and inputting the second fusion audio features into the voice branch network for feature extraction processing to obtain target audio features of the target amplitude spectrum.
5. The method of claim 1, wherein the transforming the spectrum information of the target speech signal in the speech signal set according to the speech masking information to obtain the target enhanced speech signal corresponding to the speech signal set includes:
performing fusion processing on the spectrum information of the target voice signal in the voice signal set and the voice masking information to obtain fused spectrum information of the target voice signal;
and carrying out transformation processing on the fused spectrum information to obtain a target enhanced voice signal corresponding to the voice signal set.
6. The method of claim 1, wherein the initial enhanced speech signal is obtained by:
inputting each voice signal and the reference signal in the voice signal set into a trained third voice enhancement model to obtain the initial enhanced voice signal;
the target amplitude spectrum in the target spectrum information is obtained by the following steps:
extracting an initial amplitude spectrum in the target spectrum information;
and converting the initial amplitude spectrum to obtain the target amplitude spectrum.
7. The method according to any of claims 1 to 6, wherein the trained first speech enhancement model and the trained second speech enhancement model are trained by:
acquiring a sample voice signal set, a sample reference signal corresponding to the sample voice signal set and a sample initial enhanced voice signal corresponding to the sample voice signal set;
inputting the spectrum information of each sample voice signal in the sample voice signal set, the spectrum information of the sample reference signal, and the spectrum information of the sample initial enhanced voice signal into a first voice enhancement model to be trained to obtain sample target spectrum information;
inputting a sample target amplitude spectrum in the sample target frequency spectrum information into a second voice enhancement model to be trained to obtain predicted voice masking information and predicted interference masking information;
according to the predicted voice masking information, carrying out conversion processing on the spectrum information of the sample target voice signal in the sample voice signal set to obtain a predicted enhanced voice signal corresponding to the sample voice signal set, and according to the predicted interference masking information, carrying out conversion processing on the spectrum information of the sample target voice signal to obtain a predicted interference voice signal corresponding to the sample voice signal set;
and performing joint training on the first voice enhancement model to be trained and the second voice enhancement model to be trained according to the difference between the predicted enhanced voice signal and the clean voice signal corresponding to the sample voice signal set and the difference between the predicted interference voice signal and the interference voice signal corresponding to the sample voice signal set to obtain a first voice enhancement model after training and a second voice enhancement model after training.
8. A speech signal enhancement apparatus, comprising:
a signal acquisition unit configured to perform acquisition of a set of speech signals, a reference signal corresponding to the set of speech signals, and an initial enhanced speech signal corresponding to the set of speech signals;
a first enhancement unit configured to perform inputting the spectral information of each speech signal in the speech signal set, the spectral information of the reference signal, and the spectral information of the initial enhanced speech signal into a trained first speech enhancement model to obtain target spectral information; the number of the target spectrum information is smaller than the number of spectrum information input to the first speech enhancement model;
a second enhancement unit configured to perform inputting a target amplitude spectrum in the target spectrum information into a trained second speech enhancement model to obtain speech masking information;
and the transformation processing unit is configured to perform transformation processing on the spectrum information of the target voice signals in the voice signal set according to the voice masking information to obtain target enhanced voice signals corresponding to the voice signal set.
9. An electronic device, comprising:
a processor;
a memory for storing the processor-executable instructions;
wherein the processor is configured to execute the instructions to implement the speech signal enhancement method of any one of claims 1 to 7.
10. A computer readable storage medium, characterized in that instructions in the computer readable storage medium, when executed by a processor of an electronic device, enable the electronic device to perform the speech signal enhancement method according to any one of claims 1 to 7.