CN110931038B

CN110931038B - Voice enhancement method, device, equipment and storage medium

Info

Publication number: CN110931038B
Application number: CN201911166175.9A
Authority: CN
Inventors: 李明子; 马峰
Original assignee: Xi'an Xunfei Super Brain Information Technology Co ltd
Current assignee: Xi'an Xunfei Super Brain Information Technology Co ltd
Priority date: 2019-11-25
Filing date: 2019-11-25
Publication date: 2022-08-16
Anticipated expiration: 2039-11-25
Also published as: CN110931038A

Abstract

The application provides a voice enhancement method, a device, equipment and a storage medium, wherein the method comprises the following steps: determining vocoder parameters of effective information in the target voice by using the frequency spectrum of the target voice, wherein the target voice at least comprises low signal-to-noise ratio voice information, and the low signal-to-noise ratio voice information is voice information of which the signal-to-noise ratio is smaller than a first preset value; synthesizing the frequency spectrum of the enhanced voice of the voice information with low signal-to-noise ratio by using the vocoder parameters and the vocoder; and generating the enhanced voice of the target voice according to the frequency spectrum of the enhanced voice of the low signal-to-noise ratio voice information. The method for enhancing the voice of the target voice can synthesize the frequency spectrum of the enhanced voice of the voice information with the low signal-to-noise ratio by utilizing the vocoder parameters of the effective information in the target voice, and then the enhanced voice of the target voice can be generated according to the frequency spectrum of the enhanced voice of the voice information with the low signal-to-noise ratio.

Description

Voice enhancement method, device, equipment and storage medium

Technical Field

The present application relates to the field of speech processing technologies, and in particular, to a speech enhancement method, apparatus, device, and storage medium.

Background

Speech enhancement is a technique for extracting a useful speech signal from a noisy background, so as to suppress and reduce noise interference when the speech signal is interfered with, or even submerged, by various noises. Current speech enhancement schemes mainly multiply the speech signal to be enhanced by a noise reduction gain to obtain an enhanced speech signal.

However, in some scenarios, the signal-to-noise ratio of at least part of the speech information in the speech signal is low. For example, in a vehicle-mounted application scenario, noise is mainly concentrated in the low frequencies of the speech signal, that is, the signal-to-noise ratio of the speech information in the low-frequency part is low. If such a signal is enhanced with the existing speech enhancement scheme, the speech distortion is very large and the listening quality is seriously affected.

Disclosure of Invention

In view of this, the present application provides a method, an apparatus, a device and a storage medium for speech enhancement, so as to solve the problem that the speech distortion is very large and the hearing is seriously affected when the existing speech enhancement scheme is adopted to enhance the speech signal containing the speech information with low signal-to-noise ratio, and the technical scheme is as follows:

a method of speech enhancement comprising:

determining vocoder parameters of effective information in target voice by using a frequency spectrum of the target voice, wherein the target voice at least comprises low signal-to-noise ratio voice information, and the low signal-to-noise ratio voice information is voice information of which the signal-to-noise ratio is smaller than a first preset value;

synthesizing the spectrum of the enhanced speech of the low signal-to-noise ratio speech information by using the vocoder parameters and a vocoder;

and generating the enhanced voice of the target voice according to the frequency spectrum of the enhanced voice of the low signal-to-noise ratio voice information.

Optionally, the determining vocoder parameters of valid information in the target speech by using the frequency spectrum of the target speech includes:

predicting vocoder parameters of effective information in the target voice by using the frequency spectrum of the target voice and a pre-established vocoder parameter model;

the vocoder parameter model is obtained by training the frequency spectrum of a training voice containing noise and the vocoder parameters of a noiseless voice corresponding to the training voice.

Optionally, the target voice includes low signal-to-noise ratio voice information and high signal-to-noise ratio voice information, where the high signal-to-noise ratio voice information is voice information whose signal-to-noise ratio is greater than a second preset value, and the first preset value is less than or equal to the second preset value;

said synthesizing the spectrum of enhanced speech of said low signal-to-noise ratio speech information using said vocoder parameters and vocoder comprises:

synthesizing the frequency spectrum of the effective information in the target voice by using the vocoder parameters and a vocoder;

and extracting the frequency spectrum of the enhanced voice of the low signal-to-noise ratio voice information from the frequency spectrum of the effective information in the target voice.

Optionally, the generating the enhanced speech of the target speech according to the spectrum of the enhanced speech of the low snr speech information includes:

determining the frequency spectrum of the enhanced voice of the high signal-to-noise ratio voice information;

and generating the enhanced voice of the target voice by utilizing the frequency spectrum of the enhanced voice of the low signal-to-noise ratio voice information and the frequency spectrum of the enhanced voice of the high signal-to-noise ratio voice information.

Optionally, the determining the spectrum of the enhanced speech of the high snr speech information includes:

predicting an ideal ratio mask (IRM) by utilizing the frequency spectrum of the high signal-to-noise ratio voice information and a pre-established IRM model, wherein the IRM model is obtained by training with the frequency spectrum of the high signal-to-noise ratio voice information in the training voice containing noise and a real IRM corresponding to the training voice;

and determining the frequency spectrum of the enhanced voice of the high signal-to-noise ratio voice information through the frequency spectrum of the high signal-to-noise ratio voice information and the predicted IRM.

Optionally, the generating the enhanced speech of the target speech by using the spectrum of the enhanced speech of the low snr speech information and the spectrum of the enhanced speech of the high snr speech information includes:

splicing the frequency spectrum of the enhanced voice of the low signal-to-noise ratio voice information with the frequency spectrum of the enhanced voice of the high signal-to-noise ratio voice information to obtain a spliced frequency spectrum;

performing envelope compensation on the spliced frequency spectrum to obtain an envelope-compensated frequency spectrum;

and generating the enhanced voice of the target voice through the frequency spectrum after the envelope compensation.

Optionally, the performing envelope compensation on the spliced spectrum includes:

determining a compensation envelope of the spliced frequency spectrum by utilizing a pre-established envelope compensation model, wherein the envelope compensation model is obtained by adopting a frequency spectrum spliced by a frequency spectrum of enhanced voice of high signal-to-noise ratio voice information in training voice and a frequency spectrum of enhanced voice of low signal-to-noise ratio voice information in the training voice and a frequency spectrum of noiseless voice corresponding to the training voice;

and carrying out envelope compensation on the spliced frequency spectrum by using the compensation envelope.

Optionally, the low signal-to-noise ratio speech information in the target speech is speech information of which the frequency in the target speech is smaller than a preset frequency threshold, and the high signal-to-noise ratio speech information in the target speech is speech information of which the frequency in the target speech is greater than or equal to the preset frequency threshold.

A method of speech enhancement comprising:

acquiring a target voice;

determining the signal-to-noise ratio condition of the target voice;

if the target voice at least comprises low signal-to-noise ratio voice information, enhancing the target voice by adopting any one of the voice enhancement methods;

and if the target voice only contains high signal-to-noise ratio voice information, determining an ideal ratio mask (IRM) by using the frequency spectrum of the target voice, and enhancing the target voice by using the IRM.

A speech enhancement device comprising: a vocoder parameter determining module, a low signal-to-noise ratio voice information enhancing module and an enhanced voice determining module;

the vocoder parameter determining module is used for determining vocoder parameters of effective information in target voice by using the frequency spectrum of the target voice, wherein the target voice at least comprises low signal-to-noise ratio voice information, and the low signal-to-noise ratio voice information is voice information of which the signal-to-noise ratio is smaller than a first preset value;

the low signal-to-noise ratio voice information enhancement module is used for synthesizing the frequency spectrum of the enhanced voice of the low signal-to-noise ratio voice information by utilizing the vocoder parameters and the vocoder;

and the enhanced voice determining module is used for generating the enhanced voice of the target voice according to the frequency spectrum of the enhanced voice of the low signal-to-noise ratio voice information.

Optionally, the enhanced speech determination module includes: a high signal-to-noise ratio voice information enhancement module and an enhanced voice generation module;

the high signal-to-noise ratio voice information enhancement module is used for determining the frequency spectrum of the enhanced voice of the high signal-to-noise ratio voice information;

and the enhanced voice generation module is used for generating the enhanced voice of the target voice by utilizing the frequency spectrum of the enhanced voice of the low signal-to-noise ratio voice information and the frequency spectrum of the enhanced voice of the high signal-to-noise ratio voice information.

Optionally, the enhanced speech generating module includes: a frequency spectrum splicing submodule, an envelope compensation submodule and an enhanced voice generating submodule;

the frequency spectrum splicing submodule is used for splicing the frequency spectrum of the enhanced voice of the low signal-to-noise ratio voice information with the frequency spectrum of the enhanced voice of the high signal-to-noise ratio voice information to obtain a spliced frequency spectrum;

the envelope compensation submodule is used for carrying out envelope compensation on the spliced frequency spectrum to obtain an envelope compensated frequency spectrum;

and the enhanced voice generation sub-module is used for generating the enhanced voice of the target voice through the frequency spectrum after the envelope compensation.

A speech enhancement system comprising: a voice acquisition module, a signal-to-noise ratio condition determination module, a high signal-to-noise ratio voice enhancement module, and any one of the voice enhancement devices;

the voice acquisition module is used for acquiring target voice;

the signal-to-noise ratio condition determining module is used for determining the signal-to-noise ratio condition of the target voice;

the voice enhancement device is used for enhancing the target voice when the target voice at least comprises low signal-to-noise ratio voice information;

and the high signal-to-noise ratio voice enhancement module is used for determining an ideal ratio mask (IRM) by using the frequency spectrum of the target voice when the target voice contains only high signal-to-noise ratio voice information, and enhancing the target voice by using the IRM.

A speech enhancement device comprising: a memory and a processor;

the memory is used for storing programs;

the processor is configured to execute the program to implement the steps of the speech enhancement method described in any one of the above.

A readable storage medium, on which a computer program is stored, which, when being executed by a processor, carries out the steps of the speech enhancement method of any of the preceding claims.

For a target voice containing low signal-to-noise ratio voice information, the useful information in the low signal-to-noise ratio voice information is weak; after it is multiplied by a noise reduction gain, that part of the information is almost lost, so the enhanced voice is severely distorted. Because the present application can determine the vocoder parameters of the effective information in the target voice, it can synthesize the frequency spectrum of the enhanced voice of the low signal-to-noise ratio voice information from those vocoder parameters. Generating the enhanced voice from this synthesized frequency spectrum greatly reduces the distortion of the low signal-to-noise ratio voice information, which improves the listening quality of the enhanced voice and gives a better user experience.

Drawings

In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below, it is obvious that the drawings in the following description are only embodiments of the present invention, and for those skilled in the art, other drawings can be obtained according to the provided drawings without creative efforts.

Fig. 1 is a schematic flowchart of a speech enhancement method according to an embodiment of the present application;

Fig. 2 is a schematic flowchart of establishing a vocoder parameter model according to an embodiment of the present application;

Fig. 3 is a schematic flowchart of establishing an IRM model according to an embodiment of the present application;

Fig. 4 is a schematic flowchart of generating an enhanced speech of a target speech according to a spectrum of the enhanced speech of the low SNR speech information and a spectrum of the enhanced speech of the high SNR speech information according to an embodiment of the present application;

Fig. 5 is a schematic flowchart of establishing an envelope compensation model according to an embodiment of the present application;

Fig. 6 is a schematic flowchart of speech enhancement performed on target speech in a vehicle-mounted application scene according to an embodiment of the present application;

Fig. 7 is a schematic structural diagram of a speech enhancement apparatus according to an embodiment of the present application;

Fig. 8 is a schematic structural diagram of a speech enhancement system according to an embodiment of the present application;

Fig. 9 is a schematic structural diagram of a speech enhancement device according to an embodiment of the present application.

Detailed Description

The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.

For a voice signal containing low signal-to-noise ratio voice information, if the voice signal is enhanced by using the existing voice enhancement method, the low signal-to-noise ratio voice information is lost, and further, the voice distortion is serious, so that the hearing is influenced, and the user experience is poor.

In view of the foregoing problems, the inventors conducted intensive studies and finally provide a speech enhancement method capable of improving the listening quality of enhanced speech. The method is applicable to any application scenario, such as a vehicle-mounted application scenario, in which speech enhancement needs to be performed on speech that at least includes low signal-to-noise ratio speech information. The method can be applied to a terminal device with data processing capability, such as a smart phone, a vehicle-mounted navigation device, or a smart speaker; the terminal device enhances speech containing noise and outputs the enhanced speech through an output element. The method can also be applied to a server; the server receives, through a network, speech containing noise provided by a terminal device, enhances the speech, and transmits the enhanced speech to the terminal device through the network, and the terminal device outputs the enhanced speech through the output element. The server may be a single server, a plurality of servers, or a server cluster, and the network may be, but is not limited to, a Local Area Network (LAN), a Wide Area Network (WAN), or the like.

The speech enhancement method provided by the present application is described next by the following embodiments.

Referring to fig. 1, a flow chart of a speech enhancement method provided in an embodiment of the present application is shown, where the method may include:

Step S101: vocoder parameters of the effective information in the target speech are determined using the spectrum of the target speech.

The target voice is a voice to be enhanced containing noise, and the vocoder parameter of the effective information in the target voice is a vocoder parameter of a noiseless voice corresponding to the target voice.

The target speech in this embodiment may be speech that at least includes low signal-to-noise ratio speech information in any application scenario. There are two cases in which the target speech at least includes low signal-to-noise ratio speech information. In the first case, the target speech only includes low signal-to-noise ratio speech information, that is, the noise is distributed relatively uniformly and the signal-to-noise ratio is low throughout. In the second case, the target speech includes high signal-to-noise ratio speech information in addition to the low signal-to-noise ratio speech information. For example, in a vehicle-mounted application scenario, the signal-to-noise ratio of the low-frequency part of the speech signal is relatively low and that of the high-frequency part is relatively high, that is, the noise is distributed non-uniformly in the speech and is mainly concentrated in the low-frequency part.

In addition, it should be noted that the low signal-to-noise ratio voice information in this embodiment refers to voice information whose signal-to-noise ratio is smaller than a first preset value, and the high signal-to-noise ratio voice information refers to voice information whose signal-to-noise ratio is greater than a second preset value, where the first preset value is smaller than or equal to the second preset value; the low frequency part in this embodiment refers to a part with a frequency less than a preset frequency threshold, the high frequency part refers to a part with a frequency greater than or equal to a preset frequency threshold, and the first preset value, the second preset value, and the preset frequency threshold may be set according to a specific application scenario.

The foregoing mentions that the vocoder parameters of the valid information in the target voice are determined by using the spectrum of the target voice, and in particular, the process of determining the vocoder parameters of the valid information in the target voice by using the spectrum of the target voice may include: and predicting vocoder parameters of effective information in the target voice by using the frequency spectrum of the target voice and a pre-established vocoder parameter model.

The vocoder parameter model is obtained by training the vocoder parameters of the noise-free voice corresponding to the training voice and the spectrum of the noise-containing training voice. It should be noted that the training speech including noise is obtained by superimposing noise on the noiseless speech, and the noiseless speech corresponding to the training speech is the noiseless speech without superimposed noise.

In this embodiment, the frequency spectrum of the target speech may be obtained by transforming the target speech to a frequency domain, and specifically, performing framing, windowing, and fourier transform on the target speech to obtain a frequency domain signal of the target speech.
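The framing, windowing, and Fourier transform described above can be sketched as follows. This is a minimal illustration rather than the implementation of the embodiment: the function name `stft`, the frame length, the hop size, and the Hann window are all assumptions chosen for the example.

```python
import numpy as np

def stft(signal, frame_len=512, hop=256):
    """Frame, window, and Fourier-transform a 1-D signal.

    Returns a complex array of shape (num_frames, frame_len // 2 + 1).
    frame_len and hop are illustrative values, not taken from the source.
    """
    signal = np.asarray(signal, dtype=np.float64)
    if len(signal) < frame_len:
        # Zero-pad short inputs so that at least one full frame exists.
        signal = np.pad(signal, (0, frame_len - len(signal)))
    num_frames = 1 + (len(signal) - frame_len) // hop
    window = np.hanning(frame_len)  # Hann window (an assumed choice)
    # Framing and windowing: one row per frame.
    frames = np.stack([
        signal[i * hop : i * hop + frame_len] * window
        for i in range(num_frames)
    ])
    # Fourier transform of each frame (real input, one-sided spectrum).
    return np.fft.rfft(frames, axis=1)

# Usage: spectrum of one second of a 1 kHz tone sampled at 16 kHz.
sample_rate = 16000
t = np.arange(sample_rate) / sample_rate
spectrum = stft(np.sin(2 * np.pi * 1000 * t))
```

Each row of `spectrum` is the frequency-domain signal of one frame; its magnitude is what a model such as the vocoder parameter model would typically consume.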

Step S102: the spectrum of the enhanced speech of the low signal-to-noise ratio speech information is synthesized using the vocoder parameters and the vocoder.

The vocoder parameters are the parameters used by the vocoder to synthesize a speech spectrum, and may include, but are not limited to, a spectral envelope, a fundamental frequency, a voiced/unvoiced decision parameter, an aperiodic energy, and the like.

Because the vocoder parameter in step S102 is determined by using the vocoder parameter model, and the vocoder parameter model is obtained by using the spectrum of the training speech and the vocoder parameter of the noiseless speech corresponding to the training speech, the vocoder parameter in step S102 is the vocoder parameter of the effective information in the target speech, and the spectrum of the enhanced speech of the low snr speech information in the target speech can be determined according to the vocoder parameter of the effective information in the target speech.

Step S103: and generating the enhanced voice of the target voice according to the frequency spectrum of the enhanced voice of the low signal-to-noise ratio voice information.

For a target voice containing low signal-to-noise ratio voice information, part of the information is almost lost after the target voice is multiplied by a noise reduction gain, so the enhanced voice is severely distorted. To address this problem, the embodiment of the present application determines the vocoder parameters of the effective information in the target voice and synthesizes the spectrum of the enhanced voice of the low signal-to-noise ratio voice information from those vocoder parameters. Generating the enhanced voice from this synthesized spectrum greatly reduces the distortion of the low signal-to-noise ratio voice information, which improves the listening quality of the enhanced voice and gives a better user experience.

The above-mentioned embodiments mention that the vocoder parameters of the valid information in the target voice are determined using the spectrum of the target voice and a pre-established vocoder parameter model, and the process of establishing the vocoder parameter model is described below.

Referring to fig. 2, a flow chart for modeling vocoder parameters is shown, which may include:

Step S201: training speech is obtained from a training data set.

In this embodiment, each training speech in the training speech data set is obtained by superimposing at least one noise on the noise-free speech.

Taking a vehicle-mounted application scene as an example, wind noise, air conditioner noise, tire noise, engine noise and the like can be recorded in a vehicle-mounted environment in advance, and the noiseless speech is then superimposed with at least one type of noise at different signal-to-noise ratios. Optionally, the noiseless speech can be superimposed with the same type of noise at different signal-to-noise ratios to obtain a plurality of different training speeches, or the noiseless speech can be superimposed with different types of noise at the same or different signal-to-noise ratios to obtain a plurality of different training speeches.
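Superimposing noise on noiseless speech at a chosen signal-to-noise ratio can be sketched as follows. The function name `mix_at_snr`, the power-based SNR definition, and the tiling of short noise recordings are assumptions made for illustration; the source does not specify how the mixing is performed.

```python
import numpy as np

def mix_at_snr(clean, noise, snr_db):
    """Add noise to clean speech so that the mixture has the requested SNR in dB.

    Assumes both inputs are 1-D arrays and that the noise is not silent.
    """
    clean = np.asarray(clean, dtype=np.float64)
    noise = np.asarray(noise, dtype=np.float64)
    # Repeat or trim the noise so that it covers the whole utterance.
    reps = int(np.ceil(len(clean) / len(noise)))
    noise = np.tile(noise, reps)[: len(clean)]
    clean_power = np.mean(clean ** 2)
    noise_power = np.mean(noise ** 2)
    # Choose the scale so that 10*log10(clean_power / scaled_noise_power) == snr_db.
    scale = np.sqrt(clean_power / (noise_power * 10 ** (snr_db / 10)))
    return clean + scale * noise

# Usage: one clean utterance mixed with the same noise at several SNRs.
rng = np.random.default_rng(0)
clean = rng.standard_normal(16000)   # stand-in for noiseless speech
noise = rng.standard_normal(4000)    # stand-in for recorded in-vehicle noise
training_speeches = [mix_at_snr(clean, noise, snr) for snr in (-5, 0, 5, 10)]
```

Calling the function with different noise recordings or different SNR values yields the multiple training speeches per noiseless utterance that the paragraph above describes.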

Step S202: determining the frequency spectrum of the training voice, and determining the vocoder parameters of the noiseless voice corresponding to the training voice.

In this embodiment, the training speech may be subjected to framing, windowing, and Fourier transform to obtain a spectrum of the training speech; the vocoder parameters of the noiseless speech corresponding to the training speech may be determined using an existing parametric synthesis vocoder (e.g., STRAIGHT, TANDEM-STRAIGHT, WORLD, etc.).

Step S203: inputting the spectrum of the training voice into the vocoder parameter model to obtain the vocoder parameters predicted by the vocoder parameter model.

Step S204: and determining the prediction loss of the vocoder parameter model according to the vocoder parameters predicted by the vocoder parameter model and the vocoder parameters of the noiseless voice corresponding to the training voice.

Optionally, the mean squared error (MSE) between the vocoder parameters predicted by the vocoder parameter model and the vocoder parameters of the noiseless speech corresponding to the training speech may be used as the prediction loss of the vocoder parameter model, specifically:

$$\mathrm{Loss1} = \frac{1}{KN}\sum_{k=1}^{K}\sum_{n=1}^{N}\left(\eta(k,n) - \hat{\eta}(k,n)\right)^{2}$$

where $\eta(k,n)$ is a vocoder parameter of the noiseless speech corresponding to the training speech, $\hat{\eta}(k,n)$ is the vocoder parameter predicted by the vocoder parameter model, Loss1 is the prediction loss of the vocoder parameter model, k is the frequency point index, and n is the frame index, with K frequency points and N frames in total.
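The MSE prediction loss described above can be sketched numerically as follows. The function name `vocoder_param_mse` and the array layout (one row per frame, one column per frequency point) are assumptions for illustration; averaging over all elements is the usual meaning of a mean squared error.

```python
import numpy as np

def vocoder_param_mse(target, predicted):
    """Mean squared error between reference and predicted vocoder parameters.

    Both arrays are assumed to have the same shape, e.g. (frames, frequency points).
    """
    target = np.asarray(target, dtype=np.float64)
    predicted = np.asarray(predicted, dtype=np.float64)
    if target.shape != predicted.shape:
        raise ValueError("target and predicted must have the same shape")
    # Average the squared difference over every frame and frequency point.
    return float(np.mean((target - predicted) ** 2))

# Usage with small hypothetical parameter arrays (3 frames, 4 frequency points).
reference = np.zeros((3, 4))
prediction = np.ones((3, 4))
loss = vocoder_param_mse(reference, prediction)
```

A loss computed this way is what would be back-propagated to update the parameters of the vocoder parameter model.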

Step S205: parameters of the vocoder parametric model are updated based on the prediction losses of the vocoder parametric model.

Iterative training is repeated according to the above process until the prediction loss of the vocoder parameter model is 0; a prediction loss of 0 indicates that the vocoder parameter model has been trained.

Optionally, the vocoder parameter model in this embodiment may be a deep learning network model, for example, a two-layer long-short-term memory network LSTM.

After the vocoder parameter model is established, the vocoder parameters of the effective information in the target voice can be predicted by using the spectrum of the target voice and the established vocoder parameter model, and after the vocoder parameters of the effective information in the target voice are obtained, the above step S102 is executed.

In another embodiment of the present application, "step S102: the spectrum of the enhanced speech of the low signal-to-noise ratio speech information is synthesized using the vocoder parameters and the vocoder" is described.

In the above embodiment, it is mentioned that the target speech in this embodiment may only include the low snr speech information, or may include both the low snr speech information and the high snr speech information, and the specific implementation process of step S102 is described in the following cases:

a. target speech only containing low signal-to-noise ratio speech information

The process of synthesizing the spectrum of enhanced speech for low signal-to-noise ratio speech information using vocoder parameters and vocoder may comprise: the spectrum of the effective information in the target speech is synthesized using the vocoder parameters and the vocoder, and is used as the spectrum of the enhanced speech of the low signal-to-noise ratio speech information.

Specifically, vocoder parameters are input into the vocoder, and the frequency spectrum of the effective information in the target voice output by the vocoder is obtained. It should be noted that the spectrum output by the vocoder is the spectrum of the entire effective information band in the target voice, and since the target voice only contains the low snr voice information, the spectrum output by the vocoder can be used as the spectrum of the enhanced voice of the low snr voice information.

b. The target voice contains both low signal-to-noise ratio voice information and high signal-to-noise ratio voice information

The process of synthesizing the spectrum of enhanced speech for low signal-to-noise ratio speech information using vocoder parameters and vocoder may comprise:

Step b1, synthesizing the frequency spectrum of the effective information in the target voice by using the vocoder parameters and the vocoder.

Specifically, vocoder parameters are input into the vocoder to obtain the frequency spectrum of the effective information in the target voice output by the vocoder. It should be noted that the spectrum output by the vocoder is the spectrum of the entire frequency band of the effective information in the target voice.

Step b2, extracting the spectrum of the enhanced voice of the low signal-to-noise ratio voice information from the spectrum of the effective information in the target voice.

Since the spectrum output by the vocoder is the spectrum of the entire frequency band of the effective information in the target voice, and the target voice contains both the low snr voice information and the high snr voice information, this means that the spectrum output by the vocoder includes the spectrum of the enhanced voice of the low snr voice information and the spectrum of other voice information, and this step aims to obtain the spectrum of the enhanced voice of the low snr voice information, and therefore, the spectrum of the enhanced voice of the low snr voice information needs to be extracted from the spectrum output by the vocoder.
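The extraction step can be sketched as follows for the case in which the low signal-to-noise ratio speech information is the part below a preset frequency threshold, as in the vehicle-mounted example. The function name `split_bands`, the one-sided spectrum layout of shape (frames, bins), and the 1000 Hz threshold are assumptions for illustration only.

```python
import numpy as np

def split_bands(spectrum, sample_rate, freq_threshold_hz):
    """Split a one-sided spectrum into bins below and at-or-above a threshold.

    spectrum has shape (..., num_bins), with num_bins = n_fft // 2 + 1.
    """
    spectrum = np.asarray(spectrum)
    num_bins = spectrum.shape[-1]
    n_fft = 2 * (num_bins - 1)
    # Centre frequency of each bin in Hz.
    freqs = np.fft.rfftfreq(n_fft, d=1.0 / sample_rate)
    low = freqs < freq_threshold_hz
    return spectrum[..., low], spectrum[..., ~low]

# Usage: a stand-in for the full-band spectrum output by the vocoder.
rng = np.random.default_rng(0)
vocoder_spectrum = rng.standard_normal((10, 257))
low_band, high_band = split_bands(vocoder_spectrum, sample_rate=16000,
                                  freq_threshold_hz=1000.0)
```

Here `low_band` plays the role of the spectrum of the enhanced speech of the low signal-to-noise ratio speech information; the same split applied to the noisy input spectrum would give the high signal-to-noise ratio part.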

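After the spectrum of the enhanced speech of the low signal-to-noise ratio speech information is obtained, the above-mentioned "step S103: generating the enhanced speech of the target speech according to the spectrum of the enhanced speech of the low signal-to-noise ratio speech information" is executed. In another embodiment of the present application, a specific implementation process of step S103 is described.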

If the target voice only contains the low signal-to-noise ratio voice information, the frequency spectrum of the enhanced voice of the low signal-to-noise ratio voice information is the frequency spectrum of the enhanced voice of the target voice, and at the moment, the enhanced voice of the target voice can be generated directly according to the frequency spectrum of the enhanced voice of the low signal-to-noise ratio voice information.

If the target voice contains both the low signal-to-noise ratio voice information and the high signal-to-noise ratio voice information, the spectrum of the enhanced voice of the high signal-to-noise ratio voice information needs to be obtained in addition to the spectrum of the enhanced voice of the low signal-to-noise ratio voice information, and then the enhanced voice of the target voice is generated according to the spectrum of the enhanced voice of the low signal-to-noise ratio voice information and the spectrum of the enhanced voice of the high signal-to-noise ratio voice information.

Next, the process of determining the spectrum of the enhanced speech of the high signal-to-noise ratio speech information is described first.

As mentioned in the above embodiment, inputting the vocoder parameters of the valid information in the target speech into a vocoder yields the spectrum of the valid information over the entire frequency band. In one possible implementation, the part of the vocoder output spectrum that corresponds to the high signal-to-noise ratio speech information can be used as the spectrum of the enhanced speech of the high signal-to-noise ratio speech information. However, enhanced speech generated from a spectrum determined in this way is of poor quality, so this embodiment provides another, preferred implementation:

First, an IRM is predicted by using the spectrum of the high signal-to-noise ratio speech information in the target speech and a pre-established ideal ratio mask (IRM) model; then the spectrum of the enhanced speech of the high signal-to-noise ratio speech information is determined from the spectrum of the high signal-to-noise ratio speech information and the predicted IRM.

Optionally, a spectrum of the high signal-to-noise ratio speech information in the target speech in a log domain (i.e., log is calculated for the spectrum of the high signal-to-noise ratio speech information in the target speech) may be input into the pre-established IRM model to obtain the IRM predicted by the IRM model.

It should be noted that the purpose of taking the log of the spectrum of the high signal-to-noise ratio speech information in the target speech is to compress the dynamic range of the data and reduce the computational complexity. Of course, the spectrum of the high signal-to-noise ratio speech information in the target speech may also be input directly into the pre-established IRM model to obtain the IRM predicted by the IRM model.
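As a rough illustration of the log-domain input, the following sketch takes the natural log of a magnitude spectrum stored as a list of frames. The function name `log_spectrum`, the sample values, and the `floor` value that guards against log(0) are assumptions made for this example and are not taken from the application.

```python
import math

def log_spectrum(magnitudes, floor=1e-8):
    """Natural log of each magnitude; `floor` (an assumed value) avoids log(0)."""
    return [[math.log(max(m, floor)) for m in frame] for frame in magnitudes]

# Two frames with three frequency points each (made-up values).
frames = [[1.0, 10.0, 1000.0], [0.0, 0.5, 2.0]]
log_frames = log_spectrum(frames)

# Dynamic range of the first frame before and after taking the log.
linear_range = max(frames[0]) - min(frames[0])
log_range = max(log_frames[0]) - min(log_frames[0])
```

In this toy input the first frame spans a far smaller range after the log than before it, which is the compression effect the paragraph refers to.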

The IRM model in this embodiment is obtained by training with the spectrum of the high signal-to-noise ratio speech information in noisy training speech and the real IRM corresponding to the training speech.

The process of building the IRM model is described below.

Referring to fig. 3, a schematic flow chart of building an IRM model is shown, which may include:

Step S301: Training speech is obtained from a training data set.

Step S302: The spectrum of the high signal-to-noise ratio speech information in the training speech is determined, and the real IRM corresponding to the training speech is determined.

Specifically, the process of determining the frequency spectrum of high signal-to-noise ratio speech information in training speech includes: the frequency spectrum of the training speech is determined (framing, windowing and fourier transforming the training speech), and the frequency spectrum of the speech information with high signal-to-noise ratio is extracted from the frequency spectrum of the training speech.
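The framing, windowing and Fourier transform steps, followed by extraction of the high signal-to-noise ratio part, might be sketched as follows. This is a minimal sketch under stated assumptions: the Hann window, the naive DFT, the frame length, the hop size, and the idea that the high signal-to-noise ratio part is the set of frequency points at or above an index `k_th` are all choices made for this example. The names `stft` and `high_band` are hypothetical.

```python
import cmath
import math

def stft(signal, frame_len, hop):
    """Frame the signal, apply a Hann window, and take a one-sided DFT of each frame."""
    window = [0.5 - 0.5 * math.cos(2 * math.pi * i / frame_len) for i in range(frame_len)]
    n_bins = frame_len // 2 + 1
    spectrum = []
    for start in range(0, len(signal) - frame_len + 1, hop):
        frame = [signal[start + i] * window[i] for i in range(frame_len)]
        bins = [
            sum(frame[i] * cmath.exp(-2j * math.pi * k * i / frame_len)
                for i in range(frame_len))
            for k in range(n_bins)
        ]
        spectrum.append(bins)
    return spectrum

def high_band(spectrum, k_th):
    """Keep only the frequency points at or above index k_th (assumed high-SNR band)."""
    return [frame[k_th:] for frame in spectrum]

# Toy input: a sine that falls exactly on frequency point 4 of a 16-point DFT.
signal = [math.sin(2 * math.pi * 4 * t / 16) for t in range(64)]
spec = stft(signal, frame_len=16, hop=8)
high = high_band(spec, k_th=3)
```

A production system would normally use an FFT routine instead of the naive DFT shown here; the explicit sum is used only to keep the sketch free of dependencies.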

The real IRM corresponding to the training speech can be calculated by the following formula:

In the above formula, s(k, n) is the noiseless speech corresponding to the training speech, n(k, n) is the noise in the training speech, i.e., the noise superimposed on the noiseless speech, and IRM(k, n) is the real IRM corresponding to the training speech; in the parentheses, k is the frequency point index and n is the frame index.
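For illustration only, the sketch below computes a mask from clean-speech and noise magnitudes using one widely used form of the ideal ratio mask, (s² / (s² + n²)) raised to an exponent `beta`. This form, the default `beta=0.5`, the stabilizing `eps`, and the function name are assumptions of the example and may differ from the formula intended in this embodiment.

```python
def ideal_ratio_mask(clean_mag, noise_mag, beta=0.5, eps=1e-12):
    """Assumed IRM form: (s^2 / (s^2 + n^2)) ** beta per frequency point and frame."""
    mask = []
    for s_frame, n_frame in zip(clean_mag, noise_mag):
        mask.append([
            ((s * s) / (s * s + n * n + eps)) ** beta
            for s, n in zip(s_frame, n_frame)
        ])
    return mask

# One frame, three frequency points (made-up magnitudes).
real_irm = ideal_ratio_mask([[3.0, 1.0, 0.0]], [[4.0, 0.0, 1.0]])
```

Under this assumed form the mask lies between 0 and 1: it approaches 1 where the clean speech dominates and 0 where the noise dominates.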

Step S303: An IRM is predicted by using the spectrum of the high signal-to-noise ratio speech information in the training speech and the IRM model.

Specifically, the log-domain spectrum of the high signal-to-noise ratio speech information in the training speech can be input into the IRM model to obtain the IRM predicted by the IRM model.

Step S304: The prediction loss of the IRM model is determined according to the IRM predicted by the IRM model and the real IRM corresponding to the training speech.

Optionally, the MSE between the IRM predicted by the IRM model and the real IRM corresponding to the training speech may be used as the prediction loss of the IRM model, specifically:

Loss2 = ∑|IRM(k, n) - IRM'(k, n)| (3)

where IRM(k, n) is the real IRM corresponding to the training speech, IRM'(k, n) is the IRM predicted by the IRM model, Loss2 is the prediction loss of the IRM model, k is the frequency point index, and n is the frame index.
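A minimal sketch of the loss in formula (3), as written, is given below: it sums the absolute differences over all frequency points and frames. A squared-error variant would square each term instead. The function name and the sample masks are assumptions of the example.

```python
def irm_prediction_loss(real_irm, predicted_irm):
    """Sum of |IRM(k, n) - IRM'(k, n)| over all frequency points k and frames n."""
    return sum(
        abs(r - p)
        for real_frame, pred_frame in zip(real_irm, predicted_irm)
        for r, p in zip(real_frame, pred_frame)
    )

# Two frames, two frequency points each (made-up masks).
real = [[1.0, 0.5], [0.0, 0.25]]
predicted = [[0.5, 0.5], [0.5, 0.25]]
loss2 = irm_prediction_loss(real, predicted)
```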

Step S305: The parameters of the IRM model are updated according to the prediction loss of the IRM model.

The above process is iterated repeatedly until the prediction loss of the IRM model is 0; a prediction loss of 0 indicates that the IRM model is well trained.

Optionally, the IRM model in this embodiment may be a deep learning network model, for example, a two-layer long short-term memory (LSTM) network.

After the IRM model is established, the frequency spectrum of the high signal-to-noise ratio voice information in the target voice and the established IRM model can be used for predicting the IRM, and then the frequency spectrum of the high signal-to-noise ratio voice information is multiplied by the predicted IRM, so that the frequency spectrum of the enhanced voice of the high signal-to-noise ratio voice information in the target voice is obtained.
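The multiplication described above is element-wise over frequency points and frames. A minimal sketch, with a hypothetical function name and made-up values:

```python
def apply_irm(spectrum, irm):
    """Multiply each complex spectrum value by the predicted mask value at the same (k, n)."""
    return [
        [x * m for x, m in zip(x_frame, m_frame)]
        for x_frame, m_frame in zip(spectrum, irm)
    ]

# Two frames, two frequency points each (made-up values).
noisy = [[2 + 2j, 4 + 0j], [0 + 1j, 1 - 1j]]
predicted_irm = [[0.5, 1.0], [0.0, 0.25]]
enhanced = apply_irm(noisy, predicted_irm)
```

Because the mask is real-valued, this scales the magnitude of each frequency point and leaves its phase unchanged.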

After obtaining the spectrum of the enhanced voice of the low signal-to-noise ratio voice information and the spectrum of the enhanced voice of the high signal-to-noise ratio voice information, the enhanced voice of the target voice can be generated according to the spectrum of the enhanced voice of the low signal-to-noise ratio voice information and the spectrum of the enhanced voice of the high signal-to-noise ratio voice information.

Referring to fig. 4, a schematic flow chart of generating an enhanced speech of a target speech according to a spectrum of the enhanced speech of the low snr speech information and a spectrum of the enhanced speech of the high snr speech information is shown, which may include:

Step S401: The spectrum of the enhanced speech of the low signal-to-noise ratio speech information is spliced with the spectrum of the enhanced speech of the high signal-to-noise ratio speech information to obtain a spliced spectrum.
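The splicing in step S401 might be sketched as follows, assuming (as in the vehicle-mounted example later in this description) that the low signal-to-noise ratio part occupies the frequency points below an index `k_th` and the high signal-to-noise ratio part occupies the rest. The function name and the toy values are hypothetical.

```python
def splice_spectra(vocoder_spectrum, irm_enhanced_spectrum, k_th):
    """Per frame, take points below k_th from the vocoder branch and the rest from the IRM branch."""
    return [
        list(voc_frame[:k_th]) + list(irm_frame[k_th:])
        for voc_frame, irm_frame in zip(vocoder_spectrum, irm_enhanced_spectrum)
    ]

# One frame with four frequency points per branch (made-up values).
vocoder_out = [[1.0, 2.0, 3.0, 4.0]]
irm_out = [[10.0, 20.0, 30.0, 40.0]]
spliced = splice_spectra(vocoder_out, irm_out, k_th=2)
```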

The junction between the spectrum of the enhanced speech of the low signal-to-noise ratio speech information and the spectrum of the enhanced speech of the high signal-to-noise ratio speech information is abrupt and discontinuous, so generating speech directly from the spliced spectrum would degrade the listening experience. To solve this problem, this embodiment executes step S402.

Step S402: Envelope compensation is performed on the spliced spectrum to obtain an envelope-compensated spectrum.

Specifically, the process of performing envelope compensation on the spliced spectrum may include: determining the compensation envelope of the spliced spectrum by using a pre-established envelope compensation model, and performing envelope compensation on the spliced spectrum by using the determined compensation envelope.

The envelope compensation model is obtained by training with (i) the spectrum formed by splicing the spectrum of the enhanced speech of the high signal-to-noise ratio speech information in training speech with the spectrum of the enhanced speech of the low signal-to-noise ratio speech information in the training speech, and (ii) the spectrum of the noiseless speech corresponding to the training speech.

Referring to fig. 5, a schematic flow chart of building an envelope compensation model is shown, which may include:

Step S501: The spectrum of the enhanced speech of the low signal-to-noise ratio speech information in the training speech is spliced with the spectrum of the enhanced speech of the high signal-to-noise ratio speech information in the training speech to obtain a spliced spectrum.

Step S502: The spliced spectrum obtained in step S501 is input to the envelope compensation model, and the compensation envelope predicted by the envelope compensation model is obtained.

Step S503: The prediction loss of the envelope compensation model is determined according to the envelope of the spectrum of the noiseless speech corresponding to the training speech, the envelope of the spliced spectrum obtained in step S501, and the compensation envelope predicted by the envelope compensation model.

Specifically, the predicted loss of the envelope compensation model can be calculated by the following formula:

where γ'_Δ(k) is the compensation envelope predicted by the envelope compensation model, γ(k) is the envelope of the spectrum of the noiseless speech corresponding to the training speech, the remaining envelope term is the envelope of the spliced spectrum obtained in step S501, and Loss_bl is the prediction loss of the envelope compensation model.

Step S504: The parameters of the envelope compensation model are updated according to the prediction loss of the envelope compensation model.

The above process is iterated repeatedly until the prediction loss of the envelope compensation model is 0; a prediction loss of 0 indicates that the envelope compensation model is well trained.

Through the above process, the established envelope compensation model can be obtained, and then the compensation envelope of the spliced spectrum obtained in step S401 can be determined by using the envelope compensation model, and then the spliced spectrum is multiplied by the compensation envelope, thereby obtaining the envelope-compensated spectrum.
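The final multiplication can be sketched as below. The sketch assumes that the compensation envelope holds one value per frequency point and is applied to every frame; the function name and the toy values are hypothetical, and the envelope here is hand-written rather than predicted by a model.

```python
def compensate_envelope(spliced_spectrum, compensation_envelope):
    """Multiply every frame of the spliced spectrum by the per-frequency compensation envelope."""
    return [
        [x * g for x, g in zip(frame, compensation_envelope)]
        for frame in spliced_spectrum
    ]

# One frame with a jump between the second and third frequency points (made-up values).
spliced_spectrum = [[1.0, 2.0, 30.0, 40.0]]
envelope = [1.0, 1.0, 0.5, 0.5]
compensated = compensate_envelope(spliced_spectrum, envelope)
```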

Step S403: The enhanced speech of the target speech is generated from the envelope-compensated spectrum.

On the basis of the above embodiments, the following describes a speech enhancement method provided by the present application by using a specific example:

Assume that the target speech is speech in a vehicle-mounted application scenario, in which the signal-to-noise ratio of the low-frequency part of the speech signal (the part whose frequency is less than a preset frequency threshold) is relatively low and the signal-to-noise ratio of the high-frequency part (the part whose frequency is greater than or equal to the preset frequency threshold) is relatively high. Referring to fig. 6, a schematic flow chart of speech enhancement for the target speech in the vehicle-mounted application scenario is shown, which may include:

Step S601: Vocoder parameters of the valid information in the target speech are determined using the spectrum of the target speech.

Specifically, the vocoder parameters of the effective information in the target speech can be predicted by using the frequency spectrum of the target speech and a pre-established vocoder parameter model.

Step S602: the spectrum of the enhanced speech of the high frequency part in the target speech is determined, and the spectrum of the enhanced speech of the low frequency part in the target speech is synthesized using the vocoder parameters and the vocoder.

The process of determining the spectrum of the enhanced speech of the high frequency part in the target speech may include: predicting the IRM by utilizing the frequency spectrum of the high-frequency part in the target voice and a pre-established IRM model; the spectrum of the enhanced speech in the high frequency portion of the target speech is then determined from the spectrum of the high frequency portion of the target speech and the predicted IRM.

Synthesizing the spectrum of the enhanced speech of the low-frequency part in the target speech by using the vocoder parameters and the vocoder includes: inputting the vocoder parameters into the vocoder to obtain the spectrum of the valid information in the target speech output by the vocoder; and extracting the spectrum of the enhanced speech of the low-frequency part in the target speech from the spectrum of the valid information in the target speech.

Step S603: The enhanced speech of the target speech is generated according to the spectrum of the enhanced speech of the high-frequency part in the target speech and the spectrum of the enhanced speech of the low-frequency part in the target speech.

Specifically, the process of generating the enhanced speech of the target speech according to the spectrum of the enhanced speech of the high-frequency part in the target speech and the spectrum of the enhanced speech of the low-frequency part in the target speech may include:

Step S6031: The spectrum of the enhanced speech of the high-frequency part in the target speech is spliced with the spectrum of the enhanced speech of the low-frequency part in the target speech to obtain a spliced spectrum.

The spliced spectrum is defined piecewise over the frequency points: for k < f_th it equals synspeech(k, n), and for k ≥ f_th it equals X(k, n) × IRM(k, n),

where f_th is the preset frequency threshold, synspeech is the spectrum of the enhanced speech of the low-frequency part in the target speech, and X(k, n) × IRM(k, n) (k ≥ f_th) is the spectrum of the enhanced speech of the high-frequency part in the target speech. Splicing synspeech with X(k, n) × IRM yields the spliced spectrum.

Step S6032: Envelope compensation is performed on the spliced spectrum to obtain the envelope-compensated spectrum X'(k, n).

Specifically, the compensation envelope γ'_Δ(k) of the spliced spectrum is determined by using the pre-established envelope compensation model, and the envelope-compensated spectrum X'(k, n) is obtained by multiplying the spliced spectrum by the compensation envelope γ'_Δ(k).

Step S6033: The enhanced speech of the target speech is generated from the envelope-compensated spectrum X'(k, n).

Specifically, an inverse Fourier transform is performed on the envelope-compensated spectrum X'(k, n) to obtain the enhanced speech of the target speech.
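A minimal sketch of turning a framed spectrum back into a waveform is given below. The text only calls for an inverse Fourier transform; the one-sided spectrum layout, the naive inverse DFT, and the overlap-add of successive frames are assumptions of this example, as are the function names and toy values.

```python
import cmath
import math

def inverse_frame(bins, frame_len):
    """Inverse DFT of a one-sided spectrum (points 0..frame_len//2) into a real frame."""
    # Rebuild the negative-frequency points from conjugate symmetry.
    full = list(bins) + [bins[frame_len - k].conjugate() for k in range(len(bins), frame_len)]
    return [
        sum(full[k] * cmath.exp(2j * math.pi * k * i / frame_len)
            for k in range(frame_len)).real / frame_len
        for i in range(frame_len)
    ]

def overlap_add(spectrum, frame_len, hop):
    """Inverse-transform every frame and sum the frames at their hop offsets."""
    out = [0.0] * (hop * (len(spectrum) - 1) + frame_len)
    for n, bins in enumerate(spectrum):
        for i, sample in enumerate(inverse_frame(bins, frame_len)):
            out[n * hop + i] += sample
    return out

# Two frames that each contain only a constant (DC) component.
compensated_spectrum = [[4 + 0j, 0j, 0j], [4 + 0j, 0j, 0j]]
speech = overlap_add(compensated_spectrum, frame_len=4, hop=2)
```

Each toy frame transforms back to four samples of value 1, and the two frames overlap by two samples, so the middle of the output is the sum of both frames.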

Compared with speech enhancement methods in the prior art, the speech enhancement method provided by the embodiment of the application can greatly reduce the distortion of the low signal-to-noise ratio speech information, thereby improving the listening quality of the enhanced speech and providing a better user experience.

It should be noted that the speech enhancement method provided by the foregoing embodiment presupposes that the speech to be enhanced contains low signal-to-noise ratio speech information. In some cases, however, the signal-to-noise ratio of the speech to be enhanced is not known in advance. For this reason, this embodiment provides another speech enhancement method, which is suitable for enhancing speech whose signal-to-noise ratio is unknown. The method may include:

acquiring a target voice; determining the signal-to-noise ratio condition of the target voice; if the target voice at least comprises the voice information with low signal-to-noise ratio, the voice enhancement method provided by the embodiment is adopted to enhance the target voice; and if the target voice only contains high signal-to-noise ratio voice information, determining the IRM by using the frequency spectrum of the target voice, and enhancing the target voice by using the IRM.
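The branching described above can be sketched as a simple dispatch. The three callables are hypothetical placeholders for the signal-to-noise ratio check, the vocoder-based enhancement of the foregoing embodiment, and the IRM-only enhancement; the stubs below only tag their input so that the chosen branch is visible.

```python
def enhance_speech(target_speech, contains_low_snr, vocoder_enhance, irm_enhance):
    """Pick the enhancement path from the signal-to-noise ratio condition of the target speech."""
    if contains_low_snr(target_speech):
        # At least some low-SNR speech information: use the vocoder-based method.
        return vocoder_enhance(target_speech)
    # Only high-SNR speech information: enhance with the IRM alone.
    return irm_enhance(target_speech)

# Stub callables (assumptions for the example).
def tag_vocoder(speech):
    return "vocoder:" + speech

def tag_irm(speech):
    return "irm:" + speech

low_result = enhance_speech("utterance", lambda s: True, tag_vocoder, tag_irm)
high_result = enhance_speech("utterance", lambda s: False, tag_vocoder, tag_irm)
```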

The speech enhancement method provided in this embodiment uses a vocoder parameter model and an IRM model. In one possible implementation, the vocoder parameter model and the IRM model may be trained separately. In another possible implementation, in order to improve the speech enhancement effect, the two models may be trained jointly, where the loss function of the joint training is:

Loss = αLoss1 + βLoss2 (7)

When the vocoder parameter model and the IRM model are jointly trained, the spectrum of the training speech is input into the vocoder parameter model, and the prediction loss Loss1 of the vocoder parameter model is determined according to the vocoder parameters predicted by the vocoder parameter model and the vocoder parameters of the noiseless speech corresponding to the training speech. The log-domain spectrum of the high signal-to-noise ratio speech information in the training speech is input into the IRM model, and the prediction loss Loss2 of the IRM model is determined according to the IRM predicted by the IRM model and the real IRM corresponding to the training speech. If Loss1 is smaller than Loss2, α is set to 1 and β is set to 0, i.e., the parameters of the vocoder parameter model are updated based on Loss1; if Loss2 is smaller than Loss1, α is set to 0 and β is set to 1, i.e., the parameters of the IRM model are updated based on Loss2.
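The weight selection for formula (7) can be sketched as follows. The rule for the two unequal cases follows the text; the handling of a tie (both weights set to 1) is not specified in the text and is an assumption of this example, as is the function name.

```python
def joint_loss(loss1, loss2):
    """Return (Loss, alpha, beta) for Loss = alpha * Loss1 + beta * Loss2."""
    if loss1 < loss2:
        alpha, beta = 1.0, 0.0   # update the vocoder parameter model from Loss1
    elif loss2 < loss1:
        alpha, beta = 0.0, 1.0   # update the IRM model from Loss2
    else:
        alpha, beta = 1.0, 1.0   # tie: not specified in the text (assumption)
    return alpha * loss1 + beta * loss2, alpha, beta

vocoder_case = joint_loss(0.2, 0.5)
irm_case = joint_loss(0.5, 0.2)
```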

The voice enhancement method provided by the embodiment of the application can enhance the voices with different signal-to-noise ratios in different voice enhancement modes, so that the enhanced voices with good audibility can be obtained, and the user experience is good.

The following describes a speech enhancement device provided in an embodiment of the present application, and the speech enhancement device described below and the speech enhancement method described above may be referred to correspondingly.

Referring to fig. 7, a schematic structural diagram of a speech enhancement device 70 according to an embodiment of the present application is shown, where the speech enhancement device 70 may include: a vocoder parameter determination module 701, a low signal-to-noise ratio speech information enhancement module 702, and an enhanced speech determination module 703.

A vocoder parameter determining module 701, configured to determine vocoder parameters of valid information in the target voice by using a spectrum of the target voice.

The target voice at least comprises low signal-to-noise ratio voice information, and the low signal-to-noise ratio voice information is voice information with a signal-to-noise ratio smaller than a first preset value.

A low snr speech information enhancement module 702 for synthesizing a spectrum of enhanced speech of the low snr speech information using the vocoder parameters and vocoder.

An enhanced speech determining module 703, configured to generate an enhanced speech of the target speech according to a spectrum of the enhanced speech of the low snr speech information.

The speech enhancement device provided by the embodiment of the application determines the vocoder parameters of the valid information in the target speech, synthesizes the spectrum of the enhanced speech of the low signal-to-noise ratio speech information in the target speech from those vocoder parameters, and generates the enhanced speech from that spectrum. This greatly reduces the distortion of the low signal-to-noise ratio speech information, improves the listening quality of the enhanced speech, and provides a better user experience.

In one possible implementation, the vocoder parameter determining module 701 in the speech enhancement apparatus 70 provided in the above embodiment is specifically configured to predict the vocoder parameters of the valid information in the target speech by using the spectrum of the target speech and a pre-established model of the vocoder parameters.

In a possible implementation manner, the target voice in the embodiment includes low signal-to-noise ratio voice information and high signal-to-noise ratio voice information, where the high signal-to-noise ratio voice information is voice information whose signal-to-noise ratio is greater than a second preset value, and the first preset value is less than or equal to the second preset value;

the low snr speech information enhancement module 702 can include: a spectrum synthesis sub-module and a spectrum extraction sub-module.

And the frequency spectrum synthesis sub-module is used for synthesizing the frequency spectrum of the effective information in the target voice by utilizing the vocoder parameters and the vocoder.

And the frequency spectrum extraction submodule is used for extracting the frequency spectrum of the enhanced voice of the low signal-to-noise ratio voice information from the frequency spectrum of the effective information in the target voice.

In a possible implementation manner, the enhanced speech determination module 703 in the speech enhancement apparatus 70 provided in the above embodiment includes: a high signal-to-noise ratio speech information enhancement module and an enhanced speech generation module.

The high signal-to-noise ratio speech information enhancement module is used for determining the spectrum of the enhanced speech of the high signal-to-noise ratio speech information.

In a possible implementation manner, the high signal-to-noise ratio speech information enhancement module includes: an IRM prediction sub-module and a high signal-to-noise ratio speech information enhancement sub-module.

The IRM prediction sub-module is used for predicting an IRM by using the spectrum of the high signal-to-noise ratio speech information and a pre-established ideal ratio mask (IRM) model, wherein the IRM model is obtained by training with the spectrum of the high signal-to-noise ratio speech information in noisy training speech and the real IRM corresponding to the training speech.

The high signal-to-noise ratio speech information enhancement sub-module is used for determining the spectrum of the enhanced speech of the high signal-to-noise ratio speech information from the spectrum of the high signal-to-noise ratio speech information and the predicted IRM.

In a possible implementation manner, the enhanced speech generation module includes: a spectrum splicing sub-module, an envelope compensation sub-module and an enhanced speech generation sub-module.

And the frequency spectrum splicing submodule is used for splicing the frequency spectrum of the enhanced voice of the low signal-to-noise ratio voice information with the frequency spectrum of the enhanced voice of the high signal-to-noise ratio voice information to obtain a spliced frequency spectrum.

And the envelope compensation submodule is used for carrying out envelope compensation on the spliced frequency spectrum to obtain an envelope compensated frequency spectrum.

In a possible implementation manner, the envelope compensation submodule is specifically configured to determine a compensation envelope of the spliced spectrum by using a pre-established envelope compensation model, and perform envelope compensation on the spliced spectrum by using the compensation envelope.

The envelope compensation model is obtained by training with (i) the spectrum formed by splicing the spectrum of the enhanced speech of the high signal-to-noise ratio speech information in training speech with the spectrum of the enhanced speech of the low signal-to-noise ratio speech information in the training speech, and (ii) the spectrum of the noiseless speech corresponding to the training speech.

In a possible implementation manner, the low snr speech information in the target speech in the above embodiment is speech information of which the frequency in the target speech is smaller than a preset frequency threshold, and the high snr speech information in the target speech is speech information of which the frequency in the target speech is greater than or equal to the preset frequency threshold.

An embodiment of the present application further provides a speech enhancement system, please refer to fig. 8, which shows a schematic structural diagram of the speech enhancement system, and the speech enhancement system may include: a voice obtaining module 801, a signal-to-noise ratio condition determining module 802, a high signal-to-noise ratio voice enhancing module 803, and the voice enhancing apparatus 70 provided in the above embodiments.

A voice acquiring module 801, configured to acquire a target voice.

A signal-to-noise ratio condition determining module 802, configured to determine a signal-to-noise ratio condition of the target speech.

Speech enhancement means 70 for enhancing the target speech when at least low snr speech information is included in the target speech.

A high signal-to-noise ratio speech enhancement module 803, configured to, when the target speech contains only high signal-to-noise ratio speech information, determine an ideal ratio mask (IRM) by using the spectrum of the target speech, and enhance the target speech by using the IRM.

The voice enhancement system provided by the embodiment of the application can enhance the voice with different signal-to-noise ratios by adopting different voice enhancement modes, thereby obtaining the enhanced voice with better hearing and having better user experience.

An embodiment of the present application further provides a speech enhancement device, please refer to fig. 9, which shows a schematic structural diagram of the speech enhancement device, where the speech enhancement device may include: at least one processor 901, at least one communication interface 902, at least one memory 903 and at least one communication bus 904;

in the embodiment of the present application, the number of the processor 901, the communication interface 902, the memory 903, and the communication bus 904 is at least one, and the processor 901, the communication interface 902, and the memory 903 complete communication with each other through the communication bus 904;

processor 901 may be a central processing unit CPU, or an Application Specific Integrated Circuit (ASIC), or one or more Integrated circuits, etc. configured to implement an embodiment of the invention;

the memory 903 may include a high-speed RAM memory, a non-volatile memory (non-volatile memory), and the like, such as at least one disk memory;

wherein the memory stores a program and the processor can call the program stored in the memory, the program for:

Optionally, the detailed functions and extended functions of the program may be as described above.

Embodiments of the present application further provide a readable storage medium, where a program suitable for being executed by a processor may be stored, where the program is configured to:

Finally, it should also be noted that, herein, relational terms such as first and second, and the like may be used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising an … …" does not exclude the presence of other identical elements in a process, method, article, or apparatus that comprises the element.

The embodiments in the present description are described in a progressive manner, each embodiment focuses on differences from other embodiments, and the same and similar parts among the embodiments are referred to each other.

The previous description of the disclosed embodiments is provided to enable any person skilled in the art to make or use the present invention. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit or scope of the invention. Thus, the present invention is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

Claims

1. A method of speech enhancement, comprising:

when the target voice only contains low signal-to-noise ratio voice information, synthesizing the frequency spectrum of enhanced voice of the low signal-to-noise ratio voice information by utilizing the vocoder parameters and the vocoder;

when the target voice contains both low signal-to-noise ratio voice information and high signal-to-noise ratio voice information, synthesizing the frequency spectrum of effective information in the target voice by using the vocoder parameters and the vocoder, and extracting the frequency spectrum of enhanced voice of the low signal-to-noise ratio voice information from the frequency spectrum of the effective information in the target voice;

2. The method of claim 1, wherein the determining vocoder parameters of the valid information in the target speech using the spectrum of the target speech comprises:

3. The speech enhancement method according to claim 1, wherein the target speech comprises low signal-to-noise ratio speech information and high signal-to-noise ratio speech information, wherein the high signal-to-noise ratio speech information is speech information with a signal-to-noise ratio greater than a second preset value, and the first preset value is smaller than or equal to the second preset value;

4. The method of claim 3, wherein generating the enhanced speech of the target speech according to the spectrum of the enhanced speech of the low SNR speech information comprises:

5. The method of claim 4, wherein the generating the enhanced speech of the target speech using the spectrum of the enhanced speech of the low SNR speech information and the spectrum of the enhanced speech of the high SNR speech information comprises:

carrying out envelope compensation on the spliced frequency spectrum to obtain an envelope compensated frequency spectrum;

6. The speech enhancement method of claim 5, wherein the envelope compensating the spliced spectrum comprises:

7. A method of speech enhancement, comprising:

acquiring a target voice;

determining the signal-to-noise ratio condition of the target voice;

if the target voice at least comprises low signal-to-noise ratio voice information, enhancing the target voice by adopting the voice enhancement method according to any one of claims 1-6;

8. A speech enhancement device, comprising: a vocoder parameter determining module, a low signal-to-noise ratio voice information enhancing module and an enhanced voice determining module;

the low signal-to-noise ratio voice information enhancement module is used for synthesizing the frequency spectrum of the enhanced voice of the low signal-to-noise ratio voice information by utilizing the vocoder parameters and the vocoder when the target voice only contains the low signal-to-noise ratio voice information; when the target voice contains both low signal-to-noise ratio voice information and high signal-to-noise ratio voice information, synthesizing the frequency spectrum of effective information in the target voice by using the vocoder parameters and the vocoder, and extracting the frequency spectrum of enhanced voice of the low signal-to-noise ratio voice information from the frequency spectrum of the effective information in the target voice;

9. A speech enhancement device, comprising: a memory and a processor;

the memory is used for storing programs;

the processor is used for executing the program and realizing the steps of the voice enhancement method according to any one of claims 1-6.

10. A readable storage medium on which a computer program is stored which, when being executed by a processor, carries out the steps of a speech enhancement method according to any one of claims 1 to 6.