CN113409802B

CN113409802B - Method, device, equipment and storage medium for enhancing voice signal

Info

Publication number: CN113409802B
Application number: CN202011180004.4A
Authority: CN
Inventors: 鲍枫; 李岳鹏
Original assignee: Tencent Technology Shenzhen Co Ltd
Current assignee: Tencent Technology Shenzhen Co Ltd
Priority date: 2020-10-29
Filing date: 2020-10-29
Publication date: 2023-09-15
Anticipated expiration: 2040-10-29
Also published as: CN113409802A

Abstract

The application discloses a method, a device, equipment and a storage medium for enhancing a voice signal, and belongs to the technical field of audio and video. The method comprises the following steps: acquiring a target voice signal; performing enhancement processing on the target voice signal by adopting a reference voice enhancement mode to obtain a reference enhancement signal; determining a target voice enhancement mode according to the reference enhancement signal; and performing enhancement processing on the target voice signal by adopting the target voice enhancement mode. In the prior art, a fixed voice enhancement mode cannot treat different conditions of a voice signal differently. By contrast, the embodiment of the application takes the signal characteristics of the voice signal into account during enhancement, which helps to enhance the voice signal accurately and effectively and improves the enhancement effect of the voice signal.

Description

Method, device, equipment and storage medium for enhancing voice signal

Technical Field

The embodiment of the application relates to the technical field of audio and video, in particular to a method, a device, equipment and a storage medium for enhancing a voice signal.

Background

In the situations of work, life, entertainment and the like, people often acquire a large number of voice signals. For example, voice signals are involved in teleconferencing, video telephony, live singing, and the like.

Although voice signals are growing explosively and have become an indispensable component of people's work and life, the quality of voice signals from different sources and of various types is uneven, and most voice signals contain noise. A voice signal may be enhanced in order to suppress the noise in it and to enhance the useful signal in it. In the related art, the enhancement modes for a voice signal include a wideband enhancement mode and an ultra wideband enhancement mode. The wideband enhancement mode enhances low-frequency signals better and is generally used for enhancing voice signals with a bandwidth of 0 to 8KHz (kilohertz); the ultra wideband enhancement mode enhances high-frequency signals better and is generally used for enhancing voice signals with a bandwidth of 8 to 16KHz. Therefore, in the case where a voice signal includes both a low-frequency signal portion and a high-frequency signal portion, that is, the bandwidth of the voice signal is 0 to 16KHz, the enhancement method adopted by the related art is as follows: the wideband enhancement mode is applied to the low-frequency signal portion with a bandwidth of 0 to 8KHz, and the ultra wideband enhancement mode is applied to the high-frequency signal portion with a bandwidth of 8 to 16KHz.

However, suppose that the useful signal in the low-frequency signal portion of a voice signal is almost buried in noise. In that case it is difficult to distinguish the useful signal from the noise accurately and effectively. If the voice enhancement mode of the related art is adopted, that is, the wideband enhancement mode is applied directly to the low-frequency signal portion, noise is likely to be mistaken for a useful signal and enhanced. This defeats the purpose of suppressing the noise and enhancing the useful signal, and does not help to enhance the voice signal accurately and effectively.

Disclosure of Invention

The embodiment of the application provides a method, a device, equipment and a storage medium for enhancing a voice signal, which can accurately and effectively enhance the voice signal and improve the enhancement effect of the voice signal. The technical scheme is as follows:

In one aspect, an embodiment of the present application provides a method for enhancing a speech signal, where the method includes:

acquiring a target voice signal;

performing enhancement processing on the target voice signal by adopting a reference voice enhancement mode to obtain a reference enhancement signal;

determining a target voice enhancement mode according to the reference enhancement signal;

and carrying out enhancement processing on the target voice signal by adopting the target voice enhancement mode.

In another aspect, an embodiment of the present application provides an apparatus for enhancing a speech signal, where the apparatus includes:

the voice signal acquisition module is used for acquiring a target voice signal;

the reference signal determining module is used for enhancing the target voice signal by adopting a reference voice enhancement mode to obtain a reference enhanced signal;

the enhancement mode determining module is used for determining a target voice enhancement mode according to the reference enhancement signal;

and the voice signal enhancement module is used for enhancing the target voice signal by adopting the target voice enhancement mode.

In yet another aspect, an embodiment of the present application provides a computer device, where the computer device includes a processor and a memory, where at least one instruction, at least one section of program, a code set, or an instruction set is stored in the memory, where the at least one instruction, the at least one section of program, the code set, or the instruction set is loaded and executed by the processor to implement the method for enhancing a speech signal as described above.

In yet another aspect, an embodiment of the present application provides a computer readable storage medium, where at least one instruction, at least one program, a code set, or an instruction set is stored, where the at least one instruction, the at least one program, the code set, or the instruction set is loaded and executed by a processor to implement the method for enhancing a speech signal.

In yet another aspect, embodiments of the present application provide a computer program product or computer program comprising computer instructions stored in a computer-readable storage medium. The processor of the computer device reads the computer instructions from the computer-readable storage medium, and the processor executes the computer instructions, so that the computer device performs the above-described enhancement processing method of the voice signal.

The technical scheme provided by the embodiment of the application can have the following beneficial effects:

After the voice signal is obtained, the voice signal is first enhanced by adopting a reference voice enhancement mode to obtain a reference enhancement signal; the voice enhancement mode to be used in the actual enhancement processing is then determined based on the reference enhancement signal, and the voice signal is enhanced by adopting that determined mode. The reference enhancement signal can reflect the signal characteristics of the initially acquired voice signal, for example whether the voice signal is clearly dominated by noise, so that the voice enhancement mode actually adopted can be determined in a targeted manner according to the signal characteristics of the reference enhancement signal. In the prior art, a fixed voice enhancement mode cannot treat different conditions of a voice signal differently. By contrast, the embodiment of the application takes the signal characteristics of the voice signal into account during enhancement, which helps to enhance the voice signal accurately and effectively and improves the enhancement effect of the voice signal.

Drawings

In order to more clearly illustrate the technical solutions of the embodiments of the present application, the drawings required for the description of the embodiments will be briefly described below, and it is apparent that the drawings in the following description are only some embodiments of the present application, and other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.

FIG. 1 is a schematic diagram of an application scenario provided by an embodiment of the present application;

FIG. 2 is a flow chart of a method for enhancing processing of a speech signal according to one embodiment of the present application;

FIG. 3 is a schematic diagram of a method for enhancing processing of a speech signal according to an embodiment of the present application;

FIG. 4 is a schematic diagram of a speech enhancement effect provided by one embodiment of the present application;

FIG. 5 is a block diagram of an enhancement processing device for speech signals according to one embodiment of the present application;

FIG. 6 is a block diagram of an enhancement processing apparatus for speech signals according to another embodiment of the present application;

FIG. 7 is a block diagram of a computer device according to an embodiment of the present application.

Detailed Description

For the purpose of making the objects, technical solutions and advantages of the present application more apparent, the embodiments of the present application will be described in further detail with reference to the accompanying drawings.

The technical scheme provided by the embodiment of the application is suitable for any service scene with the requirement of enhancing the voice signal, such as a voice conference, a video conference, a voice recording, a video recording and the like.

Referring to fig. 1, a schematic diagram of an application scenario provided by an embodiment of the present application is shown. The application scenario can be implemented as a cloud video conference system, which is a video conference platform based on cloud technology.

Cloud technology (Cloud Technology) refers to a hosting technology that integrates hardware, software, network and other resources in a wide area network or a local area network to realize the calculation, storage, processing and sharing of data. Cloud technology is the general term for the network technology, information technology, integration technology, management platform technology, application technology and the like that are applied under the cloud computing business model; it can form a resource pool that is used on demand, flexibly and conveniently. Cloud computing technology will become an important support. Background services of technical network systems, such as video websites, picture websites and portal websites, require a large amount of computing and storage resources. With the rapid development and application of the internet industry, each item may in the future have its own identification mark, which needs to be transmitted to a background system for logic processing; data of different levels will be processed separately, and all kinds of industry data require strong backend system support, which can only be realized through cloud computing.

As shown in fig. 1, the cloud video conference system may include: a terminal 10 and a server 20.

The number of terminals 10 may be one or more. The terminal 10 may be, but is not limited to, a smart phone, tablet, notebook, desktop computer, smart box, smart watch, etc.

The server 20 may be an independent physical server, a server cluster or a distributed system formed by a plurality of physical servers, or a cloud server that provides basic cloud computing services such as cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communication, middleware services, domain name services, security services, CDN (Content Delivery Network), and big data and artificial intelligence platforms.

The terminal 10 and the server 20 may be directly or indirectly connected through wired or wireless communication, which is not limited in the embodiment of the present application.

In one example, a client running a target application, such as an application providing video conferencing functionality, is installed in the terminal 10. The server 20 may be a background server of the target application for providing background services to clients of the target application.

In the method for enhancing a voice signal according to the embodiment of the present application, the execution subject of each step may be the terminal 10, for example a client of the target application installed and running on the terminal 10, or may be the server 20. The terminal 10 and the server 20 may also cooperate interactively, that is, a part of the steps of the method are executed by the terminal 10 and another part of the steps are executed by the server 20.

For convenience of description, in the embodiments of the method for enhancing a voice signal described below, only the execution subject of each step is described as a computer device, where the computer device refers to an electronic device having data computing, processing and storage capabilities, such as the terminal 10 or the server 20, which is not limited in this embodiment of the present application.

Referring to fig. 2, a flowchart of a method for enhancing a speech signal according to an embodiment of the application is shown. The method can be applied to computer equipment such as the terminal 10 or the server 20. The method may include the following steps (steps 210-240):

Step 210, a target speech signal is acquired.

The target voice signal refers to a voice signal which needs noise suppression or useful signal enhancement, and can be a voice signal collected by an audio collection device (such as a microphone) in a real environment. In general, the target voice signal contains noise, and optionally, the noise may be noise such as environmental noise, howling, and the like. For example, in a cloud video conference scenario, a microphone collects a speech signal generated by a participant when speaking, and at the same time, the microphone may also collect a noise signal due to environmental, equipment, etc., in which case the noise signal collected by the microphone and the speech signal generated by the participant when speaking together constitute a target speech signal.

In one example, the target speech signal comprises an ultra wideband speech signal. For example, in a cloud video conference scenario, a microphone collects a target voice signal at a sampling frequency of 32KHz, where the bandwidth of the target voice signal is 0 to 16KHz. It should be noted that, as technology evolves, the ultra wideband voice signal may have a larger bandwidth, or the names corresponding to the voice signal with a larger bandwidth may vary, and it should be understood that these shall fall within the protection scope of the present application.
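The relationship between the 32KHz sampling frequency and the 0 to 16KHz bandwidth in this example follows from the Nyquist limit: a sampled signal can only represent frequencies up to half its sampling rate. A minimal sketch of that arithmetic (the function name `max_bandwidth_hz` is hypothetical and used only for illustration):

```python
def max_bandwidth_hz(sampling_rate_hz: float) -> float:
    """Return the highest frequency representable at a given sampling rate.

    This is the Nyquist limit: half of the sampling rate.
    """
    if sampling_rate_hz <= 0:
        raise ValueError("sampling rate must be positive")
    return sampling_rate_hz / 2.0


# A 32 KHz capture covers 0 to 16 KHz (ultra wideband in this document);
# a 16 KHz capture covers 0 to 8 KHz (wideband in this document).
ultra_wideband_limit = max_bandwidth_hz(32_000)
wideband_limit = max_bandwidth_hz(16_000)
```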

Step 220, performing enhancement processing on the target voice signal by adopting a reference voice enhancement mode to obtain a reference enhanced signal.

As can be seen from the description of the background art, for a certain voice signal, the related art uses a fixed voice enhancement method to enhance the voice signal, for example, in the case that the voice signal is an ultra wideband voice signal, the enhancement method of the related art includes: a wideband enhancement mode is employed for the low frequency signal portion of the speech signal and an ultra wideband enhancement mode is employed for the high frequency signal portion of the speech signal. However, this fixed speech enhancement approach does not distinguish between different situations of the speech signal, and is disadvantageous for accurately and efficiently enhancing the speech signal.

Based on this, in the embodiment of the present application, after the target speech signal is obtained, the computer device does not directly enhance the target speech signal and output the enhanced speech signal, but first enhances the target speech signal by using a reference speech enhancement mode to obtain a reference enhanced signal, then further determines a speech enhancement mode adopted in actual enhancement processing, that is, a target speech enhancement mode, based on the reference enhanced signal, and then enhances the target speech signal by using the target speech enhancement mode. The reference enhancement signal can reflect the signal characteristics of the target voice signal, so that the target voice enhancement mode can be determined in a targeted manner according to the reference enhancement signal, the signal characteristics of the target voice signal are fully considered in the target voice signal enhancement process, the target voice signal can be accurately and effectively enhanced, and the enhancement effect of the target voice signal is improved.

The embodiment of the application does not limit the type of the reference voice enhancement mode. Optionally, the reference voice enhancement mode comprises a wideband enhancement mode, such as a voice enhancement mode with a sampling rate of 16KHz; alternatively, the reference voice enhancement mode comprises an ultra wideband enhancement mode, such as a voice enhancement mode with a sampling rate of 32KHz; alternatively, the reference voice enhancement mode comprises both a wideband enhancement mode and an ultra wideband enhancement mode. The embodiment of the application also does not limit the specific content of the reference voice enhancement mode. Optionally, the reference voice enhancement mode comprises at least one of the following: an LSTM (Long Short-Term Memory) network with a sampling rate of 16KHz, an LSTM with a sampling rate of 32KHz, a GRU (Gated Recurrent Unit) with a sampling rate of 16KHz, and a GRU with a sampling rate of 32KHz. For further description of the reference voice enhancement mode and of how the target voice signal is enhanced in that mode to obtain the reference enhancement signal, please refer to the following method embodiments, which are not repeated here.

Step 230, determining a target speech enhancement mode according to the reference enhancement signal.

The computer device may determine the signal characteristics of the target speech signal according to the obtained reference enhancement signal, and determine the target speech enhancement mode according to those signal characteristics. For example, in the case where the reference voice enhancement mode includes the wideband enhancement mode, the computer device performs enhancement processing on the target voice signal in the wideband enhancement mode to obtain the reference enhancement signal. If, by further processing the reference enhancement signal, the computer device determines that the pitch of the reference enhancement signal is small, and therefore that the useful signal in the low-frequency portion of the target voice signal is submerged in noise, the computer device should avoid enhancing the low-frequency portion in the wideband enhancement mode, since doing so would enhance the noise in the low-frequency portion and run counter to the purpose of noise suppression. In this case, the computer device may instead enhance the low-frequency portion in the ultra wideband enhancement mode. For further description of how the computer device determines the target speech enhancement mode according to the reference enhancement signal, please refer to the following method embodiments, which are not repeated here.

Step 240, performing enhancement processing on the target voice signal by adopting the target voice enhancement mode.

The target voice enhancement mode is the enhancement mode actually adopted when enhancement processing is carried out on the target voice signal. Optionally, the target voice enhancement mode may be a single enhancement mode, such as an ultra wideband enhancement mode; it may also be a fusion of multiple enhancement modes, such as a wideband enhancement mode for the low-frequency signal part of the target voice signal and an ultra wideband enhancement mode for the high-frequency signal part of the target voice signal. After the computer device determines the target voice enhancement mode, it performs enhancement processing on the target voice signal according to the target voice enhancement mode and obtains the enhanced voice signal. Optionally, the computer device may send the enhanced voice signal to an audio output device (such as a speaker) so that the audio output device outputs the enhanced voice signal, thereby improving the signal quality of the voice signal.
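A fused target voice enhancement mode can be pictured as follows. The sketch assumes, purely for illustration, that each enhancement mode produces one multiplicative gain per frequency bin and that the target signal is available as a list of per-bin magnitudes; the text does not prescribe this representation, and the name `apply_fused_gains` is hypothetical.

```python
from typing import List, Sequence


def apply_fused_gains(
    spectrum: Sequence[float],
    low_band_gains: Sequence[float],
    full_band_gains: Sequence[float],
) -> List[float]:
    """Apply wideband gains to the low bins and ultra wideband gains to the rest.

    Assumption: `spectrum` and `full_band_gains` cover the full band with the
    same number of bins, and `low_band_gains` covers only the lowest bins.
    """
    split = len(low_band_gains)
    if len(full_band_gains) != len(spectrum) or split > len(spectrum):
        raise ValueError("gain vectors do not match the spectrum layout")
    # Low-frequency part: wideband mode; high-frequency part: ultra wideband mode.
    gains = list(low_band_gains) + list(full_band_gains[split:])
    return [g * x for g, x in zip(gains, spectrum)]


enhanced = apply_fused_gains(
    [1.0, 1.0, 1.0, 1.0],   # target spectrum, 4 bins
    [0.5, 0.5],             # wideband gains for the 2 low bins
    [0.9, 0.9, 0.2, 0.2],   # ultra wideband gains for all 4 bins
)
```

In the single-mode case, the same representation reduces to multiplying every bin by the ultra wideband gain.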

In summary, according to the technical scheme provided by the embodiment of the application, after a voice signal is obtained, the voice signal is first enhanced in a reference voice enhancement mode to obtain a reference enhancement signal; the voice enhancement mode to be used in the actual enhancement processing is then determined based on the reference enhancement signal, and the voice signal is enhanced in that determined mode. The reference enhancement signal can reflect the signal characteristics of the initially acquired voice signal, for example whether the voice signal is clearly dominated by noise, so that the voice enhancement mode actually adopted can be determined in a targeted manner according to the signal characteristics of the reference enhancement signal. In the prior art, a fixed voice enhancement mode cannot treat different conditions of a voice signal differently. By contrast, the embodiment of the application takes the signal characteristics of the voice signal into account during enhancement, which helps to enhance the voice signal accurately and effectively and improves the enhancement effect of the voice signal.
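The control flow of steps 210 to 240 can be sketched as a two-pass procedure: a trial pass that produces the reference enhancement signal, a decision based on that signal, and a final pass on the original target signal. The enhancers and the selection rule below are toy placeholders (all names are hypothetical); real enhancement modes would be models such as the LSTM or GRU mentioned above.

```python
from typing import Callable, List

Signal = List[float]
Enhancer = Callable[[Signal], Signal]


def enhance_adaptively(
    target: Signal,
    reference_enhancer: Enhancer,
    choose_mode: Callable[[Signal], Enhancer],
) -> Signal:
    """Two-pass enhancement following steps 220, 230 and 240."""
    # Step 220: trial enhancement to obtain the reference enhancement signal.
    reference = reference_enhancer(target)
    # Step 230: pick the enhancement mode from the reference enhancement signal.
    target_enhancer = choose_mode(reference)
    # Step 240: enhance the original target signal, not the reference signal.
    return target_enhancer(target)


# Toy stand-ins, for illustration only.
def halve(signal: Signal) -> Signal:
    return [0.5 * x for x in signal]


def mute(signal: Signal) -> Signal:
    return [0.0 for _ in signal]


def energy_rule(reference: Signal) -> Enhancer:
    # Hypothetical rule: keep the signal only if the reference has enough energy.
    energy = sum(x * x for x in reference)
    return halve if energy > 0.1 else mute


loud = enhance_adaptively([1.0, -1.0], halve, energy_rule)
quiet = enhance_adaptively([0.1, 0.1], halve, energy_rule)
```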

In one example, the reference speech enhancement mode includes a first speech enhancement mode and a second speech enhancement mode.

Because there are many types of voice enhancement modes, the computer device can process the target voice signal in several voice enhancement modes separately to obtain reference enhancement signals, and then process the reference enhancement signals to determine the voice enhancement mode actually adopted, namely the target voice enhancement mode. To allow an effective comparison between the reference enhancement signals, in the embodiment of the present application the computer device may perform enhancement processing on the target voice signal in voice enhancement modes with different sampling rates to obtain the reference enhancement signals. Accordingly, the reference voice enhancement mode may include a first voice enhancement mode and a second voice enhancement mode, wherein the sampling rate of the second voice enhancement mode is less than the sampling rate of the first voice enhancement mode. Optionally, the sampling rate of the second voice enhancement mode is one half of the sampling rate of the first voice enhancement mode. Illustratively, the second voice enhancement mode is a wideband voice enhancement mode with a sampling rate of 16KHz, e.g., voice enhancement based on an LSTM with a sampling rate of 16KHz; the first voice enhancement mode is an ultra wideband voice enhancement mode with a sampling rate of 32KHz, e.g., voice enhancement based on a GRU with a sampling rate of 32KHz.

Based on this, the above step 220 includes the following steps:

Step 221, performing enhancement processing on the target voice signal by adopting a first voice enhancement mode to obtain a first enhancement signal.

Optionally, so that the reference enhancement signal can be obtained quickly, the sampling rate of the first speech enhancement mode is the same as the sampling rate of the target speech signal. The computer device can therefore perform enhancement processing on the target voice signal directly in the first voice enhancement mode to obtain the first enhancement signal.

Step 223, performing a downsampling process on the target voice signal to obtain a downsampled voice signal.

Because the sampling rate of the first speech enhancement mode is the same as the sampling rate of the target speech signal, and the sampling rate of the first speech enhancement mode is greater than the sampling rate of the second speech enhancement mode, the sampling rate of the second speech enhancement mode is less than the sampling rate of the target speech signal. Therefore, before the target voice signal is enhanced in the second voice enhancement mode, the computer device needs to downsample the target voice signal to reduce its sampling rate, so that the sampling rate of the downsampled voice signal is the same as the sampling rate of the second voice enhancement mode.

Step 225, performing enhancement processing on the downsampled voice signal by adopting a second voice enhancement mode to obtain a second enhancement signal.

After the downsampled speech signal is obtained, the computer device may perform enhancement processing on the downsampled speech signal in a second speech enhancement mode to obtain a second enhanced signal. Thus, the reference enhancement signal comprises the first enhancement signal and the second enhancement signal described above.

For example, the target speech signal comprises a speech signal having a bandwidth of 0 to 16KHz and a sampling rate of 32KHz, the first speech enhancement mode comprises speech enhancement based on a GRU with a sampling rate of 32KHz, and the second speech enhancement mode comprises speech enhancement based on an LSTM with a sampling rate of 16KHz. The computer device performs enhancement processing on the target voice signal directly in the first voice enhancement mode to obtain a first enhancement signal. The computer device also downsamples the target voice signal, reducing its sampling rate to 16KHz to obtain a downsampled voice signal, and then performs enhancement processing on the downsampled voice signal in the second voice enhancement mode to obtain a second enhancement signal.
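Steps 221, 223 and 225 can be sketched as below. The two enhancers are passed in as placeholder callables (stand-ins for, say, a 32KHz GRU and a 16KHz LSTM), and the downsampler is a deliberately crude one chosen for this sketch: it averages each adjacent pair of samples and keeps one value per pair. A production system would more likely use a proper anti-aliasing resampler. All function names are hypothetical.

```python
from typing import Callable, List, Tuple

Signal = List[float]
Enhancer = Callable[[Signal], Signal]


def downsample_by_two(signal: Signal) -> Signal:
    """Halve the sampling rate (e.g. 32 KHz to 16 KHz).

    Each adjacent pair of samples is averaged (a very rough low-pass filter)
    and replaced by a single value; a trailing unpaired sample is dropped.
    """
    return [(signal[i] + signal[i + 1]) / 2.0 for i in range(0, len(signal) - 1, 2)]


def compute_reference_signals(
    target: Signal,
    first_enhancer: Enhancer,
    second_enhancer: Enhancer,
) -> Tuple[Signal, Signal]:
    """Return (first enhancement signal, second enhancement signal)."""
    first = first_enhancer(target)            # step 221: full-rate enhancement
    downsampled = downsample_by_two(target)   # step 223: reduce the sampling rate
    second = second_enhancer(downsampled)     # step 225: half-rate enhancement
    return first, second


# Identity enhancers are used here only to make the data flow visible.
first, second = compute_reference_signals(
    [1.0, 3.0, 5.0, 7.0], lambda s: list(s), lambda s: list(s)
)
```

Because step 221 does not depend on steps 223 and 225, the two branches could equally be run in either order or concurrently.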

It should be noted that, in the embodiment of the present application, the execution sequence between the step 221 and the steps 223 and 225 is not limited, and optionally, the step 221 is executed before the steps 223 and 225; alternatively, step 221 is performed concurrently with step 223 and step 225; alternatively, step 221 is performed after step 223 and step 225. It should be understood that these are all intended to be within the scope of the present application.

Based on the steps 221 to 225, in one example, the step 230 includes the following steps:

Step 232, extracting a third enhancement signal from the first enhancement signal according to the frequency range of the second enhancement signal, wherein the frequency range of the third enhancement signal is the same as the frequency range of the second enhancement signal.

Since the sampling rate of the second speech enhancement mode is smaller than the sampling rate of the first speech enhancement mode, the frequency range of the second enhancement signal obtained in the second speech enhancement mode is also narrower than the frequency range of the first enhancement signal obtained in the first speech enhancement mode. When processing such as comparison or calculation is performed on the first enhancement signal and the second enhancement signal, only the portions of the enhancement signals that share the same frequency range should be compared, so that the processing result is accurate.

Therefore, the computer device needs to extract the enhancement signal portion corresponding to the same frequency range as the frequency range of the second enhancement signal, that is, the third enhancement signal, from the first enhancement signal in accordance with the frequency range of the second enhancement signal. For example, the first enhancement signal has a frequency range of 0 to 16KHz and the second enhancement signal has a frequency range of 0 to 8KHz, and then the computer apparatus needs to extract the enhancement signal portion having a frequency range of 0 to 8KHz from the first enhancement signal as the third enhancement signal.
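One way to picture this extraction is to assume that the first enhancement signal is available as a list of values for uniformly spaced frequency bins covering 0 to 16KHz; the third enhancement signal is then simply the leading bins that cover 0 to 8KHz. This bin representation and the name `extract_band` are assumptions made for the sketch, not details given in the text.

```python
from typing import List, Sequence


def extract_band(
    bins: Sequence[float],
    full_bandwidth_hz: float,
    band_limit_hz: float,
) -> List[float]:
    """Keep the bins covering 0..band_limit_hz.

    Assumption: `bins` uniformly cover 0..full_bandwidth_hz, lowest frequency first.
    """
    if not 0 < band_limit_hz <= full_bandwidth_hz:
        raise ValueError("band limit must lie within the full bandwidth")
    keep = round(len(bins) * band_limit_hz / full_bandwidth_hz)
    return list(bins[:keep])


# First enhancement signal: 8 bins covering 0 to 16 KHz.
first_enhancement = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2]
# Third enhancement signal: the part matching the 0 to 8 KHz second enhancement signal.
third_enhancement = extract_band(first_enhancement, 16_000, 8_000)
```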

Step 234, determining the target speech enhancement mode according to the third enhancement signal and the second enhancement signal.

In the embodiment of the application, the computer equipment determines the actually adopted voice enhancement mode based on two enhancement signals with the same frequency range, namely a third enhancement signal and a second enhancement signal.

Optionally, the target voice signal includes a first signal portion and a second signal portion, a frequency range of the second signal portion is the same as a frequency range of the second enhancement signal, and the frequency range of the first signal portion is a frequency range other than the frequency range of the second signal portion in the frequency range of the target voice signal; the step 234 includes the following steps:

(1) And calculating the correlation coefficient of the third enhancement signal and the second enhancement signal.

The degree of correlation between two signals can be reflected by their correlation coefficient. In the embodiment of the application, the computer device can calculate the correlation coefficient of the third enhancement signal and the second enhancement signal to determine the degree of correlation between them. Illustratively, assume that the third enhancement signal corresponds to a gain g₁ and the second enhancement signal corresponds to a gain g₂; the correlation coefficient corr of the third enhancement signal and the second enhancement signal is then calculated from g₁ and g₂.

(2) In the case where the correlation coefficient is greater than the first threshold, determining the target speech enhancement mode includes employing a first speech enhancement mode for the first signal portion and employing a second speech enhancement mode for the second signal portion.

In general, the greater the correlation coefficient, the higher the degree of correlation between the two signals. In the embodiment of the application, a first threshold is set. In the case that the correlation coefficient is greater than the first threshold, the third enhancement signal and the second enhancement signal are determined to be strongly correlated; in the case that the correlation coefficient is smaller than the first threshold, they are determined to be weakly correlated. Optionally, the first threshold is 0.05, or 0.06, or 0.04. In practice, the value of the first threshold may be determined according to requirements such as calculation accuracy, and the embodiment of the present application does not limit its value.

Under the condition that the correlation coefficient is larger than the first threshold value, the computer equipment determines that the third enhancement signal and the second enhancement signal have stronger correlation, so that enhancement processing can be carried out on the target voice signal in a mode of combining two voice enhancement modes. In the embodiment of the present application, when the correlation coefficient is greater than the first threshold, the computer device performs enhancement processing in a first speech enhancement mode on a high-frequency signal portion (i.e., a first signal portion) in the target speech signal, and performs enhancement processing in a second speech enhancement mode on a low-frequency signal portion (i.e., a second signal portion) in the target speech signal.

(3) In the case that the correlation coefficient is less than the first threshold, determining the target speech enhancement mode includes employing the first speech enhancement mode on the target speech signal.

In the case where the correlation coefficient is less than the first threshold, the computer device determines that the correlation between the third enhancement signal and the second enhancement signal is weak, possibly because the low-frequency signal portion of the target speech signal contains more noise. To avoid enhancing the noise, which would defeat the purpose of noise suppression, in the embodiment of the present application the computer device performs enhancement processing on the target speech signal in the first speech enhancement mode when the correlation coefficient is less than the first threshold.

It should be noted that, in the case where the correlation coefficient is equal to the first threshold, the computer device may apply the same processing as in the case where the correlation coefficient is less than the first threshold, that is, determining that the target speech enhancement mode includes adopting the first speech enhancement mode for the target speech signal; alternatively, it may apply the same processing as in the case where the correlation coefficient is greater than the first threshold, that is, determining that the target speech enhancement mode includes adopting the first speech enhancement mode for the first signal portion and the second speech enhancement mode for the second signal portion. It should be understood that both approaches fall within the scope of the present application.
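The correlation-based decision in steps (1) to (3) above can be sketched as follows. This is a minimal illustration only: the source formula for corr is not reproduced in this text, so the sketch assumes a Pearson correlation over the two gain sequences, and the names `correlation_coefficient`, `choose_mode_by_correlation`, and the labels `"split"` and `"first_only"` are hypothetical. Equality with the threshold is treated like the "less than" case, which is one of the two choices the text allows.

```python
import math

def correlation_coefficient(g1, g2):
    """Pearson correlation of two equal-length gain sequences (assumed form,
    not necessarily the formula used in the embodiment)."""
    if len(g1) != len(g2) or not g1:
        raise ValueError("gain sequences must be non-empty and equal in length")
    n = len(g1)
    mean1 = sum(g1) / n
    mean2 = sum(g2) / n
    cov = sum((a - mean1) * (b - mean2) for a, b in zip(g1, g2))
    var1 = sum((a - mean1) ** 2 for a in g1)
    var2 = sum((b - mean2) ** 2 for b in g2)
    if var1 == 0 or var2 == 0:
        return 0.0  # a constant sequence carries no correlation information
    return cov / math.sqrt(var1 * var2)

def choose_mode_by_correlation(g1, g2, first_threshold=0.05):
    """g1: gains of the third enhancement signal; g2: gains of the second."""
    corr = correlation_coefficient(g1, g2)
    if corr > first_threshold:
        # Strong correlation: first mode on the first (high-frequency) portion,
        # second mode on the second (low-frequency) portion.
        return "split"
    # Weak correlation (or equality): first mode on the whole signal.
    return "first_only"
```

With this sketch, two gain sequences that rise and fall together yield `"split"`, while sequences that move in opposite directions yield `"first_only"`.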

Based on the steps 221 to 225, in another example, the target voice signal includes a first signal portion and a second signal portion, the frequency range of the second signal portion is the same as the frequency range of the second enhancement signal, and the frequency range of the first signal portion is a frequency range other than the frequency range of the second signal portion in the frequency range of the target voice signal; the step 230 includes the following steps:

Step 231, acquiring a target frequency range, the target frequency range including at least one frequency.

The larger the gain of the speech signal after the enhancement processing is, the worse the noise suppression effect of the enhancement processing is. Thus, in the embodiment of the present application, the computer device may compare the gain of the first enhancement signal with the gain of the second enhancement signal to determine the speech enhancement mode actually used. In the embodiment of the present application, the computer device compares the gains of the first enhancement signal and the second enhancement signal corresponding to at least one frequency in a certain frequency range, and determines the target speech enhancement mode according to the final gain count.

Thus, the computer device first needs to determine the frequency range of the gain comparison. As can be seen from the above description, since the sampling rate of the second speech enhancement mode is smaller than that of the first speech enhancement mode, the frequency range of the second enhancement signal will also be smaller than that of the first enhancement signal. In the gain comparison process, in order to improve accuracy, gain comparison needs to be performed based on the same frequency range, so in the embodiment of the present application, the target frequency range is determined based on the frequency range of the second enhancement signal.

Optionally, the computer device may take the frequency range of the second enhancement signal directly as the target frequency range; alternatively, the computer device takes a part of the frequency range of the second enhancement signal as the target frequency range. The size of the target frequency range is not limited in the embodiment of the application; in practical application, it may be determined in combination with factors such as calculation accuracy and the processing overhead of the computer device. For example, if the frequency range of the second enhancement signal is 0 to 8KHz, the target frequency range may be 0 to 8KHz or 0.6 to 1.5KHz.

Step 233, for a first frequency of the at least one frequency, determining a gain of the first enhancement signal at the first frequency and a gain of the second enhancement signal at the first frequency.

The target frequency range includes at least one frequency. The embodiment of the application does not limit the division manner of at least one frequency in the target frequency range, and optionally, the at least one frequency is associated with the sampling points, that is, one sampling point corresponds to one frequency in the target frequency range; alternatively, at least one frequency is randomly selected within the target frequency range.

The computer device will compare the gains of the first enhancement signal and the second enhancement signal on each of the at least one frequency. Thus, the computer device needs to first determine the gains of the first enhancement signal and the second enhancement signal on each of the at least one frequency. Taking a first frequency of the at least one frequency as an example, the computer device needs to determine the gain of the first enhancement signal at the first frequency and the gain of the second enhancement signal at the first frequency, respectively.

Step 235, adjusting the gain count according to the magnitude relation between the gain of the first enhancement signal at the first frequency and the gain of the second enhancement signal at the first frequency.

At each of the at least one frequency, the computer device compares the gain of the first enhancement signal with the gain of the second enhancement signal and adjusts the value of the gain count based on the result of the comparison. In the embodiment of the present application, the gain count is adjusted either by adding one or by subtracting one. Optionally, the gain count is incremented by one in the case where the gain of the first enhancement signal is greater than the gain of the second enhancement signal, and decremented by one in the case where the gain of the first enhancement signal is less than the gain of the second enhancement signal. Taking a first frequency of the at least one frequency as an example, if the gain of the first enhancement signal at the first frequency is greater than the gain of the second enhancement signal at the first frequency, the gain count is incremented by one; if the gain of the first enhancement signal at the first frequency is less than the gain of the second enhancement signal at the first frequency, the gain count is decremented by one.

It should be noted that the following steps 237 and 239 are described by taking as an example the case where the gain count is incremented by one when the gain of the first enhancement signal is greater than the gain of the second enhancement signal. In addition, in the embodiment of the present application, when the gain of the first enhancement signal is equal to the gain of the second enhancement signal, the gain count may be incremented by one, decremented by one, or left unchanged. It should be understood that all of these options fall within the scope of the present application.

For example, assume that the target frequency range is 0.6 to 1.5KHz, i denotes a frequency within the target frequency range (the frequency corresponding to a sampling point in 0.6 to 1.5KHz), the gain of the first enhancement signal is g₁, and the gain of the second enhancement signal is g₂. The adjustment process of the gain count is as follows:

count = 0
for each i with 0.6KHz ≤ i ≤ 1.5KHz:
    if g₁[i] > g₂[i]:
        count = count + 1
    else:
        count = count - 1

Step 237, in the case where the value of the gain count after the adjustment process is completed is greater than zero, determining that the target speech enhancement mode includes applying the first speech enhancement mode to the first signal portion and the second speech enhancement mode to the second signal portion.

After the adjustment process of the gain count is completed, the computer device determines the target speech enhancement mode according to the final value of the gain count. As is clear from the above description, the larger the signal gain after the enhancement processing, the poorer the noise suppression effect. In the embodiment of the present application, the gain count is incremented by one when the gain of the first enhancement signal is greater than the gain of the second enhancement signal. Therefore, a final gain count greater than zero means that the gain of the first enhancement signal exceeds the gain of the second enhancement signal at most of the compared frequencies, which indicates that the noise suppression effect of the first speech enhancement mode is worse than that of the second speech enhancement mode in the frequency range corresponding to the sampling rate of the second speech enhancement mode. Accordingly, when the final gain count is greater than zero, the computer device performs enhancement processing in the first speech enhancement mode on the high-frequency signal portion (i.e., the first signal portion) of the target speech signal, and performs enhancement processing in the second speech enhancement mode on the low-frequency signal portion (i.e., the second signal portion) of the target speech signal.

Step 239, in the case where the value of the gain count after the adjustment process is completed is less than zero, determining that the target speech enhancement mode includes applying the first speech enhancement mode to the target speech signal.

A final gain count less than zero means that the gain of the first enhancement signal is below the gain of the second enhancement signal at most of the compared frequencies, which indicates that the noise suppression effect of the first speech enhancement mode is better than that of the second speech enhancement mode in the frequency range corresponding to the sampling rate of the second speech enhancement mode. Therefore, when the final gain count is less than zero, the computer device performs enhancement processing on the target speech signal in the first speech enhancement mode.

It should be noted that, in the case where the value of the gain count after the adjustment process is completed is equal to zero, the computer device may apply the same processing as in the case where the value is less than zero, that is, determining that the target speech enhancement mode includes adopting the first speech enhancement mode for the target speech signal; alternatively, it may apply the same processing as in the case where the value is greater than zero, that is, determining that the target speech enhancement mode includes adopting the first speech enhancement mode for the first signal portion and the second speech enhancement mode for the second signal portion. It should be understood that both approaches fall within the scope of the present application.
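Steps 231 to 239 can be sketched in runnable form as below. This is an illustrative sketch under stated assumptions: the per-frequency gains are supplied as parallel lists together with their frequencies in Hz, the default target range is the 0.6 to 1.5KHz example from the text, and the names `gain_count`, `choose_mode_by_gain_count`, `"split"`, and `"first_only"` are hypothetical. Unlike the pseudocode above, equal gains leave the count unchanged here, and a final count of zero is treated like the "less than zero" case; both are among the options the text permits.

```python
def gain_count(freqs_hz, g1, g2, low_hz=600.0, high_hz=1500.0):
    """Count how often the first enhancement signal's gain exceeds the second's
    within the target frequency range [low_hz, high_hz]."""
    count = 0
    for f, a, b in zip(freqs_hz, g1, g2):
        if not (low_hz <= f <= high_hz):
            continue  # frequency lies outside the target frequency range
        if a > b:
            count += 1
        elif a < b:
            count -= 1
        # equal gains: count left unchanged (one permitted choice)
    return count

def choose_mode_by_gain_count(count):
    # count > 0: the first mode suppresses noise less well in the low band,
    # so the second mode handles the low band and the first mode the rest.
    return "split" if count > 0 else "first_only"
```

For instance, with gains compared at 700 Hz, 900 Hz, and 1100 Hz where the first signal's gain is larger at two of the three frequencies, the count is 1 and the sketch selects `"split"`.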

In summary, according to the technical scheme provided by the embodiment of the application, after the reference enhancement signal is obtained, the correlation coefficient of the reference enhancement signal is further determined, and then the voice enhancement mode adopted in the actual enhancement processing is determined according to the difference of the correlation coefficient of the reference enhancement signal. Because the correlation coefficient of the reference enhanced signals can reflect the correlation degree between the reference enhanced signals, the signal characteristics of the voice signals, such as whether the low-frequency signal part of the voice signals is excessively noisy, can be further clarified through the correlation degree between the reference enhanced signals. According to the embodiment of the application, the actually adopted voice enhancement mode is determined according to the correlation coefficient of the reference enhancement signal, the signal characteristics of the voice signal are fully considered, and the enhancement effect of the voice signal is improved.

In addition, the technical scheme provided by the embodiment of the application compares the gain of the reference enhancement signal on at least one frequency in a specific frequency range after the reference enhancement signal is obtained, adjusts the value of the gain count according to the magnitude relation between the gains, and further determines the voice enhancement mode adopted in the actual enhancement processing according to the value of the gain count. The larger the signal gain after the enhancement processing is, the worse the noise suppression effect is, and the noise suppression effect of each reference enhancement mode can be clarified by comparing the gain of the reference enhancement signal, so that references are provided for the computer equipment to determine the actually adopted voice enhancement mode, and the computer equipment is facilitated to select the voice enhancement mode with better noise suppression effect.

In another example, the step 220 includes the following steps:

Step 22A, performing downsampling processing on the target speech signal to obtain a downsampled speech signal.

In general, a wideband enhancement mode is adopted for a low-frequency signal to achieve a better speech enhancement effect, so that the wideband enhancement mode can be adopted to enhance the low-frequency signal to obtain an enhanced speech signal, and then the enhanced speech signal is analyzed to determine whether the low-frequency signal is excessively noisy, whether the wideband enhancement mode is used to significantly enhance noise, and the like.

In the embodiment of the application, in order to realize that the target voice signal is enhanced by adopting the reference voice enhancement mode with lower sampling rate, the computer equipment needs to perform downsampling processing on the target voice signal before enhancing the target voice signal by adopting the reference voice enhancement mode so as to reduce the sampling rate of the target voice signal, and the sampling rate of the voice signal after downsampling is the same as that of the reference voice enhancement mode. For example, the sampling rate of the target speech signal is 32KHz, and the sampling rate of the reference speech enhancement mode is 16KHz, the sampling rate of the target speech signal needs to be reduced to 16KHz.

Step 22B, performing enhancement processing on the downsampled speech signal using the reference speech enhancement mode to obtain a reference enhancement signal.

After the downsampled speech signal is obtained, the computer device may perform enhancement processing on the downsampled speech signal using a reference speech enhancement mode to obtain a reference enhanced signal.
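Steps 22A and 22B can be illustrated with the following sketch, which halves the sampling rate (for example from 32KHz to 16KHz) and then passes the result to an enhancer. The text does not specify a resampling method; the 3-tap smoothing filter used here is a deliberately crude anti-aliasing assumption (a practical resampler would use a much longer low-pass filter), and `downsample_by_two`, `reference_enhance`, and the `enhance_16k` callable are hypothetical names standing in for the reference speech enhancement mode.

```python
def downsample_by_two(samples):
    """Halve the sampling rate: smooth with the kernel [0.25, 0.5, 0.25]
    (edges repeat the boundary sample), then keep every second sample."""
    n = len(samples)
    filtered = []
    for i in range(n):
        prev = samples[i - 1] if i > 0 else samples[i]
        nxt = samples[i + 1] if i < n - 1 else samples[i]
        filtered.append(0.25 * prev + 0.5 * samples[i] + 0.25 * nxt)
    return filtered[::2]

def reference_enhance(samples, enhance_16k):
    """Step 22A then step 22B: downsample, then apply the lower-rate enhancer."""
    return enhance_16k(downsample_by_two(samples))
```

A one-second buffer of 32000 samples becomes 16000 samples after `downsample_by_two`, matching the 32KHz to 16KHz example in the text.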

Based on the steps 22A and 22B, the step 230 includes the following steps:

Step 23A, performing pitch period estimation on the reference enhancement signal to obtain the pitch period of the reference enhancement signal.

From the pitch period of the signal, it can be determined whether the signal carries excessive noise. Thus, the computer device may first perform a pitch period estimation on the reference enhanced signal to obtain a pitch period of the reference enhanced signal. The mode of estimating the pitch period according to the embodiments of the present application is not limited, and optionally, the pitch period estimation includes any one of the following modes: time-domain autocorrelation method and frequency-domain transformation method.

Step 23B, determining the target speech enhancement mode according to the pitch period of the reference enhancement signal.

After obtaining the pitch period of the reference enhanced signal, the computer device may determine the target speech enhancement mode directly from the pitch period of the reference enhanced signal, e.g., comparing the pitch period of the reference enhanced signal to a period threshold, and determining the target speech enhancement mode from the comparison; the target speech enhancement mode may also be determined based on the processing result by further processing the pitch period of the reference enhancement signal, for example, further deriving the pitch of the reference enhancement signal or the pitch frequency of the reference enhancement signal based on the pitch period of the reference enhancement signal, and then further determining the target speech enhancement mode based on the pitch of the reference enhancement signal or the pitch frequency of the reference enhancement signal.

The following description takes as an example the case where the computer device further processes the pitch period of the reference enhancement signal and then determines the target speech enhancement mode according to the processing result.

Optionally, the target speech signal comprises a first signal portion and a second signal portion, the frequency range of the second signal portion being the same as the frequency range of the reference enhancement signal, and the frequency range of the first signal portion being the part of the frequency range of the target speech signal other than the frequency range of the second signal portion. Step 23B includes: determining a pitch of the reference enhancement signal based on the pitch period of the reference enhancement signal; in the case where the pitch of the reference enhancement signal is greater than a second threshold, determining that the target speech enhancement mode includes employing a first speech enhancement mode for the first signal portion and a second speech enhancement mode for the second signal portion; in the case where the pitch of the reference enhancement signal is less than the second threshold, determining that the target speech enhancement mode includes employing the first speech enhancement mode for the target speech signal.

From the pitch period of the reference enhancement signal, the computer device may further determine the pitch of the reference enhancement signal. In general, the higher the pitch of a signal, the greater the useful signal component in the signal; the lower the pitch of a signal, the greater the noise component in the signal. Therefore, in the embodiment of the application, a second threshold is set: if the pitch of the reference enhancement signal is greater than the second threshold, the useful signal component of the reference enhancement signal is considered large; if the pitch of the reference enhancement signal is less than the second threshold, the noise component of the reference enhancement signal is considered large. The specific value of the second threshold is not limited in the embodiment of the present application. Optionally, the second threshold is 50, 60, or 80; in practical application, the value of the second threshold may be determined in combination with factors such as calculation accuracy.

Because the sampling rate of the reference voice enhancement mode is smaller than that of the target voice signal, if the pitch of the reference voice enhancement signal obtained by the reference voice enhancement mode is higher, the reference voice enhancement mode achieves a better voice enhancement effect on the target voice signal, and therefore the reference voice enhancement mode can be adopted to enhance the low-frequency signal part of the target voice signal. Based on this, in the embodiment of the present application, when the pitch of the reference enhanced signal is greater than the second threshold, the computer device determines that the target speech enhancement mode includes adopting a first speech enhancement mode for the first signal portion and adopting a second speech enhancement mode for the second signal portion, where the second speech enhancement mode is the reference speech enhancement mode, and the sampling rate of the second speech enhancement mode is smaller than the sampling rate of the first speech enhancement mode; in the event that the pitch of the reference enhancement signal is less than the second threshold, the computer device determines the target speech enhancement mode includes employing a first speech enhancement mode on the target speech signal.

It should be noted that, in the case where the pitch of the reference enhanced signal is equal to the second threshold, the computer device may execute a processing manner as in the case where the pitch of the reference enhanced signal is smaller than the second threshold, that is, determining the target speech enhancement manner includes adopting the first speech enhancement manner for the target speech signal; the processing means may also be performed as in the case where the pitch of the reference enhancement signal is greater than the second threshold, i.e. the determining the target speech enhancement means comprises applying a first speech enhancement means to the first signal portion and a second speech enhancement means to the second signal portion. It should be understood that both of these ways are within the scope of the present application.
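The time-domain autocorrelation method named in step 23A, followed by the threshold comparison of step 23B, might be sketched as follows. This is an assumption-laden illustration: the lag search bounds `min_lag` and `max_lag` are hypothetical parameters, and because the text does not state how the pitch is derived from the pitch period (or in what unit the second threshold is expressed), `choose_mode_by_pitch` simply takes an already computed pitch value. Equality with the threshold is treated like the "less than" case, one of the two permitted choices.

```python
def estimate_pitch_period(samples, min_lag, max_lag):
    """Return the lag (in samples) with the highest autocorrelation,
    or None for an all-zero signal."""
    energy = sum(x * x for x in samples)
    if energy == 0:
        return None
    best_lag, best_score = min_lag, float("-inf")
    for lag in range(min_lag, max_lag + 1):
        # Autocorrelation at this lag, normalized by the signal energy.
        score = sum(samples[i] * samples[i - lag]
                    for i in range(lag, len(samples))) / energy
        if score > best_score:
            best_lag, best_score = lag, score
    return best_lag

def choose_mode_by_pitch(pitch, second_threshold=50):
    """Step 23B: compare the pitch of the reference enhancement signal
    with the second threshold."""
    return "split" if pitch > second_threshold else "first_only"
```

For a clean periodic input, the returned lag equals the period in samples; dividing by the sampling rate gives the pitch period in seconds.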

In summary, according to the technical scheme provided by the embodiment of the application, after the reference enhancement signal is obtained, the pitch period of the reference enhancement signal is estimated, and then the voice enhancement mode adopted in the actual enhancement processing is determined according to the estimated pitch period. The signal characteristics of the reference enhancement signal, such as the magnitude relation between the noise component and the useful signal component of the reference enhancement signal, can be clarified through the pitch period, so that the computer equipment can determine the noise suppression effect of the reference enhancement mode, thereby providing reference for determining the actually adopted voice enhancement mode and being beneficial to effectively and accurately selecting the voice enhancement mode by the computer equipment.

The above embodiments specifically describe three schemes for determining the target speech enhancement mode according to the reference enhancement signal, and it should be understood that, in practical application, the target speech enhancement mode may be determined by combining the above three schemes. The embodiment of the application does not limit the combination modes and the combination sequence of the three schemes, and the following describes one possible combination mode and the combination sequence.

In one example, the reference speech enhancement mode includes a first speech enhancement mode and a second speech enhancement mode, the second speech enhancement mode having a sampling rate that is less than the sampling rate of the first speech enhancement mode.

Based on this, the step 220 includes: performing enhancement processing on the target voice signal by adopting a first voice enhancement mode to obtain a first enhancement signal; performing downsampling processing on the target voice signal to obtain a downsampled voice signal; performing enhancement processing on the voice signal subjected to the downsampling in a second voice enhancement mode to obtain a second enhancement signal; wherein the reference enhancement signal comprises a first enhancement signal and a second enhancement signal.

Based on this, in one example, the target voice signal includes a first signal portion and a second signal portion, the frequency range of the second signal portion being the same as the frequency range of the second enhancement signal, the frequency range of the first signal portion being a frequency range other than the frequency range of the second signal portion in the frequency range of the target voice signal; the step 230 includes:

(1) Extracting a third enhancement signal from the first enhancement signal according to the frequency range of the second enhancement signal, the frequency range of the third enhancement signal being the same as the frequency range of the second enhancement signal; calculating a correlation coefficient of the third enhancement signal and the second enhancement signal; in the case where the correlation coefficient is greater than the first threshold, determining the target speech enhancement mode includes employing a first speech enhancement mode for the first signal portion and employing a second speech enhancement mode for the second signal portion.

(2) Under the condition that the correlation coefficient is smaller than a first threshold value, estimating the pitch period of the second enhanced signal to obtain the pitch period of the second enhanced signal; determining a pitch of the second enhancement signal based on the pitch period of the second enhancement signal; in the case that the pitch of the second enhancement signal is greater than the second threshold, determining the target speech enhancement mode includes employing the first speech enhancement mode for the first signal portion and employing the second speech enhancement mode for the second signal portion.

(3) Acquiring a target frequency range, which includes at least one frequency, in case the pitch of the second enhancement signal is smaller than a second threshold; determining, for a first frequency of the at least one frequency, a gain of the third enhancement signal at the first frequency and a gain of the second enhancement signal at the first frequency; adjusting the value of the gain count according to the magnitude relation between the gain of the third enhancement signal at the first frequency and the gain of the second enhancement signal at the first frequency; if the gain of the third enhancement signal at the first frequency is greater than the gain of the second enhancement signal at the first frequency, adding one to the gain count; if the gain of the third enhancement signal at the first frequency is smaller than the gain of the second enhancement signal at the first frequency, subtracting one from the gain count; under the condition that the gain count value after finishing the adjustment process is greater than zero, determining the target voice enhancement mode comprises adopting a first voice enhancement mode for the first signal part and adopting a second voice enhancement mode for the second signal part; under the condition that the value of the gain count after finishing the adjustment process is smaller than zero, determining the target voice enhancement mode comprises adopting a first voice enhancement mode for the target voice signal.

The steps and nouns not described in this example may be referred to the description of the above embodiments, which are not repeated here.

In summary, according to the technical scheme provided by the embodiment of the application, after the voice signal is obtained, the voice signal is enhanced by adopting the reference voice enhancement mode to obtain the reference enhanced signal, and then the voice enhancement mode adopted in the actual enhancement processing is further determined by combining a plurality of modes based on the reference enhanced signal, so that the voice enhancement mode adopted in the actual enhancement processing is determined from a plurality of dimensions, and the accuracy of the determination of the voice enhancement mode adopted in the actual is further improved.

In the following, the technical solution of the present application is described by taking as an example the case where the reference enhancement modes include a GRU (gated recurrent unit) with a sampling rate of 32KHz (hereinafter abbreviated as "GRU of 32KHz") and an LSTM (long short-term memory) network with a sampling rate of 16KHz (hereinafter abbreviated as "LSTM of 16KHz"), and the target voice signal is a voice signal with a sampling rate of 32KHz (hereinafter abbreviated as "voice signal of 32KHz"). Referring to fig. 3, a schematic diagram of a method for enhancing a speech signal according to an embodiment of the application is shown. The method includes the following steps:

After the computer device obtains the voice signal of 32KHz, on the one hand, it performs enhancement processing on the voice signal of 32KHz using the GRU of 32KHz to obtain an enhancement signal of 32KHz; on the other hand, it performs downsampling processing on the voice signal of 32KHz to obtain a downsampled voice signal, and then performs enhancement processing on the downsampled voice signal using the LSTM of 16KHz to obtain an enhancement signal of 16KHz.

Wherein, the bandwidth of the 32KHz enhancement signal is 0 to 16KHz, and the bandwidth of the 16KHz enhancement signal is 0 to 8KHz. The computer device calculates the correlation coefficient corr of the signal portion with the bandwidth of 0 to 8KHz in the 32KHz enhanced signal and the 16KHz enhanced signal, that is, performs the signal cross-correlation calculation.

As shown in fig. 3, in the case where the calculated correlation coefficient corr is greater than or equal to 0.05, the computer device determines that the speech enhancement mode actually adopted for the voice signal of 32KHz includes: the LSTM of 16KHz is used for the signal portion with the bandwidth of 0 to 8KHz in the voice signal of 32KHz, and the GRU of 32KHz is used for the signal portion with the bandwidth of 8 to 16KHz in the voice signal of 32KHz.

In the case where the calculated correlation coefficient corr is less than 0.05, the computer device further performs pitch period estimation on the enhancement signal of 16KHz, and further processes the estimated pitch period to obtain the pitch of the enhancement signal of 16KHz. As shown in fig. 3, in the case where the pitch is greater than 50, the computer device determines that the speech enhancement mode actually adopted for the voice signal of 32KHz includes: the LSTM of 16KHz is used for the signal portion with the bandwidth of 0 to 8KHz in the voice signal of 32KHz, and the GRU of 32KHz is used for the signal portion with the bandwidth of 8 to 16KHz in the voice signal of 32KHz.

As shown in fig. 3, in the case where the pitch is less than or equal to 50, the computer device further compares the gain of the enhancement signal of 32KHz with the gain of the enhancement signal of 16KHz and adjusts the value of the gain count, that is, the computer device performs the gain comparison counting process. For the specific procedure of the gain comparison counting process, refer to the description of the above embodiment, which is not repeated here. As shown in fig. 3, in the case where the value of the gain count after the adjustment process is completed is greater than 0, the computer device determines that the speech enhancement mode actually adopted for the voice signal of 32KHz includes: the LSTM of 16KHz is used for the signal portion with the bandwidth of 0 to 8KHz in the voice signal of 32KHz, and the GRU of 32KHz is used for the signal portion with the bandwidth of 8 to 16KHz in the voice signal of 32KHz. In the case where the value of the gain count after the adjustment process is completed is less than or equal to 0, the computer device determines that the speech enhancement mode actually adopted for the voice signal of 32KHz includes: the GRU of 32KHz is used for the voice signal of 32KHz.
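The cascade of checks described for fig. 3 can be summarized in a short sketch. The thresholds (corr greater than or equal to 0.05, pitch greater than 50, gain count greater than 0) are taken from the description above; the function name `select_mode`, the constants `SPLIT` and `GRU_ONLY`, and the use of callables to defer the pitch and gain-count computations until they are needed are illustrative assumptions rather than details of the embodiment.

```python
SPLIT = "lstm16k_on_0_8khz+gru32k_on_8_16khz"
GRU_ONLY = "gru32k_on_full_band"

def select_mode(corr, get_pitch, get_gain_count):
    """corr: correlation of the 0-8KHz part of the 32KHz enhancement signal
    with the 16KHz enhancement signal.
    get_pitch / get_gain_count: zero-argument callables evaluated only when
    the earlier check does not already decide the mode."""
    if corr >= 0.05:
        return SPLIT
    if get_pitch() > 50:
        return SPLIT
    if get_gain_count() > 0:
        return SPLIT
    return GRU_ONLY
```

Passing the later stages as callables mirrors the flow in fig. 3, where pitch estimation and gain comparison are performed only if the preceding condition is not met.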

Referring to fig. 4, a schematic diagram of a voice enhancement effect provided by an embodiment of the present application is shown. Fig. 4 (a) shows the enhancement signal obtained by enhancing the voice signal with the 16KHz LSTM, and fig. 4 (b) shows the enhancement signal obtained by enhancing the voice signal with the 32KHz GRU. As can be seen from fig. 4 (a) and fig. 4 (b), the 16KHz LSTM cannot effectively enhance the high-frequency signal portion of the voice signal, and the 32KHz GRU cannot accurately and effectively enhance the low-frequency signal portion of the voice signal. Fig. 4 (c) shows the enhancement signal obtained by enhancing the voice signal with the technical scheme provided by the embodiment of the present application. Comparing fig. 4 (c) with fig. 4 (a) and fig. 4 (b) shows that the technical scheme provided by the embodiment of the present application can enhance the voice signal accurately and effectively.

The following are examples of the apparatus of the present application that may be used to perform the method embodiments of the present application. For details not disclosed in the embodiments of the apparatus of the present application, please refer to the embodiments of the method of the present application.

Referring to fig. 5, a block diagram of an apparatus for enhancing a speech signal according to an embodiment of the application is shown. The apparatus has the function of implementing the above method example for enhancing a voice signal; the function may be implemented by hardware, or by hardware executing corresponding software. The apparatus may be the computer device described above or may be provided in a computer device. The apparatus 500 may include: a speech signal acquisition module 510, a reference signal determination module 520, an enhancement mode determination module 530, and a speech signal enhancement module 540.

The voice signal acquisition module 510 is configured to acquire a target voice signal.

The reference signal determining module 520 is configured to perform enhancement processing on the target speech signal by using a reference speech enhancement mode, so as to obtain a reference enhanced signal.

An enhancement mode determining module 530, configured to determine a target speech enhancement mode according to the reference enhancement signal.

The voice signal enhancement module 540 is configured to perform enhancement processing on the target voice signal in the target voice enhancement mode.

In one example, the reference speech enhancement mode includes a first speech enhancement mode and a second speech enhancement mode, the second speech enhancement mode having a sampling rate that is less than the sampling rate of the first speech enhancement mode; the reference signal determining module 520 is configured to: performing enhancement processing on the target voice signal by adopting the first voice enhancement mode to obtain a first enhancement signal; performing downsampling processing on the target voice signal to obtain a downsampled voice signal; performing enhancement processing on the downsampled voice signal by adopting the second voice enhancement mode to obtain a second enhancement signal; wherein the reference enhancement signal comprises the first enhancement signal and the second enhancement signal.
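The two-path computation of the reference enhancement signal can be sketched as follows. This is an illustration under stated assumptions: the enhancers are passed in as plain callables (the GRU and LSTM models themselves are not shown), `downsample_by_two` is a hypothetical helper, and its pairwise averaging is only a crude stand-in for a proper anti-aliasing filter.

```python
from typing import Callable, List, Tuple

Signal = List[float]
Enhancer = Callable[[Signal], Signal]


def downsample_by_two(x: Signal) -> Signal:
    """Crude 2:1 decimation: average each pair of samples and keep one value.

    The averaging stands in for a real anti-aliasing low-pass filter.
    """
    return [(x[i] + x[i + 1]) / 2.0 for i in range(0, len(x) - 1, 2)]


def reference_signals(
    target: Signal, first_mode: Enhancer, second_mode: Enhancer
) -> Tuple[Signal, Signal]:
    """Return (first enhancement signal, second enhancement signal)."""
    first_enhanced = first_mode(target)          # full-rate path
    downsampled = downsample_by_two(target)      # half-rate copy of the input
    second_enhanced = second_mode(downsampled)   # lower-rate path
    return first_enhanced, second_enhanced
```

For example, with identity functions in place of the two enhancers, `reference_signals([1.0, 3.0, 5.0, 7.0], f, f)` returns the original four samples for the first path and `[2.0, 6.0]` for the second.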

In one example, as shown in fig. 6, the enhancement mode determining module 530 includes: a reference signal extraction unit 532, configured to extract a third enhancement signal from the first enhancement signal according to a frequency range of the second enhancement signal, where the frequency range of the third enhancement signal is the same as the frequency range of the second enhancement signal; and an enhancement mode determining unit 534, configured to determine the target speech enhancement mode according to the third enhancement signal and the second enhancement signal.
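One way to picture the extraction of the third enhancement signal is as a selection of frequency bins. The sketch below assumes, purely for illustration, that the first enhancement signal is available as a one-sided spectrum with evenly spaced bins from 0 Hz to half the sampling rate; the text does not say whether the extraction is done in the time or the frequency domain, and `extract_band` is a hypothetical name.

```python
def extract_band(spectrum, sample_rate_hz, max_freq_hz):
    """Keep the bins of a one-sided spectrum that lie at or below max_freq_hz.

    Assumption: spectrum[k] sits at k * (sample_rate_hz / 2) / (len(spectrum) - 1) Hz.
    """
    if len(spectrum) < 2:
        return list(spectrum)
    bin_width_hz = (sample_rate_hz / 2.0) / (len(spectrum) - 1)
    return [v for k, v in enumerate(spectrum) if k * bin_width_hz <= max_freq_hz]
```

With a 32KHz first enhancement signal and a 16KHz second enhancement signal, `max_freq_hz` would be 8000, so the extracted bins cover the same 0 to 8KHz range as the second enhancement signal.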

In one example, the target speech signal comprises a first signal portion and a second signal portion, the second signal portion having a frequency range that is the same as the frequency range of the second enhancement signal, the first signal portion having a frequency range that is other than the frequency range of the second signal portion in the frequency range of the target speech signal; as shown in fig. 6, the enhancement mode determining unit 534 is configured to: calculating a correlation coefficient of the third enhancement signal and the second enhancement signal; determining the target speech enhancement mode includes employing the first speech enhancement mode for the first signal portion and the second speech enhancement mode for the second signal portion if the correlation coefficient is greater than a first threshold; determining the target speech enhancement mode includes employing the first speech enhancement mode on the target speech signal if the correlation coefficient is less than a first threshold.
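The correlation step can be illustrated with a Pearson correlation coefficient between the two signals. This section does not spell out which correlation formula is used, so the Pearson form and the function name `correlation_coefficient` are assumptions of this sketch.

```python
import math


def correlation_coefficient(a, b):
    """Pearson correlation of two equally sampled sequences.

    Sequences are truncated to the shorter length; a constant or empty
    input yields 0.0 instead of a division by zero.
    """
    n = min(len(a), len(b))
    if n == 0:
        return 0.0
    a, b = a[:n], b[:n]
    mean_a = sum(a) / n
    mean_b = sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    if var_a == 0 or var_b == 0:
        return 0.0
    return cov / math.sqrt(var_a * var_b)
```

The returned value would then be compared with the first threshold to choose between the split-band mode and the first speech enhancement mode alone.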

In one example, the target speech signal comprises a first signal portion and a second signal portion, the second signal portion having a frequency range that is the same as the frequency range of the second enhancement signal, the first signal portion having a frequency range that is other than the frequency range of the second signal portion in the frequency range of the target speech signal; as shown in fig. 6, the enhancement mode determining module 530 includes: a frequency range acquisition unit 531, configured to acquire a target frequency range, where the target frequency range includes at least one frequency; a signal gain determining unit 533, configured to determine, for a first frequency of the at least one frequency, a gain of the first enhancement signal at the first frequency and a gain of the second enhancement signal at the first frequency; a gain count adjustment unit 535, configured to adjust the value of the gain count according to the magnitude relation between the gain of the first enhancement signal at the first frequency and the gain of the second enhancement signal at the first frequency, where the gain count is increased by one if the gain of the first enhancement signal at the first frequency is greater than the gain of the second enhancement signal at the first frequency, and is decreased by one if the gain of the first enhancement signal at the first frequency is smaller than the gain of the second enhancement signal at the first frequency; an enhancement mode determining unit 537, configured to determine, when the value of the gain count after completing the adjustment process is greater than zero, that the target speech enhancement mode includes adopting the first speech enhancement mode for the first signal portion and adopting the second speech enhancement mode for the second signal portion; the enhancement mode determining unit 537 is further configured to determine, when the value of the gain count after completing the adjustment process is less than zero, that the target speech enhancement mode includes adopting the first speech enhancement mode for the target speech signal.
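The gain comparison counting process described above amounts to a signed vote over the frequencies in the target frequency range. A minimal sketch follows, assuming the per-frequency gains of the two enhancement signals are already available as two aligned lists; the name `gain_count` is hypothetical.

```python
def gain_count(first_gains, second_gains):
    """Count +1 where the first gain is larger, -1 where it is smaller.

    Equal gains leave the count unchanged, matching the two cases the text
    enumerates.
    """
    count = 0
    for g1, g2 in zip(first_gains, second_gains):
        if g1 > g2:
            count += 1
        elif g1 < g2:
            count -= 1
    return count
```

A positive result selects the split-band mode, and a negative result selects the first speech enhancement mode for the whole target speech signal.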

In one example, the second speech enhancement mode has a sampling rate that is one-half the sampling rate of the first speech enhancement mode.

In one example, the first speech enhancement mode includes speech enhancement based on a GRU; the second speech enhancement mode includes LSTM-based speech enhancement.

In one example, the reference signal determination module 520 is configured to: performing downsampling processing on the target voice signal to obtain a downsampled voice signal; and enhancing the voice signal after the downsampling by adopting the reference voice enhancement mode to obtain the reference enhanced signal.

In one example, as shown in fig. 6, the enhancement mode determining module 530 includes: a pitch period determining unit 53A, configured to perform pitch period estimation on the reference enhanced signal, so as to obtain a pitch period of the reference enhanced signal; an enhancement mode determining unit 53B, configured to determine the target speech enhancement mode according to the pitch period of the reference enhancement signal.

In one example, the target speech signal includes a first signal portion and a second signal portion, the second signal portion having a frequency range that is the same as a frequency range of the reference enhancement signal, the first signal portion having a frequency range that is other than the frequency range of the second signal portion in the frequency range of the target speech signal; as shown in fig. 6, the enhancement mode determining unit 53B is configured to: determining a pitch of the reference enhancement signal according to a pitch period of the reference enhancement signal; determining the target speech enhancement mode includes employing a first speech enhancement mode for the first signal portion and a second speech enhancement mode for the second signal portion if the pitch of the reference enhancement signal is greater than a second threshold; the sampling rate of the second voice enhancement mode is smaller than that of the first voice enhancement mode; in the case that the pitch of the reference enhancement signal is less than a second threshold, determining the target speech enhancement mode includes employing a first speech enhancement mode on the target speech signal.
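Pitch period estimation and the conversion from pitch period to pitch can be illustrated with a plain autocorrelation search. This section does not name the estimator, so autocorrelation is a swapped-in technique. Converting the period to a pitch as sampling rate divided by period (a fundamental frequency in Hz) is likewise an assumption, as are the function names and the 50 to 500 Hz search range.

```python
def estimate_pitch_period(frame, sample_rate_hz, min_hz=50.0, max_hz=500.0):
    """Return the lag in samples with the highest autocorrelation.

    Only lags corresponding to min_hz..max_hz are searched. A return value
    of 0 means no positive autocorrelation peak was found.
    """
    min_lag = max(1, int(sample_rate_hz / max_hz))
    max_lag = min(len(frame) - 1, int(sample_rate_hz / min_hz))
    best_lag, best_score = 0, 0.0
    for lag in range(min_lag, max_lag + 1):
        score = sum(frame[i] * frame[i + lag] for i in range(len(frame) - lag))
        if score > best_score:
            best_lag, best_score = lag, score
    return best_lag


def pitch_from_period(period_samples, sample_rate_hz):
    """Convert a pitch period in samples to a pitch in Hz (0.0 if unvoiced)."""
    return sample_rate_hz / period_samples if period_samples > 0 else 0.0
```

In this sketch the resulting pitch is the quantity that would be compared with the second threshold; whether the patent's threshold is expressed in Hz is not stated in this section.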

In one example, the reference speech enhancement mode includes a first speech enhancement mode and a second speech enhancement mode, the second speech enhancement mode having a sampling rate that is less than the sampling rate of the first speech enhancement mode; the reference signal determining module 520 is configured to: performing enhancement processing on the target voice signal by adopting a first voice enhancement mode to obtain a first enhancement signal; performing downsampling processing on the target voice signal to obtain a downsampled voice signal; performing enhancement processing on the downsampled voice signal by adopting a second voice enhancement mode to obtain a second enhancement signal; wherein the reference enhancement signal comprises the first enhancement signal and the second enhancement signal. The target voice signal comprises a first signal part and a second signal part, the frequency range of the second signal part is the same as the frequency range of the second enhancement signal, and the frequency range of the first signal part is a frequency range except for the frequency range of the second signal part in the frequency range of the target voice signal; the enhancement mode determining module 530 is configured to: extracting a third enhancement signal from the first enhancement signal according to the frequency range of the second enhancement signal, wherein the frequency range of the third enhancement signal is the same as the frequency range of the second enhancement signal; calculating a correlation coefficient of the third enhancement signal and the second enhancement signal; determining the target speech enhancement mode includes employing the first speech enhancement mode for the first signal portion and the second speech enhancement mode for the second signal portion if the correlation coefficient is greater than a first threshold; under the condition that the correlation coefficient is smaller than a first threshold value, estimating the pitch period of the second enhanced signal to obtain the pitch period of the second enhanced signal; determining a pitch of the second enhancement signal based on a pitch period of the second enhancement signal; determining the target speech enhancement mode includes employing the first speech enhancement mode for the first signal portion and employing the second speech enhancement mode for the second signal portion if the pitch of the second enhancement signal is greater than a second threshold; acquiring a target frequency range, which includes at least one frequency, in case the pitch of the second enhancement signal is smaller than a second threshold; determining, for a first frequency of the at least one frequency, a gain of the third enhancement signal at the first frequency and a gain of the second enhancement signal at the first frequency; adjusting the value of gain counting according to the magnitude relation between the gain of the third enhancement signal at the first frequency and the gain of the second enhancement signal at the first frequency; if the gain of the third enhancement signal at the first frequency is greater than the gain of the second enhancement signal at the first frequency, adding one to the gain count; if the gain of the third enhancement signal at the first frequency is smaller than the gain of the second enhancement signal at the first frequency, subtracting one from the gain count; determining the target speech enhancement mode includes employing the first speech enhancement mode for the first signal portion and employing the second speech enhancement mode for the second signal portion if the value of the gain count after completing the adjustment process is greater than zero; and under the condition that the value of the gain count after finishing the adjustment process is smaller than zero, determining the target voice enhancement mode comprises adopting the first voice enhancement mode for the target voice signal.

In one example, the target speech signal comprises an ultra wideband speech signal.

It should be noted that, when the apparatus provided in the foregoing embodiment implements its functions, the division into the foregoing functional modules is used only as an example. In practical applications, the functions may be allocated to different functional modules as required, that is, the internal structure of the device may be divided into different functional modules to implement all or part of the functions described above. In addition, the apparatus embodiments and the method embodiments provided above belong to the same concept; for the specific implementation process of the apparatus, refer to the method embodiments, which is not repeated here.

Referring to fig. 7, a block diagram of a computer device according to an embodiment of the present application is shown. The computer device may be a terminal or a server. Specifically:

The computer device 700 includes a central processing unit (CPU) 701, a system memory 704 including a random access memory (RAM) 702 and a read-only memory (ROM) 703, and a system bus 705 connecting the system memory 704 and the central processing unit 701. The computer device 700 also includes a basic input/output (I/O) system 706, which helps to transfer information between various devices within the computer, and a mass storage device 707 for storing an operating system 713, application programs 714, and other program modules 715.

The basic input/output system 706 includes a display 708 for displaying information and an input device 709, such as a mouse or keyboard, for a user to input information. Both the display 708 and the input device 709 are coupled to the central processing unit 701 through an input/output controller 710 coupled to the system bus 705. The basic input/output system 706 may also include the input/output controller 710 for receiving and processing input from a number of other devices, such as a keyboard, mouse, or electronic stylus. Similarly, the input/output controller 710 also provides output to a display screen, a printer, or other type of output device.

The mass storage device 707 is connected to the central processing unit 701 through a mass storage controller (not shown) connected to the system bus 705. The mass storage device 707 and its associated computer-readable media provide non-volatile storage for the computer device 700. That is, the mass storage device 707 may include a computer readable medium (not shown) such as a hard disk or CD-ROM (Compact Disc Read-Only Memory) drive.

Computer readable media may include computer storage media and communication media without loss of generality. Computer storage media includes volatile and nonvolatile, removable and non-removable media implemented in any method or technology for storage of information such as computer readable instructions, data structures, program modules or other data. Computer storage media includes RAM, ROM, EPROM (Erasable Programmable Read Only Memory), EEPROM (Electrically Erasable Programmable Read Only Memory, electrically erasable programmable read-only memory), flash memory or other solid state memory technology, CD-ROM, DVD (Digital Video Disc, high density digital video disc) or other optical storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices. Of course, those skilled in the art will recognize that computer storage media are not limited to the ones described above. The system memory 704 and mass storage device 707 described above may be collectively referred to as memory.

According to various embodiments of the application, the computer device 700 may also be run by a remote computer connected to a network, such as the Internet. That is, the computer device 700 may be connected to the network 712 through a network interface unit 711 coupled to the system bus 705; the network interface unit 711 may also be used to connect to other types of networks or remote computer systems (not shown).

The memory also includes a computer program stored in the memory and configured to be executed by the one or more processors to implement the method of enhancing a speech signal described above.

In an exemplary embodiment, a computer readable storage medium is also provided, in which at least one instruction, at least one program, a set of codes, or a set of instructions is stored, which when executed by a processor, implement the above-mentioned method of enhancing a speech signal.

Optionally, the computer-readable storage medium may include: read-only memory (ROM), random access memory (RAM), solid state drive (SSD), or optical disc, etc. The random access memory may include resistive random access memory (ReRAM) and dynamic random access memory (DRAM), among others.

In an exemplary embodiment, a computer program product or computer program is also provided, the computer program product or computer program comprising computer instructions stored in a computer readable storage medium. The processor of the computer device reads the computer instructions from the computer-readable storage medium, and the processor executes the computer instructions, so that the computer device performs the above-described enhancement processing method of the voice signal.

It should be understood that references herein to "a plurality" mean two or more. "And/or" describes an association between associated objects and indicates that three relationships are possible; for example, "A and/or B" may indicate: A exists alone, A and B exist together, or B exists alone. The character "/" generally indicates that the objects before and after it are in an "or" relationship. In addition, the step numbers described herein merely illustrate one possible execution order among the steps; in some other embodiments, the steps may be executed out of numerical order, for example, two differently numbered steps may be executed simultaneously, or in an order opposite to that shown, which is not limited by the embodiments of the present application.

The foregoing description of the exemplary embodiments of the application is not intended to limit the application to the particular embodiments disclosed, but on the contrary, the intention is to cover all modifications, equivalents, and alternatives falling within the spirit and scope of the application.

Claims

1. A method for enhanced processing of a speech signal, the method comprising:

acquiring a target voice signal;

performing enhancement processing on the target voice signal by adopting a first voice enhancement mode to obtain a first enhancement signal;

performing downsampling processing on the target voice signal to obtain a downsampled voice signal;

performing enhancement processing on the downsampled voice signal by adopting a second voice enhancement mode to obtain a second enhancement signal, wherein the sampling rate of the second voice enhancement mode is smaller than that of the first voice enhancement mode;

determining a target voice enhancement mode according to a reference enhancement signal, wherein the reference enhancement signal comprises the first enhancement signal and the second enhancement signal;

2. The method of claim 1, wherein determining the target speech enhancement mode from the reference enhancement signal comprises:

extracting a third enhancement signal from the first enhancement signal according to the frequency range of the second enhancement signal, wherein the frequency range of the third enhancement signal is the same as the frequency range of the second enhancement signal;

and determining the target voice enhancement mode according to the third enhancement signal and the second enhancement signal.

3. The method of claim 2, wherein the target speech signal comprises a first signal portion and a second signal portion, the second signal portion having a frequency range that is the same as a frequency range of the second enhancement signal, the first signal portion having a frequency range that is other than the frequency range of the second signal portion in the frequency range of the target speech signal;

the determining the target voice enhancement mode according to the third enhancement signal and the second enhancement signal comprises the following steps:

calculating a correlation coefficient of the third enhancement signal and the second enhancement signal;

determining the target speech enhancement mode includes employing the first speech enhancement mode for the first signal portion and the second speech enhancement mode for the second signal portion if the correlation coefficient is greater than a first threshold;

determining the target speech enhancement mode includes employing the first speech enhancement mode on the target speech signal if the correlation coefficient is less than a first threshold.

4. The method of claim 1, wherein the target speech signal comprises a first signal portion and a second signal portion, the second signal portion having a frequency range that is the same as a frequency range of the second enhancement signal, the first signal portion having a frequency range that is other than the frequency range of the second signal portion in the frequency range of the target speech signal;

the determining the target voice enhancement mode according to the reference enhancement signal comprises the following steps:

acquiring a target frequency range, wherein the target frequency range comprises at least one frequency;

determining, for a first frequency of the at least one frequency, a gain of the first enhancement signal at the first frequency and a gain of the second enhancement signal at the first frequency;

adjusting the value of gain counting according to the magnitude relation between the gain of the first enhancement signal at the first frequency and the gain of the second enhancement signal at the first frequency; if the gain of the first enhancement signal at the first frequency is greater than the gain of the second enhancement signal at the first frequency, adding one to the gain count; if the gain of the first enhancement signal at the first frequency is smaller than the gain of the second enhancement signal at the first frequency, subtracting one from the gain count;

determining the target speech enhancement mode includes employing the first speech enhancement mode for the first signal portion and employing the second speech enhancement mode for the second signal portion if the value of the gain count after completing the adjustment process is greater than zero;

and under the condition that the value of the gain count after finishing the adjustment process is smaller than zero, determining the target voice enhancement mode comprises adopting the first voice enhancement mode for the target voice signal.

5. The method of claim 1, wherein the second speech enhancement mode has a sampling rate that is one-half the sampling rate of the first speech enhancement mode.

6. The method of claim 1, wherein the first speech enhancement mode comprises speech enhancement based on a gated recurrent unit (GRU); the second speech enhancement mode includes speech enhancement based on a long short-term memory (LSTM) network.

7. The method of claim 1, wherein determining the target speech enhancement mode from the reference enhancement signal comprises:

performing pitch period estimation on the reference enhanced signal to obtain a pitch period of the reference enhanced signal;

and determining the target voice enhancement mode according to the pitch period of the reference enhancement signal.

8. The method of claim 7, wherein the target speech signal comprises a first signal portion and a second signal portion, the second signal portion having a frequency range that is the same as a frequency range of the reference enhancement signal, the first signal portion having a frequency range that is other than the frequency range of the second signal portion in the frequency range of the target speech signal;

the determining the target speech enhancement mode according to the pitch period of the reference enhancement signal comprises the following steps:

determining a pitch of the reference enhancement signal according to a pitch period of the reference enhancement signal;

determining the target speech enhancement mode includes employing a first speech enhancement mode for the first signal portion and a second speech enhancement mode for the second signal portion if the pitch of the reference enhancement signal is greater than a second threshold; the sampling rate of the second voice enhancement mode is smaller than that of the first voice enhancement mode;

in the case that the pitch of the reference enhancement signal is less than a second threshold, determining the target speech enhancement mode includes employing a first speech enhancement mode on the target speech signal.

9. The method of claim 1, wherein the target speech signal comprises a first signal portion and a second signal portion, the second signal portion having a frequency range that is the same as a frequency range of the second enhancement signal, the first signal portion having a frequency range that is other than the frequency range of the second signal portion in the frequency range of the target speech signal;

under the condition that the correlation coefficient is smaller than a first threshold value, estimating the pitch period of the second enhanced signal to obtain the pitch period of the second enhanced signal;

determining a pitch of the second enhancement signal based on a pitch period of the second enhancement signal;

determining the target speech enhancement mode includes employing the first speech enhancement mode for the first signal portion and employing the second speech enhancement mode for the second signal portion if the pitch of the second enhancement signal is greater than a second threshold;

acquiring a target frequency range, which includes at least one frequency, in case the pitch of the second enhancement signal is smaller than a second threshold;

determining, for a first frequency of the at least one frequency, a gain of the third enhancement signal at the first frequency and a gain of the second enhancement signal at the first frequency;

adjusting the value of gain counting according to the magnitude relation between the gain of the third enhancement signal at the first frequency and the gain of the second enhancement signal at the first frequency; if the gain of the third enhancement signal at the first frequency is greater than the gain of the second enhancement signal at the first frequency, adding one to the gain count; if the gain of the third enhancement signal at the first frequency is smaller than the gain of the second enhancement signal at the first frequency, subtracting one from the gain count;

10. The method of any of claims 1 to 9, wherein the target speech signal comprises an ultra wideband speech signal.

11. An apparatus for enhancement processing of a speech signal, the apparatus comprising:

the reference signal determining module is used for enhancing the target voice signal in a first voice enhancement mode to obtain a first enhanced signal; performing downsampling processing on the target voice signal to obtain a downsampled voice signal; performing enhancement processing on the downsampled voice signal by adopting a second voice enhancement mode to obtain a second enhancement signal, wherein the sampling rate of the second voice enhancement mode is smaller than that of the first voice enhancement mode;

the enhancement mode determining module is used for determining a target voice enhancement mode according to the reference enhancement signal; wherein the reference enhancement signal comprises the first enhancement signal and the second enhancement signal;

12. A computer device comprising a processor and a memory, wherein the memory has stored therein at least one program that is loaded and executed by the processor to implement the method of enhancing a speech signal according to any one of claims 1 to 10.

13. A computer-readable storage medium, in which at least one program is stored, the at least one program being loaded and executed by a processor to implement the method of enhancing a speech signal according to any one of claims 1 to 10.