CN112233688A - Audio noise reduction method, device, equipment and medium

Info

Publication number: CN112233688A
Application number: CN202011017669.3A
Authority: CN
Inventors: 郝斌; 冯大航; 陈孝良
Original assignee: Beijing SoundAI Technology Co Ltd
Current assignee: Beijing SoundAI Technology Co Ltd
Priority date: 2020-09-24
Filing date: 2020-09-24
Publication date: 2021-01-15
Anticipated expiration: 2040-09-24
Also published as: CN112233688B

Abstract

The disclosure provides an audio noise reduction method, apparatus, device and medium, and belongs to the technical field of audio processing. The method includes: determining the noise existence probability of an audio frame to be denoised according to the ratio of low-frequency energy to high-frequency energy in the audio frame; extracting a noise spectrum of the audio frame according to the noise existence probability; and removing the noise in the audio frame according to the noise spectrum to obtain a target audio frame. In the technical solution provided by the embodiments of the disclosure, the noise existence probability is determined from the low-frequency and high-frequency distribution characteristics of impact noise, and the noise spectrum is then extracted for noise reduction. Because the noise decay does not need to be analyzed across multiple frames of audio information, the noise reduction step introduces little delay, impact noise can be effectively suppressed with a small number of delay frames, and the noise reduction effect is better.

Description

Audio noise reduction method, device, equipment and medium

Technical Field

The present disclosure relates to the field of audio processing technologies, and in particular, to an audio denoising method, apparatus, device, and medium.

Background

In recent years, with the continuous development of audio processing technology, intelligent voice interaction systems such as smart speakers and in-vehicle voice interaction systems have become increasingly widespread. Such a system receives audio that includes the user's voice and processes it, so that the user's voice in the audio is recognized and human-machine interaction is realized. In practical use, various kinds of noise are often mixed into the audio received by the intelligent voice interaction system, so the audio must first undergo noise reduction processing.

In the related art, the audio noise reduction method is generally as follows: exploiting the rapid decay of impact noise, the minimum smoothed power spectrum among the current frame and several future frames is selected as the smoothed power spectrum on which minimum tracking is performed.

In the above method, when the number of delay frames is set to a small value, much of the voice is estimated as impact noise, which damages the voice, so the noise reduction effect is poor.

Disclosure of Invention

The embodiment of the disclosure provides an audio noise reduction method, an audio noise reduction device, audio noise reduction equipment and an audio noise reduction medium, and improves the noise reduction effect. The technical scheme is as follows:

in one aspect, an audio noise reduction method is provided, and the method includes:

determining the noise existence probability of the audio frame to be denoised according to the ratio of low-frequency energy to high-frequency energy in the audio frame to be denoised;

extracting a noise spectrum of the audio frame to be denoised according to the noise existence probability;

and removing the noise in the audio frame to be denoised according to the noise spectrum to obtain a target audio frame.

In one possible implementation manner, the determining, according to a ratio of low-frequency energy to high-frequency energy in an audio frame to be denoised, a noise existence probability of the audio frame to be denoised includes:

determining a first noise existence probability as a noise existence probability of the audio frame to be denoised in response to the fact that the ratio of low-frequency energy to high-frequency energy in the audio frame to be denoised is greater than a ratio threshold;

and determining a second noise existence probability as the noise existence probability of the audio frame to be subjected to noise reduction in response to the fact that the ratio of the low-frequency energy to the high-frequency energy in the audio frame to be subjected to noise reduction is smaller than or equal to a ratio threshold, wherein the first noise existence probability is smaller than the second noise existence probability.

In one possible implementation manner, the determining of the ratio of the low-frequency energy to the high-frequency energy in the audio frame to be denoised includes:

acquiring a low-frequency energy mean value and a high-frequency energy mean value in the audio frame to be denoised according to the frequency spectrum of the audio frame to be denoised;

and determining the ratio of the low-frequency energy mean value to the high-frequency energy mean value as the ratio of the low-frequency energy to the high-frequency energy in the audio frame to be denoised.

In a possible implementation manner, the removing, according to the noise spectrum, noise in the audio frame to be denoised to obtain a target audio frame includes:

adjusting the noise spectrum according to the proportion of high-frequency energy and low-frequency energy in the noise spectrum;

and removing the noise in the audio frame to be denoised according to the adjusted noise spectrum to obtain a target audio frame.

In one possible implementation, the adjusting the noise spectrum according to a ratio of high-frequency energy to low-frequency energy in the noise spectrum includes:

determining the voice existence probability of the noise spectrum according to the proportion of high-frequency energy and low-frequency energy in the noise spectrum;

and extracting noise in the noise spectrum according to the voice existence probability to obtain an adjusted noise spectrum.

In one possible implementation, the determining the speech existence probability of the noise spectrum according to the ratio of the high-frequency energy to the low-frequency energy in the noise spectrum includes:

and in response to the fact that the ratio of the high-frequency energy to the low-frequency energy in the noise spectrum is larger than a target threshold, determining the voice existence probability corresponding to the high-frequency band as a first voice existence probability, and determining the voice existence probability corresponding to the low-frequency band as a second voice existence probability, wherein the first voice existence probability is smaller than the second voice existence probability.

In one possible implementation, the determining of the ratio of the high-frequency energy to the low-frequency energy in the noise spectrum includes:

acquiring a high-frequency energy mean value and a low-frequency energy mean value in the noise spectrum;

and taking the ratio of the high-frequency energy mean value to the low-frequency energy mean value as the proportion of the high-frequency energy to the low-frequency energy in the noise spectrum.

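In one possible implementation, the determining of the ratio of the high-frequency energy to the low-frequency energy in the noise spectrum includes:

determining whether each frequency point in a plurality of frequency points of the noise spectrum has noise;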

and determining the proportion of high-frequency energy and low-frequency energy in the noise spectrum according to the noise existence result of each frequency point.

In a possible implementation manner, the determining, according to the noise existence result of each frequency point, a ratio of high-frequency energy to low-frequency energy in the noise spectrum includes:

and according to the weight of each frequency point, carrying out weighted summation on the noise existence result of each frequency point to obtain the proportion of high-frequency energy and low-frequency energy in the noise spectrum.

In one possible implementation manner, the determining whether noise exists in each of a plurality of frequency points of the noise spectrum includes:

acquiring the magnitude relation between the amplitude of each frequency point in the multiple frequency points of the noise spectrum and an amplitude threshold;

for any frequency point, responding to the condition that the amplitude of the frequency point is greater than or equal to an amplitude threshold value, and determining that the frequency point comprises noise;

and for any frequency point, responding to the condition that the amplitude of the frequency point is smaller than an amplitude threshold value, and determining that the frequency point does not contain noise.
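As a non-limiting illustration, the frequency-point decision and the weighted summation described above can be sketched in Python as follows. The function names, the example amplitudes and the weights are hypothetical; the disclosure does not state how the amplitude threshold or the weights are chosen.

```python
def noise_flags(amplitudes, amplitude_threshold):
    """Return 1 for each frequency point judged to contain noise, else 0.

    A frequency point is judged to contain noise when its amplitude is
    greater than or equal to the amplitude threshold.
    """
    return [1 if a >= amplitude_threshold else 0 for a in amplitudes]


def weighted_noise_score(flags, weights):
    """Weighted sum of the per-frequency-point noise existence results.

    The weights are hypothetical placeholders; one weight per frequency point.
    """
    if len(flags) != len(weights):
        raise ValueError("flags and weights must have the same length")
    return sum(w * f for w, f in zip(weights, flags))


# Hypothetical amplitudes of three frequency points of a noise spectrum.
flags = noise_flags([0.2, 0.5, 0.9], amplitude_threshold=0.5)
score = weighted_noise_score(flags, weights=[0.2, 0.3, 0.5])
```

Giving the high-frequency points larger weights would make the result grow as noise concentrates at high frequencies, but that choice is an assumption of this sketch.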

In one aspect, an audio noise reduction apparatus is provided, the apparatus comprising:

the determining module is used for determining the noise existence probability of the audio frame to be denoised according to the ratio of low-frequency energy to high-frequency energy in the audio frame to be denoised;

the extraction module is used for extracting the noise spectrum of the audio frame to be denoised according to the noise existence probability;

and the noise reduction module is used for removing the noise in the audio frame to be subjected to noise reduction according to the noise spectrum to obtain a target audio frame.

In one possible implementation, the determining module is configured to:

In one possible implementation, the determining module is further configured to:

In one possible implementation, the noise reduction module is configured to:

In a possible implementation manner, the noise reduction module is configured to perform weighted summation on the noise existence result of each frequency point according to the weight of each frequency point, so as to obtain a ratio of high-frequency energy to low-frequency energy in the noise spectrum.

In one possible implementation, the noise reduction module is configured to:

In one aspect, a computer device is provided and includes one or more processors and one or more memories having at least one program code stored therein, the program code being loaded and executed by the one or more processors to implement the operations performed by the above-described audio noise reduction method.

In one aspect, a computer-readable storage medium is provided, in which at least one program code is stored, the program code being loaded and executed by a processor to implement the operations performed by the above-mentioned audio noise reduction method.

In one aspect, a computer program product or computer program is provided that includes one or more program codes stored in a computer-readable storage medium. The one or more program codes can be read by one or more processors of the computer device from a computer-readable storage medium, and the one or more processors execute the one or more program codes, so that the computer device can execute the audio noise reduction method of any one of the above possible embodiments.

The beneficial effects brought by the technical scheme provided by the embodiment of the disclosure at least can include:

In the technical solution provided by the embodiments of the disclosure, the noise existence probability is determined from the low-frequency and high-frequency distribution characteristics of impact noise, and the noise spectrum is extracted according to the noise existence probability, so that noise can be reduced on a single audio frame. Because the noise decay does not need to be analyzed across multiple frames of audio information, the noise reduction step introduces little delay, impact noise can be effectively suppressed with a small number of delay frames, and the noise reduction effect is better.

Drawings

In order to illustrate the technical solutions in the embodiments of the present disclosure more clearly, the drawings used in the description of the embodiments are briefly introduced below. The drawings in the following description show only some embodiments of the present disclosure, and those skilled in the art can obtain other drawings from them without creative effort.

Fig. 1 is a schematic diagram of an audio noise reduction system provided by an embodiment of the present disclosure;

Fig. 2 is a flowchart of an audio denoising method provided by an embodiment of the present disclosure;

Fig. 3 is a flowchart of an audio denoising method provided by an embodiment of the present disclosure;

Fig. 4 is a speech spectrogram containing impact noise according to an embodiment of the present disclosure;

Fig. 5 is a graph of an estimated noise spectrum provided by the related art;

Fig. 6 is a graph of an estimated noise spectrum provided by embodiments of the present disclosure;

Fig. 7 is a graph of an adjusted noise spectrum provided by an embodiment of the present disclosure;

Fig. 8 is a spectrogram of a noise-reduced target audio frame provided by the related art;

Fig. 9 is a spectrogram of a noise-reduced target audio frame according to an embodiment of the present disclosure;

Fig. 10 is a schematic structural diagram of an audio noise reduction device provided in an embodiment of the present disclosure;

Fig. 11 is a schematic structural diagram of a terminal according to an embodiment of the present disclosure;

Fig. 12 is a schematic structural diagram of a server according to an embodiment of the present disclosure.

Detailed Description

To make the objects, technical solutions and advantages of the present disclosure more apparent, embodiments of the present disclosure will be described in detail with reference to the accompanying drawings.

Fig. 1 is a schematic diagram of an audio noise reduction system provided by an embodiment of the present disclosure, and referring to fig. 1, the audio noise reduction system may include a speech acquisition device 110 and a computer device 120, or may be the computer device 120 alone.

When the audio noise reduction system includes the speech acquisition device 110 and the computer device 120, the speech acquisition device 110 may be connected to the computer device 120 through a network or a data line. The speech acquisition device 110 may have a speech acquisition function and may acquire the audio to be denoised. In one possible implementation, the present disclosure may be applied to relatively sharp impact noise in a communication system environment, for example the impact noise of a pen falling or a keyboard being struck hard. Of course, it may also be applied to relatively sharp impact noise in other application scenarios. The computer device 120 may have an audio processing function and may perform noise reduction processing on the audio to be denoised acquired by the speech acquisition device 110.

When the audio noise reduction system only includes the computer device 120, the computer device 120 may have a voice capture function and an audio processing function, and the computer device 120 may capture audio to be noise reduced in various environments and perform noise reduction processing on the audio to be noise reduced.

In one possible implementation manner, the computer device 120 may be a terminal or a server, which is not limited in this disclosure.

Fig. 2 is a flowchart of an audio denoising method provided by an embodiment of the present disclosure, which is applied to a computer device, and referring to fig. 2, the method includes:

201. Determine the noise existence probability of the audio frame to be denoised according to the ratio of the low-frequency energy to the high-frequency energy in the audio frame to be denoised.

202. Extract the noise spectrum of the audio frame to be denoised according to the noise existence probability.

203. Remove the noise in the audio frame to be denoised according to the noise spectrum to obtain a target audio frame.

In the method provided by the embodiments of the disclosure, the noise existence probability is determined from the low-frequency and high-frequency distribution characteristics of impact noise, and the noise spectrum is then extracted for the noise reduction step. Because the noise decay does not need to be analyzed across multiple frames of audio information, the noise reduction step introduces little delay, impact noise can be effectively suppressed with a small number of delay frames, and the noise reduction effect is better.
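To make step 203 concrete, the following Python sketch applies a per-frequency-point gain computed from an estimated noise spectrum. It uses plain power spectral subtraction as a simple stand-in technique; this is an assumption of the sketch and not the removal method of the disclosure. The function name and the `gain_floor` parameter are hypothetical.

```python
import math


def suppression_gains(power, noise_power, gain_floor=0.1):
    """Return one amplitude gain per frequency point.

    power       : |Y(k)|^2 of the frame to be denoised
    noise_power : estimated noise power per frequency point
    gain_floor  : hypothetical lower bound that limits musical noise

    Stand-in technique: power spectral subtraction,
    gain = sqrt(max(1 - noise / power, 0)), clipped at gain_floor.
    """
    gains = []
    for p, n in zip(power, noise_power):
        if p <= 0.0:
            # No signal energy at this point; keep the floor gain.
            gains.append(gain_floor)
            continue
        g = math.sqrt(max(1.0 - n / p, 0.0))
        gains.append(max(g, gain_floor))
    return gains


# Hypothetical two-point example: the first point is mostly signal,
# the second is dominated by noise and is attenuated to the floor.
gains = suppression_gains([4.0, 1.0], [1.0, 2.0])
```

Multiplying each spectral amplitude of the frame by its gain and converting back to the time domain would yield a target audio frame under these assumptions.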

In one possible implementation manner, the determining the noise existence probability of the audio frame to be denoised according to the ratio of the low-frequency energy to the high-frequency energy in the audio frame to be denoised comprises:

and determining a second noise existence probability as the noise existence probability of the audio frame to be subjected to noise reduction in response to the condition that the ratio of the low-frequency energy to the high-frequency energy in the audio frame to be subjected to noise reduction is smaller than or equal to a ratio threshold, wherein the first noise existence probability is smaller than the second noise existence probability.

In one possible implementation manner, the determining of the ratio of the low-frequency energy and the high-frequency energy in the audio frame to be denoised includes:

and determining the ratio of the low-frequency energy mean value to the high-frequency energy mean value as the ratio of the low-frequency energy to the high-frequency energy in the audio frame to be denoised.

In one possible implementation manner, the removing noise in the audio frame to be denoised according to the noise spectrum to obtain a target audio frame includes:

and removing the noise in the audio frame to be denoised according to the adjusted noise spectrum to obtain the target audio frame.

and extracting the noise in the noise spectrum according to the voice existence probability to obtain an adjusted noise spectrum.

and in response to the fact that the ratio of the high-frequency energy to the low-frequency energy in the noise spectrum is larger than a target threshold value, determining the voice existence probability corresponding to the high-frequency band as a first voice existence probability, and determining the voice existence probability corresponding to the low-frequency band as a second voice existence probability, wherein the first voice existence probability is smaller than the second voice existence probability.

In one possible implementation, the determining of the ratio of the high frequency energy to the low frequency energy in the noise spectrum includes:

acquiring a medium-high frequency energy mean value and a low-frequency energy mean value of the noise spectrum;

In a possible implementation manner, the determining, according to the result of the existence of the noise in each frequency point, a ratio of high-frequency energy to low-frequency energy in the noise spectrum includes:

In one possible implementation, the determining whether noise exists in each of the plurality of frequency points of the noise spectrum includes:

acquiring the magnitude relation between the amplitude of each frequency point in a plurality of frequency points of the noise spectrum and an amplitude threshold;

for any frequency point, in response to the fact that the amplitude of the frequency point is larger than or equal to the amplitude threshold value, determining that the frequency point comprises noise;

and for any frequency point, responding to the condition that the amplitude of the frequency point is smaller than the amplitude threshold value, and determining that the frequency point does not contain noise.

Fig. 3 is a flowchart of an audio denoising method provided in an embodiment of the present disclosure, and referring to fig. 3, the method may include:

301. The computer device obtains an audio frame to be denoised.

In the embodiments of the present disclosure, the computer device may be a terminal or a server. The audio frame to be denoised may be one frame of noisy audio acquired in any of various scenes. For example, in vehicles such as automobiles, ships and airplanes, the acquired audio frame to be denoised may include wind noise from an automobile driving at high speed with its windows open, or rain noise from an automobile driving on a rainy day; in a home environment, the acquired audio frame to be denoised may include noise from a television or rotation noise from a washing machine. This is not limited in the embodiments of the present disclosure.

The computer device may acquire the audio frame to be denoised in various ways. In one possible implementation, the acquisition process may include any one of the following modes one to three:

Mode one: the computer device directly acquires the audio frame to be denoised.

The computer device may have a voice collecting function, and may directly collect sound to obtain the audio frame to be denoised.

Mode two: the computer device obtains the audio frame to be denoised that was collected by a voice collecting device.

The computer device may be connected to the voice collecting device through a network or a data line to obtain the audio frame to be denoised collected by the voice collecting device. The voice collecting device may be any kind of device having a voice collecting function, which is not limited in the embodiments of the present disclosure.

Mode three: the computer device extracts the audio frame to be denoised from a database.

In mode three, the audio frame to be denoised may be stored in a database, and when the computer device needs to process the audio frame to be denoised, it extracts the audio frame from the database.

It should be noted that the computer device may obtain the audio to be denoised, denoise each frame of that audio to obtain a target audio frame corresponding to each frame, and thereby obtain the target audio after noise reduction. Only the process of denoising one audio frame is described here; the process of denoising the other audio frames is similar and is not repeated.

302. The computer device determines a ratio of low frequency energy to high frequency energy in an audio frame to be denoised.

When the audio frame is processed, because the noise in the audio frame to be denoised is distributed fairly evenly over the frequency domain, converting the frame to the frequency domain for calculation reduces the computational difficulty and allows the noise to be estimated and removed more effectively, which improves the noise reduction effect. Therefore, the computer device may first obtain the frequency spectrum of the audio frame to be denoised, and then determine the ratio according to the frequency spectrum.

When the computer device obtains the above ratio, the computer device may analyze the low frequency energy and the high frequency energy in the audio frame to be denoised according to the frequency spectrum, so as to determine the ratio of the two. This process may be implemented in a number of ways, one possible implementation of which is provided below.

In this implementation, the computer device may obtain a low-frequency energy mean value and a high-frequency energy mean value of the audio frame to be denoised from its frequency spectrum, and determine the ratio of the low-frequency energy mean value to the high-frequency energy mean value as the ratio of low-frequency energy to high-frequency energy in the audio frame to be denoised.

The low-frequency energy mean value characterizes the low-frequency energy in the audio frame, and the high-frequency energy mean value characterizes the high-frequency energy. The computer device therefore takes the ratio of these two mean values as the ratio of the low-frequency energy to the high-frequency energy.

For example, the computer device performs time-frequency conversion on the audio frame to be denoised, converting its time-domain signal into a frequency spectrum. Taking as an example a Fast Fourier Transform (FFT) length of 512, an overlap-and-add (overlap & add) length of 256 and a sampling rate of 16 kHz, with the squared audio amplitude denoted Ya^2, the computer device can obtain the low-frequency energy mean value E_low and the high-frequency energy mean value E_high from the frequency spectrum of the audio frame to be denoised through Formula one and Formula two.

After obtaining the low-frequency energy mean value E_low and the high-frequency energy mean value E_high, the computer device can obtain their ratio, slope, through the following Formula three:

slope = E_low / E_high    (Formula three)

Of course, the process may also be implemented in other manners, for example by obtaining the sum of the audio amplitudes of the low-frequency points and the sum of the audio amplitudes of the high-frequency points in the spectrum, and taking the ratio of the two sums as the ratio. The embodiments of the present disclosure are not limited thereto.
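The ratio computation of step 302 can be sketched in Python as follows, as an illustration under stated assumptions rather than the formulas of the disclosure. The band boundary `split_hz`, the exclusion of the DC bin and the function names are hypothetical, since the band limits are not given in this text. A naive DFT is used only to keep the sketch free of third-party dependencies.

```python
import cmath
import math


def power_spectrum(frame):
    """Return |Y(k)|^2 for k = 0..N/2 of one real-valued frame (naive DFT)."""
    n = len(frame)
    spectrum = []
    for k in range(n // 2 + 1):
        acc = 0j
        for t, sample in enumerate(frame):
            acc += sample * cmath.exp(-2j * math.pi * k * t / n)
        spectrum.append(abs(acc) ** 2)
    return spectrum


def band_energy_ratio(power, sample_rate=16000, split_hz=2000.0):
    """Return (e_low, e_high, slope) with slope = e_low / e_high.

    split_hz is a hypothetical boundary between the low and high bands.
    """
    n_fft = 2 * (len(power) - 1)
    split_bin = int(split_hz * n_fft / sample_rate)
    low = power[1:split_bin]      # assumption: skip the DC bin
    high = power[split_bin:]
    e_low = sum(low) / len(low)
    e_high = sum(high) / len(high)
    eps = 1e-12                   # guard against division by zero
    return e_low, e_high, e_low / (e_high + eps)


# Hypothetical 512-sample frame at 16 kHz holding a 500 Hz tone:
# its energy sits in the low band, so slope comes out large.
frame = [math.sin(2 * math.pi * 500 * t / 16000) for t in range(512)]
e_low, e_high, slope = band_energy_ratio(power_spectrum(frame))
```

In practice an FFT routine and an analysis window would replace the naive DFT; the band means and their ratio are computed the same way.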

303. The computer device determines the noise existence probability of the audio frame to be denoised according to the ratio of the low-frequency energy to the high-frequency energy in the audio frame to be denoised.

Through the above steps, the computer device obtains the ratio of the low-frequency energy to the high-frequency energy in the audio frame to be denoised, which characterizes how the energy in the frame is distributed between low and high frequencies. Impact noise generally occurs over a short time, has concentrated energy, and has a relatively uniform frequency distribution. Therefore, more low-frequency energy indicates more voice, and more high-frequency energy indicates more impact noise. Using this ratio, the computer device can determine the noise existence probability of the audio frame to be denoised.

In one possible implementation, a ratio threshold may be set, and the noise existence probability may be determined by comparing the ratio with the ratio threshold. The comparison has two possible outcomes, and a different noise existence probability may be set for each.

In particular, in the implementation, in a first case, the computer device may determine the first noise existence probability as the noise existence probability of the audio frame to be denoised in response to a ratio of low frequency energy to high frequency energy in the audio frame to be denoised being greater than a ratio threshold. In a second case, the computer device may determine a second noise existence probability as the noise existence probability of the audio frame to be denoised in response to the ratio of the low frequency energy and the high frequency energy in the audio frame to be denoised being less than or equal to a ratio threshold, the first noise existence probability being less than the second noise existence probability.

The ratio threshold may be set by a person skilled in the art as required and may be an empirical value; its specific value is not limited in the embodiments of the present disclosure.

For example, still denoting the ratio by slope and assuming the ratio threshold is T: if slope > T, low-frequency energy is concentrated in the signal and the probability that it is human voice is high, so the noise existence probability can be set lower. If slope ≤ T, high-frequency energy is concentrated in the signal and the probability that it is noise is high, so the noise existence probability can be set higher.

The noise existence probability is negatively correlated with the ratio of the low-frequency energy to the high-frequency energy. In one possible implementation, a plurality of candidate noise existence probabilities may be preset. When slope > T, the computer device may set the noise existence probability to the smallest of the candidates; when slope ≤ T, it may set the noise existence probability to one of the other, larger candidates.
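The threshold comparison of step 303 can be sketched in Python as follows. The threshold value and the two probability values are hypothetical placeholders; the disclosure only requires that the first noise existence probability be smaller than the second.

```python
def noise_existence_probability(slope, ratio_threshold=1.0,
                                first_probability=0.1,
                                second_probability=0.9):
    """Map the low/high energy ratio of a frame to a noise existence probability.

    slope > ratio_threshold  : low-frequency energy dominates (voice-like),
                               so the smaller first probability is returned.
    slope <= ratio_threshold : high-frequency energy dominates (noise-like),
                               so the larger second probability is returned.
    All default values are hypothetical.
    """
    if first_probability >= second_probability:
        raise ValueError("first probability must be smaller than the second")
    if slope > ratio_threshold:
        return first_probability
    return second_probability


# Hypothetical ratios for a voice-like frame and an impact-noise-like frame.
p_voice_like = noise_existence_probability(5.0)
p_noise_like = noise_existence_probability(0.2)
```

A frame whose ratio equals the threshold falls into the second case, matching the "less than or equal to" condition in the text.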

304. The computer device extracts the noise spectrum of the audio frame to be denoised according to the noise existence probability.

Once the computer device has determined the noise existence probability, it can extract a noise spectrum from the audio frame to be denoised. The extraction may be implemented in various ways. In one possible implementation, it may be implemented through Optimally Modified Log-Spectral Amplitude estimator (OMLSA) calculation and Improved Minima Controlled Recursive Averaging (IMCRA) calculation.

In the OMLSA and IMCRA calculation, the noise existence probability determined in the above steps is applied, and the noise spectrum of the audio frame to be denoised is extracted from that frame together with information from the preceding and following audio frames. Specifically, the computer device may obtain the smoothed power spectrum of the audio frame to be denoised and the power spectrum of the previous frame, so as to track the minimum of the smoothed power spectrum. It then uses the noise existence probability to infer whether speech is present and to compensate the minimum-based noise spectrum estimate, finally obtaining a more accurate noise spectrum, which is adjusted through the OMLSA calculation.
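The way a noise existence probability can steer a recursive noise estimate is sketched below in Python. This is a heavily simplified illustration of probability-controlled recursive averaging, not a full IMCRA or OMLSA implementation: minimum tracking and bias compensation are omitted, a single frame-level probability is used for all frequency points, and the smoothing factor `alpha` and the function name are hypothetical.

```python
def update_noise_spectrum(noise_prev, power, noise_probability, alpha=0.85):
    """One recursive-averaging update of the noise power estimate.

    noise_prev        : previous noise power estimate per frequency point
    power             : |Y(k)|^2 of the current frame per frequency point
    noise_probability : noise existence probability of the frame, in [0, 1]
    alpha             : hypothetical base smoothing factor
    """
    speech_probability = 1.0 - noise_probability
    # The more likely speech is, the closer the effective factor is to 1,
    # so the old estimate is held instead of following the current frame.
    alpha_eff = alpha + (1.0 - alpha) * speech_probability
    return [alpha_eff * n + (1.0 - alpha_eff) * p
            for n, p in zip(noise_prev, power)]


# Hypothetical values: with probability 0 the estimate is frozen,
# with probability 1 it moves toward the current frame.
held = update_noise_spectrum([1.0, 2.0], [10.0, 20.0], noise_probability=0.0)
moved = update_noise_spectrum([1.0, 2.0], [10.0, 20.0], noise_probability=1.0)
```

Under these assumptions a frame judged to be impact noise updates the noise spectrum immediately, without waiting for future frames.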

For example, Fig. 4 shows a speech spectrogram containing impact noise. As shown in Fig. 4, impact noise generally occurs over a short time, has concentrated energy, and has a relatively uniform frequency distribution. With the noise reduction method provided by the related art, the estimated noise spectrum may be as shown in Fig. 5: when the number of delay frames is set to a small value, much of the human voice is identified as noise, so the method is unsuitable for application scenarios with strict latency requirements and its noise reduction effect is poor. With the noise reduction method provided by the present disclosure, the obtained noise spectrum may be as shown in Fig. 6: the human voice component in the impact noise estimate is greatly reduced, and the noise reduction effect is better.

305. The computer device adjusts the noise spectrum according to the ratio of the high frequency energy to the low frequency energy in the noise spectrum.

After the computer device obtains the noise spectrum corresponding to the audio frame to be denoised, the noise spectrum may still contain a small amount of human voice, for example sounds such as [s], [z], [dz] and [ts]. The human voice contained in the noise spectrum can therefore be identified by analyzing the distribution of high-frequency and low-frequency energy in the noise spectrum, and the noise spectrum can be adjusted to remove it.

In one possible implementation, the computer device may adjust the noise spectrum by steps one and two as follows.

Step one, determining the voice existence probability of the noise spectrum according to the proportion of high-frequency energy and low-frequency energy in the noise spectrum.

In the first step, whether the noise spectrum includes the human voice or not can be represented by the ratio of the high-frequency energy to the low-frequency energy in the noise spectrum, so as to determine whether a large amount of human voice is included or not, and obtain the voice existence probability.

In one possible implementation, a target threshold may be set to measure whether high frequency energy concentrations are present in the noise spectrum to determine whether there are [ s ], [ z ], [ dz ], [ ts ] type utterances. In this implementation, the computer device may determine, in response to a ratio of high-frequency energy to low-frequency energy in the noise spectrum being greater than a target threshold, a speech presence probability corresponding to a high-frequency band as a first speech presence probability and a speech presence probability corresponding to a low-frequency band as a second speech presence probability, the first speech presence probability being less than the second speech presence probability.

Because the high-frequency energy of the impact noise is high, the voice existence probability corresponding to the high-frequency band can be set to a small value; to reduce voice damage, the voice existence probability corresponding to the low-frequency band is set to a large value.

In another possible implementation, the computer device may determine, in response to a ratio of the high-frequency energy to the low-frequency energy in the noise spectrum being greater than a target threshold, a noise existence probability in the noise spectrum as a target noise existence probability, where the target noise existence probability is a candidate noise existence probability with a smallest value.

The proportion of high-frequency energy to low-frequency energy in the noise spectrum can be obtained in various ways. Two possible implementations are provided below; the computer device may determine the proportion in either of these ways or in other ways, and the embodiment of the present disclosure does not limit which way is specifically adopted.

In the first mode, the computer equipment acquires the high-frequency energy mean value and the low-frequency energy mean value in the noise spectrum, and the ratio of the high-frequency energy mean value to the low-frequency energy mean value is used as the proportion of the high-frequency energy to the low-frequency energy in the noise spectrum.

For example, assume that the noise spectrum estimated in step 304 above is denoted by λ_t. The low-frequency energy mean Et_low and the high-frequency energy mean Et_high of the noise spectrum can be obtained by the following formula four and formula five.

The computer device can obtain the ratio slope_t of the high-frequency energy mean to the low-frequency energy mean by the following formula six:

slope_t = Et_high / Et_low (formula six)

If slope_t > T_t, the high-frequency energy of the signal is concentrated at this time and the probability that the signal is an utterance of the type [ s ], [ z ], [ dz ], [ ts ] is high, so the noise existence probability can be set to the minimum value.

Here, T_t is the target threshold. It may be set by a person skilled in the art as required and may be an empirical threshold; the specific value of the target threshold is not limited in the embodiment of the present disclosure.
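The first mode can be sketched as follows. Formula four and formula five are not reproduced in this text, so the way the two band means are computed here (a plain arithmetic mean on each side of a single boundary index) is an assumption, as are the names high_low_energy_ratio, split_bin and eps; only the ratio slope_t = Et_high / Et_low and its comparison with the target threshold T_t come from the description.

```python
def high_low_energy_ratio(noise_spectrum, split_bin, eps=1e-12):
    """Return slope_t = Et_high / Et_low for one frame's noise spectrum.

    split_bin is a hypothetical boundary: frequency points below it count as
    the low band, points at or above it count as the high band.
    eps only guards against division by zero.
    """
    low = noise_spectrum[:split_bin]
    high = noise_spectrum[split_bin:]
    if not low or not high:
        raise ValueError("split_bin must leave both bands non-empty")
    et_low = sum(low) / len(low)      # assumed form of the low-frequency mean
    et_high = sum(high) / len(high)   # assumed form of the high-frequency mean
    return et_high / (et_low + eps)


def is_high_frequency_concentrated(noise_spectrum, split_bin, target_threshold):
    """True when slope_t exceeds the target threshold T_t."""
    return high_low_energy_ratio(noise_spectrum, split_bin) > target_threshold
```

A frame for which is_high_frequency_concentrated returns True would be treated as likely containing an [ s ], [ z ], [ dz ], [ ts ] type utterance in the noise spectrum.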

In the second mode, the computer device determines whether noise exists at each of the multiple frequency points of the noise spectrum, and determines the proportion of high-frequency energy to low-frequency energy in the noise spectrum according to the noise existence result of each frequency point.

In the second mode, the proportion is determined according to whether each frequency point is noise. To determine whether noise exists at each frequency point, the computer device may compare the amplitude of each of the multiple frequency points of the noise spectrum with an amplitude threshold. For any frequency point, the computer device determines that the frequency point includes noise in response to its amplitude being greater than or equal to the amplitude threshold, and determines that the frequency point does not include noise in response to its amplitude being less than the amplitude threshold.

For example, for the noise spectrum λ_t of each frame, each frequency point is judged once, which can be realized by the following formula seven:

where T_t1 is the amplitude threshold. Whether noise exists at a frequency point is judged by whether the amplitude of that frequency point is smaller than the amplitude threshold.
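The per-frequency-point decision described above can be sketched in a few lines. Formula seven itself is not reproduced in this text, so the 1/0 encoding of the result and the function name noise_presence_flags are assumptions; the comparison rule (noise present when the amplitude is greater than or equal to T_t1) follows the description.

```python
def noise_presence_flags(noise_spectrum, amplitude_threshold):
    """Return one flag per frequency point: 1 if noise is judged present, else 0.

    A point is judged to contain noise when its amplitude is greater than or
    equal to amplitude_threshold (T_t1 in the description).
    """
    return [1 if amplitude >= amplitude_threshold else 0
            for amplitude in noise_spectrum]
```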

After determining the noise existence result of each frequency point, the computer device may perform weighted summation on the noise existence result of each frequency point according to the weight of each frequency point to obtain the ratio of high-frequency energy to low-frequency energy in the noise spectrum.

For example, a weight array w_i, i = 0, 1, …, 256, may be used, where the weights of frequency points 0-50 need to be smaller and the weights of frequency points 150-200 need to be larger (for example, the weight of frequency points 0-50 is 0.8, 51-100: 0.85, 101-150: 0.95, 151-200: 0.97, 201-256: 0.97). The noise existence results determined for the frequency points can be weighted and summed through the following formula eight to obtain the ratio D of high-frequency energy to low-frequency energy in the noise spectrum.
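A minimal sketch of this weighted summation, using the example weights listed above, is given below. Formula eight is not reproduced in this text, so the exact form of D is an assumption: the sketch takes D as the plain sum of weight times noise existence flag, without normalization. The function names are hypothetical.

```python
def example_weights():
    """Build the 257-entry example weight array w_i, i = 0..256."""
    # (last frequency point of the band, weight), taken from the example values.
    bands = [(50, 0.8), (100, 0.85), (150, 0.95), (200, 0.97), (256, 0.97)]
    weights = []
    for i in range(257):
        for upper, weight in bands:
            if i <= upper:
                weights.append(weight)
                break
    return weights


def weighted_presence_sum(flags, weights):
    """Assumed form of D: sum over i of w_i * flag_i (flag_i is 1 or 0)."""
    if len(flags) != len(weights):
        raise ValueError("flags and weights must have the same length")
    return sum(w * f for w, f in zip(weights, flags))
```

Because the higher frequency points carry larger weights, a frame whose noise flags cluster in the high band yields a larger D than a frame with the same number of flags in the low band.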

If D > T_d, the impact noise energy level of the frame is high, and the speech existence probability may therefore be set to PH1(i) = 0.7 for i = 0, …, 126 and PH1(i) = 0.1 for i = 127, …, 257. Here, T_d is the target threshold. T_t1 and T_d serve different purposes: the former is used to judge whether a given frequency point contains impact noise, and the latter is used to judge the energy level of the impact noise in the current frame. Because the high-frequency energy of the impact noise is high, PH1 of the high-frequency band can be set to a small value; to reduce voice damage, the PH1 value corresponding to the low-frequency band is relatively large.
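The band-wise assignment of PH1 can be sketched as follows. The values 0.7 and 0.1 and the boundary at frequency point 127 come from the example above; the function name, the default of 258 entries (indices 0 to 257, as written in the example), and returning None when D does not exceed T_d are assumptions, since the text does not state what happens in that case.

```python
def speech_presence_probability(d, target_threshold, num_bins=258, split_bin=127,
                                low_band_prob=0.7, high_band_prob=0.1):
    """Return PH1(i) for every frequency point when D > T_d, otherwise None.

    Points below split_bin get the larger low-band value (to limit voice
    damage); points at or above it get the smaller high-band value.
    Returning None for D <= T_d is an assumption meaning "leave PH1 unchanged".
    """
    if d <= target_threshold:
        return None
    return [low_band_prob if i < split_bin else high_band_prob
            for i in range(num_bins)]
```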

For example, as shown in fig. 7, after the adjustment in step 305, more human voice components are removed from the adjusted noise spectrum and the noise spectrum is more accurate, so performing the noise removal step with such a noise spectrum can achieve a better noise reduction effect.

And step two, extracting the noise in the noise spectrum according to the existence probability of the voice to obtain the adjusted noise spectrum.

After the computer device determines the speech existence probability, it may extract the noise again to obtain the adjusted noise spectrum. It should be noted that this process may also be implemented by the OMLSA calculation and the IMCRA calculation in the same way as step 304 described above, and details are not repeated here.

306. The computer device removes the noise in the audio frame to be denoised according to the adjusted noise spectrum to obtain the target audio frame.

After obtaining the adjusted noise spectrum, the computer device can remove the noise to obtain the noise-reduced target audio frame. Specifically, the computer device may perform the OMLSA calculation and the IMCRA calculation on the audio frame to be denoised, using the adjusted noise spectrum as part of the noise spectrum, to obtain the target audio frame. The process is the same as that in step 304 and is not repeated here.
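To give a sense of how a speech presence probability enters the final gain, the sketch below uses the gain form commonly given for OMLSA, in which the gain under the speech-present hypothesis is raised to the power p and a gain floor is raised to the power 1 - p. This is an illustration under stated assumptions rather than the computation of the disclosure: the speech-present gain is taken as an input instead of being derived from the noise spectrum, and the floor value 0.1 and the function names are hypothetical.

```python
def omlsa_gain(gain_h1, speech_prob, gain_min=0.1):
    """Combine a speech-present gain with a gain floor, one frequency point at a time.

    gain = gain_h1 ** p * gain_min ** (1 - p)
    p == 1 returns gain_h1 unchanged; p == 0 returns the floor gain_min.
    gain_min = 0.1 is an illustrative value only.
    """
    return [(g ** p) * (gain_min ** (1.0 - p))
            for g, p in zip(gain_h1, speech_prob)]


def apply_gain(noisy_amplitude, gain):
    """Scale each frequency point of the noisy amplitude spectrum by its gain."""
    return [a * g for a, g in zip(noisy_amplitude, gain)]
```

Under this form, a high-frequency point given a small speech presence probability is attenuated close to the floor, while a low-frequency point given a large probability keeps most of the speech-present gain.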

For example, fig. 8 shows a target audio frame obtained in the related art by reducing noise directly from audio frame information, and fig. 9 shows a target audio frame obtained after noise reduction by the method provided by the present disclosure. It can be seen from the figures that the noise reduction method provided by the present disclosure removes the high-frequency impact noise while better retaining the human voice component: high-frequency plosives and the like are not removed, voice damage is greatly reduced, and the noise reduction effect is better. Moreover, the noise reduction method provided by the present disclosure performs noise reduction according to the low-frequency and high-frequency distribution characteristics of the impact noise without referring to the audio information of many frames, so the impact noise can be effectively suppressed with only a few delay frames and the voice is less damaged.

The foregoing steps 305 to 306 are a process of removing the noise in the audio frame to be denoised according to the noise spectrum to obtain the target audio frame. The above description only takes as an example the case in which, after the noise spectrum is extracted according to the noise existence probability, the noise spectrum is further adjusted based on its high-frequency and low-frequency energy distribution before the noise is removed. In a possible implementation, after step 304, the computer device may instead perform the noise removal step directly with the noise spectrum obtained in step 304 to obtain the target audio frame. The embodiments of the present disclosure are not limited thereto.

Fig. 10 is a schematic structural diagram of an audio noise reduction device provided in an embodiment of the present disclosure, and referring to fig. 10, the device includes:

a determining module 1001, configured to determine a noise existence probability of the audio frame to be denoised according to a ratio of low-frequency energy to high-frequency energy in the audio frame to be denoised;

an extracting module 1002, configured to extract a noise spectrum of the audio frame to be denoised according to the noise existence probability;

and a denoising module 1003, configured to remove noise in the audio frame to be denoised according to the noise spectrum, to obtain a target audio frame.

In one possible implementation, the determining module 1001 is configured to:

In one possible implementation, the determining module 1001 is further configured to:

In one possible implementation, the noise reduction module 1003 is configured to:

In a possible implementation manner, the denoising module 1003 is configured to perform weighted summation on the noise existence result of each frequency point according to the weight of each frequency point, so as to obtain a ratio of high-frequency energy to low-frequency energy in the noise spectrum.

The device provided by the embodiment of the disclosure determines the existence probability of noise according to the characteristics of low-frequency and high-frequency distribution of impact noise, so that a noise spectrum is extracted to perform a noise reduction step, a plurality of frames of audio information are not needed to analyze the noise attenuation condition, the delay of the noise reduction step of an audio frame to be subjected to noise reduction is small, the impact noise can be effectively inhibited under the number of small delay frames, and the noise reduction effect is better.

It should be noted that: in the audio noise reduction apparatus provided in the above embodiment, when reducing noise, only the division of the above functional modules is used for illustration, and in practical applications, the above functions may be distributed by different functional modules according to needs, that is, the internal structure of the device is divided into different functional modules to complete all or part of the above described functions. In addition, the audio noise reduction device and the audio noise reduction method provided by the above embodiments belong to the same concept, and specific implementation processes thereof are detailed in the method embodiments and are not described herein again.

Fig. 11 is a schematic structural diagram of a terminal according to an embodiment of the present disclosure. The terminal 1100 may be: a smart phone, a tablet computer, an MP3 (Moving Picture Experts Group Audio Layer III) player, an MP4 (Moving Picture Experts Group Audio Layer IV) player, a notebook computer, or a desktop computer. Terminal 1100 may also be referred to by other names such as user equipment, portable terminal, laptop terminal, desktop terminal, and so forth. The terminal may also be an embedded intelligent voice terminal device installed on a central control console.

In general, terminal 1100 includes: one or more processors 1101 and one or more memories 1102.

Processor 1101 may include one or more processing cores, such as a 4-core processor, an 8-core processor, or the like. The processor 1101 may be implemented in at least one hardware form of a DSP (Digital Signal Processing), an FPGA (Field-Programmable Gate Array), and a PLA (Programmable Logic Array). The processor 1101 may also include a main processor and a coprocessor, where the main processor is a processor for Processing data in an awake state, and is also called a Central Processing Unit (CPU); a coprocessor is a low power processor for processing data in a standby state. In some embodiments, the processor 1101 may be integrated with a GPU (Graphics Processing Unit) that is responsible for rendering and drawing the content that the display screen needs to display. In some embodiments, the processor 1101 may further include an AI (Artificial Intelligence) processor for processing computing operations related to machine learning.

Memory 1102 may include one or more computer-readable storage media, which may be non-transitory. Memory 1102 can also include high-speed random access memory, as well as non-volatile memory, such as one or more magnetic disk storage devices, flash memory storage devices. In some embodiments, a non-transitory computer readable storage medium in memory 1102 is used to store at least one program code for execution by processor 1101 to implement the audio noise reduction methods provided by method embodiments in the present disclosure.

In some embodiments, the terminal 1100 may further include: a peripheral interface 1103 and at least one peripheral. The processor 1101, memory 1102 and peripheral interface 1103 may be connected by a bus or signal lines. Various peripheral devices may be connected to the peripheral interface 1103 by buses, signal lines, or circuit boards. Specifically, the peripheral device includes: at least one of radio frequency circuitry 1104, display screen 1105, camera assembly 1106, audio circuitry 1107, positioning assembly 1108, and power supply 1109.

The peripheral interface 1103 may be used to connect at least one peripheral associated with I/O (Input/Output) to the processor 1101 and the memory 1102. In some embodiments, the processor 1101, memory 1102, and peripheral interface 1103 are integrated on the same chip or circuit board; in some other embodiments, any one or two of the processor 1101, the memory 1102 and the peripheral device interface 1103 may be implemented on separate chips or circuit boards, which is not limited by this embodiment.

The Radio Frequency circuit 1104 is used to receive and transmit RF (Radio Frequency) signals, also called electromagnetic signals. The radio frequency circuit 1104 communicates with communication networks and other communication devices via electromagnetic signals. The radio frequency circuit 1104 converts an electric signal into an electromagnetic signal to transmit, or converts a received electromagnetic signal into an electric signal. Optionally, the radio frequency circuit 1104 includes: an antenna system, an RF transceiver, one or more amplifiers, a tuner, an oscillator, a digital signal processor, a codec chipset, a subscriber identity module card, and so forth. The radio frequency circuit 1104 may communicate with other terminals via at least one wireless communication protocol. The wireless communication protocols include, but are not limited to: metropolitan area networks, various generation mobile communication networks (2G, 3G, 4G, and 5G), Wireless local area networks, and/or WiFi (Wireless Fidelity) networks. In some embodiments, the radio frequency circuit 1104 may also include NFC (Near Field Communication) related circuits, which are not limited by this disclosure.

The display screen 1105 is used to display a UI (User Interface). The UI may include graphics, text, icons, video, and any combination thereof. When the display screen 1105 is a touch display screen, the display screen 1105 also has the ability to capture touch signals on or over the surface of the display screen 1105. The touch signal may be input to the processor 1101 as a control signal for processing. At this point, the display screen 1105 may also be used to provide virtual buttons and/or a virtual keyboard, also referred to as soft buttons and/or a soft keyboard. In some embodiments, display 1105 may be one, providing the front panel of terminal 1100; in other embodiments, the display screens 1105 can be at least two, respectively disposed on different surfaces of the terminal 1100 or in a folded design; in still other embodiments, display 1105 can be a flexible display disposed on a curved surface or on a folded surface of terminal 1100. Even further, the display screen 1105 may be arranged in a non-rectangular irregular pattern, i.e., a shaped screen. The Display screen 1105 may be made of LCD (Liquid Crystal Display), OLED (Organic Light-Emitting Diode), and the like.

Camera assembly 1106 is used to capture images or video. Optionally, camera assembly 1106 includes a front camera and a rear camera. Generally, the front camera is disposed on the front panel of the terminal, and the rear camera is disposed on the rear surface of the terminal. In some embodiments, there are at least two rear cameras, each being any one of a main camera, a depth-of-field camera, a wide-angle camera, and a telephoto camera, so that the main camera and the depth-of-field camera can be fused to realize a background blurring function, and the main camera and the wide-angle camera can be fused to realize panoramic shooting and VR (Virtual Reality) shooting functions or other fusion shooting functions. In some embodiments, camera assembly 1106 may also include a flash. The flash may be a single-color-temperature flash or a dual-color-temperature flash. A dual-color-temperature flash is a combination of a warm-light flash and a cold-light flash and can be used for light compensation at different color temperatures.

The audio circuitry 1107 may include a microphone and a speaker. The microphone is used for collecting sound waves of a user and the environment, converting the sound waves into electric signals, and inputting the electric signals to the processor 1101 for processing or inputting the electric signals to the radio frequency circuit 1104 to achieve voice communication. For stereo capture or noise reduction purposes, multiple microphones may be provided, each at a different location of terminal 1100. The microphone may also be an array microphone or an omni-directional pick-up microphone. The speaker is used to convert electrical signals from the processor 1101 or the radio frequency circuit 1104 into sound waves. The loudspeaker can be a traditional film loudspeaker or a piezoelectric ceramic loudspeaker. When the speaker is a piezoelectric ceramic speaker, the speaker can be used for purposes such as converting an electric signal into a sound wave audible to a human being, or converting an electric signal into a sound wave inaudible to a human being to measure a distance. In some embodiments, the audio circuitry 1107 may also include a headphone jack.

Positioning component 1108 is used to locate the current geographic position of terminal 1100 for purposes of navigation or LBS (Location Based Service). The positioning component 1108 may be based on the GPS (Global Positioning System) of the United States, the BeiDou system of China, the GLONASS system of Russia, or the Galileo system of the European Union.

Power supply 1109 is configured to provide power to various components within terminal 1100. The power supply 1109 may be alternating current, direct current, disposable or rechargeable. When the power supply 1109 includes a rechargeable battery, the rechargeable battery may support wired or wireless charging. The rechargeable battery may also be used to support fast charge technology.

In some embodiments, terminal 1100 can also include one or more sensors 1110. The one or more sensors 1110 include, but are not limited to: acceleration sensor 1111, gyro sensor 1112, pressure sensor 1113, fingerprint sensor 1114, optical sensor 1115, and proximity sensor 1116.

Acceleration sensor 1111 may detect acceleration levels in three coordinate axes of a coordinate system established with terminal 1100. For example, the acceleration sensor 1111 may be configured to detect components of the gravitational acceleration in three coordinate axes. The processor 1101 may control the display screen 1105 to display the user interface in a landscape view or a portrait view according to the gravitational acceleration signal collected by the acceleration sensor 1111. The acceleration sensor 1111 may also be used for acquisition of motion data of a game or a user.

The gyro sensor 1112 may detect a body direction and a rotation angle of the terminal 1100, and the gyro sensor 1112 may cooperate with the acceleration sensor 1111 to acquire a 3D motion of the user with respect to the terminal 1100. From the data collected by gyroscope sensor 1112, processor 1101 may implement the following functions: motion sensing (such as changing the UI according to a user's tilting operation), image stabilization at the time of photographing, game control, and inertial navigation.

Pressure sensor 1113 may be disposed on a side bezel of terminal 1100 and/or underlying display screen 1105. When the pressure sensor 1113 is disposed on the side frame of the terminal 1100, the holding signal of the terminal 1100 from the user can be detected, and the processor 1101 performs left-right hand recognition or shortcut operation according to the holding signal collected by the pressure sensor 1113. When the pressure sensor 1113 is disposed at the lower layer of the display screen 1105, the processor 1101 controls the operability control on the UI interface according to the pressure operation of the user on the display screen 1105. The operability control comprises at least one of a button control, a scroll bar control, an icon control and a menu control.

The fingerprint sensor 1114 is configured to collect a fingerprint of the user, and the processor 1101 identifies the user according to the fingerprint collected by the fingerprint sensor 1114, or the fingerprint sensor 1114 identifies the user according to the collected fingerprint. Upon recognizing that the user's identity is a trusted identity, the processor 1101 authorizes the user to perform relevant sensitive operations, including unlocking the screen, viewing encrypted information, downloading software, making payments, and changing settings. Fingerprint sensor 1114 may be disposed on the front, back, or side of terminal 1100. When a physical button or vendor logo is provided on the terminal 1100, the fingerprint sensor 1114 may be integrated with the physical button or vendor logo.

Optical sensor 1115 is used to collect ambient light intensity. In one embodiment, the processor 1101 may control the display brightness of the display screen 1105 based on the ambient light intensity collected by the optical sensor 1115. Specifically, when the ambient light intensity is high, the display brightness of the display screen 1105 is increased; when the ambient light intensity is low, the display brightness of the display screen 1105 is reduced. In another embodiment, processor 1101 may also dynamically adjust the shooting parameters of camera assembly 1106 based on the ambient light intensity collected by optical sensor 1115.

Proximity sensor 1116, also referred to as a distance sensor, is typically disposed on a front panel of terminal 1100. Proximity sensor 1116 is used to capture the distance between the user and the front face of terminal 1100. In one embodiment, when the proximity sensor 1116 detects that the distance between the user and the front face of the terminal 1100 gradually decreases, the processor 1101 controls the display screen 1105 to switch from a screen-on state to a screen-off state; when the proximity sensor 1116 detects that the distance between the user and the front face of the terminal 1100 gradually increases, the processor 1101 controls the display screen 1105 to switch from the screen-off state to the screen-on state.

Those skilled in the art will appreciate that the configuration shown in fig. 11 does not constitute a limitation of terminal 1100, and may include more or fewer components than those shown, or may combine certain components, or may employ a different arrangement of components.

Fig. 12 is a schematic structural diagram of a server 1200 according to an embodiment of the present disclosure. The server 1200 may vary considerably depending on its configuration or performance, and may include one or more processors (CPUs) 1201 and one or more memories 1202, where at least one program code is stored in the one or more memories 1202 and is loaded and executed by the one or more processors 1201 to implement the audio noise reduction methods provided by the foregoing method embodiments. Of course, the server 1200 may further have components such as a wired or wireless network interface, a keyboard, and an input/output interface for performing input and output, and the server 1200 may further include other components for implementing the functions of the device, which are not described herein again.

In an exemplary embodiment, a computer readable storage medium, such as a memory, including program code executable by a processor to perform the audio noise reduction method in the above embodiments is also provided. For example, the computer-readable storage medium may be a Read-Only Memory (ROM), a Random Access Memory (RAM), a Compact Disc Read-Only Memory (CD-ROM), a magnetic tape, a floppy disk, an optical data storage device, and the like.

In an exemplary embodiment, a computer program product or a computer program is also provided, which comprises one or more program codes stored in a computer-readable storage medium. The one or more program codes can be read by one or more processors of the computer device from a computer-readable storage medium, and the one or more processors execute the one or more program codes, so that the computer device can execute the above-described audio noise reduction method.

It will be understood by those skilled in the art that all or part of the steps for implementing the above embodiments may be implemented by hardware, or may be implemented by a program instructing relevant hardware, and the program may be stored in a computer-readable storage medium, and the above-mentioned storage medium may be a read-only memory, a magnetic disk or an optical disk, etc.

The foregoing is considered as illustrative of the embodiments of the disclosure and is not to be construed as limiting thereof, and any modifications, equivalents, improvements and the like made within the spirit and principle of the disclosure are intended to be included within the scope of the disclosure.

Claims

1. A method for audio noise reduction, the method comprising:

2. The method according to claim 1, wherein the determining the noise existence probability of the audio frame to be denoised according to the ratio of low-frequency energy to high-frequency energy in the audio frame to be denoised comprises:

3. The method according to claim 1, wherein the determining of the ratio of low frequency energy and high frequency energy in the audio frame to be denoised comprises:

4. The method according to claim 1, wherein the removing noise in the audio frame to be denoised according to the noise spectrum to obtain a target audio frame comprises:

5. The method of claim 4, wherein the adjusting the noise spectrum according to the ratio of the high frequency energy to the low frequency energy in the noise spectrum comprises:

6. The method of claim 5, wherein determining the speech presence probability of the noise spectrum according to the ratio of the high frequency energy to the low frequency energy in the noise spectrum comprises:

7. The method of claim 4, wherein the determining the ratio of high frequency energy to low frequency energy in the noise spectrum comprises:

8. The method of claim 4, wherein the determining the ratio of high frequency energy to low frequency energy in the noise spectrum comprises:

9. The method according to claim 8, wherein the determining the ratio of the high-frequency energy to the low-frequency energy in the noise spectrum according to the noise existence result of each frequency point comprises:

10. The method of claim 8, wherein determining whether noise is present in each of the plurality of frequency bins of the noise spectrum comprises:

11. An audio noise reduction apparatus, characterized in that the apparatus comprises a plurality of functional modules for performing the audio noise reduction method of any one of claims 1 to 10.

12. A computer device comprising one or more processors and one or more memories having at least one program code stored therein, the program code being loaded and executed by the one or more processors to implement the operations performed by the audio noise reduction method of any of claims 1 to 10.

13. A computer-readable storage medium having at least one program code stored therein, the program code being loaded and executed by a processor to perform operations performed by the audio noise reduction method of any of claims 1 to 10.