CN113270107A

CN113270107A - Method and device for acquiring noise loudness in audio signal and electronic equipment

Info

Publication number: CN113270107A
Application number: CN202110395202.0A
Authority: CN
Inventors: 吴晨晨
Original assignee: Vivo Mobile Communication Co Ltd
Current assignee: Vivo Mobile Communication Co Ltd
Priority date: 2021-04-13
Filing date: 2021-04-13
Publication date: 2021-08-17
Anticipated expiration: 2041-04-13
Also published as: WO2022218252A1; CN113270107B

Abstract

The application discloses a method and a device for acquiring noise loudness in an audio signal and electronic equipment, and belongs to the technical field of audio signal processing. The method for acquiring the loudness of the noise in the audio signal comprises the following steps: acquiring N subband power spectrums of N audio frames in a target audio signal; obtaining noise power spectrum estimation corresponding to each audio frame in N audio frames according to M target power spectrums in each sub-band power spectrum in the N sub-band power spectrums; carrying out smooth updating processing on the noise power spectrum estimation; performing compensation and correction processing on the processed noise power spectrum estimation to obtain the noise loudness of the target audio signal; wherein N is an integer greater than or equal to 1, and M is an integer greater than or equal to 2.

Description

Method and device for acquiring noise loudness in audio signal and electronic equipment

Technical Field

The application belongs to the technical field of audio signal processing, and particularly relates to a method and a device for acquiring noise loudness in an audio signal and electronic equipment.

Background

With the development of science and technology, electronic devices capable of realizing a call function are widely used. A speech enhancement algorithm can remove most of the interfering audio (i.e., noise) in a call, and is therefore very important for improving the call quality of electronic equipment.

Noise estimation is one of the crucial elements in speech enhancement. Noise estimation refers to estimating the loudness of noise (i.e., the power spectrum of the noise) in audio signals generated and transmitted during a voice call. Accurate noise estimation of audio signals is a precondition for ensuring the speech enhancement effect.

A disadvantage in the related art is that the degree of deviation between the estimated value of the loudness of noise in an audio signal and the true value is large.

Disclosure of Invention

The embodiment of the application aims to provide a method and a device for acquiring noise loudness in an audio signal and electronic equipment, which can solve the technical problem that when a minimum tracking method is adopted for noise estimation, the deviation degree of a noise estimation result is large.

In a first aspect, an embodiment of the present application provides a method for obtaining a loudness of noise in an audio signal, including: acquiring N subband power spectrums of N audio frames in a target audio signal; obtaining noise power spectrum estimation corresponding to each audio frame in N audio frames according to M target power spectrums in each sub-band power spectrum in the N sub-band power spectrums; carrying out smooth updating processing on the noise power spectrum estimation; performing compensation and correction processing on the processed noise power spectrum estimation to obtain the noise loudness of the target audio signal; wherein N is an integer greater than or equal to 1, and M is an integer greater than or equal to 2.

In a second aspect, an embodiment of the present application provides an apparatus for obtaining loudness of noise in an audio signal, including: the acquisition module is used for acquiring N sub-band power spectrums of N audio frames in a target audio signal; the estimation module is used for obtaining noise power spectrum estimation corresponding to each audio frame in the N audio frames according to the M target power spectrums in each sub-band power spectrum in the N sub-band power spectrums acquired by the acquisition module; the updating module is used for carrying out smooth updating processing on the noise power spectrum estimation obtained by the estimating module; the correction module is used for carrying out compensation correction processing on the noise power spectrum estimation processed by the updating module to obtain the noise loudness of the target audio signal; wherein N is an integer greater than or equal to 1, and M is an integer greater than or equal to 2.

In a third aspect, embodiments of the present application provide an electronic device, which includes a processor, a memory, and a program or instructions stored on the memory and executable on the processor, where the program or instructions, when executed by the processor, implement the steps of the method according to the first aspect.

In a fourth aspect, embodiments of the present application provide a readable storage medium on which a program or instructions are stored, which when executed by a processor, implement the steps of the method according to the first aspect.

In a fifth aspect, embodiments of the present application provide a chip, where the chip includes a processor and a communication interface, where the communication interface is coupled to the processor, and the processor is configured to execute a program or instructions to implement the method according to the first aspect.

In this embodiment of the present application, after N subband power spectrums of N audio frames in a target audio signal are obtained, a noise power spectrum estimation corresponding to each of the N audio frames may be obtained according to M target power spectrums in each of the N subband power spectrums. Smooth updating processing is then performed on the noise power spectrum estimation, and compensation and correction processing is performed on the processed noise power spectrum estimation to obtain the noise loudness of the target audio signal. Here N is an integer greater than or equal to 1, and M is an integer greater than or equal to 2. In other words, the obtaining method provided by the embodiment of the present application performs statistics on two or more target power spectrums and obtains the noise power spectrum estimation from them. The method can therefore effectively reduce the deviation between the noise estimate (namely, the estimated noise loudness) and the true noise value, thereby improving the speech enhancement effect and the voice call quality of the electronic equipment.

Drawings

FIG. 1 is a flowchart illustrating steps of a method for obtaining a loudness of noise in an audio signal according to an embodiment of the present application;

FIG. 2 is a second flowchart of the steps of the method for obtaining the loudness of noise in an audio signal according to an embodiment of the present application;

FIG. 3 is a third flowchart of the steps of the method for obtaining the loudness of noise in an audio signal according to an embodiment of the present application;

FIG. 4 is a fourth flowchart of the steps of the method for obtaining the loudness of noise in an audio signal according to an embodiment of the present application;

FIG. 5 is a first schematic diagram of the composition of an apparatus for obtaining the loudness of noise in an audio signal according to an embodiment of the present application;

FIG. 6 is a second schematic diagram of the composition of the apparatus for obtaining the loudness of noise in an audio signal according to an embodiment of the present application;

FIG. 7 is a schematic diagram of an electronic device according to an embodiment of the present application;

FIG. 8 is a second schematic diagram of the electronic device according to an embodiment of the present application.

Detailed Description

The technical solutions in the embodiments of the present application will be described clearly below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are some, but not all, embodiments of the present application. All other embodiments that can be derived by one of ordinary skill in the art from the embodiments given herein are intended to be within the scope of the present disclosure.

The terms "first", "second" and the like in the description and claims of the present application are used to distinguish between similar elements and not necessarily to describe a particular sequential or chronological order. It will be appreciated that the data so used may be interchanged under appropriate circumstances, so that embodiments of the application may be practiced in sequences other than those illustrated or described herein. Objects distinguished by "first", "second" and the like are generally of one type, and their number is not limited; for example, the first object may be one or more than one. In addition, "and/or" in the specification and claims means at least one of the connected objects, and the character "/" generally indicates that the preceding and following related objects are in an "or" relationship.

The following describes in detail, with reference to the accompanying drawings, a method, an apparatus, and an electronic device for obtaining a loudness of noise in an audio signal provided in an embodiment of the present application through specific embodiments and application scenarios thereof.

With the development of science and technology, electronic devices such as mobile phones, personal computers and smart watches capable of realizing a call function are widely used. In order to improve the call quality, it is necessary to adopt a speech enhancement algorithm to remove or filter the interfering audio (i.e., noise) in the call.

To achieve speech enhancement, the loudness of the noise in the call needs to be estimated (i.e., noise estimation). The methods for implementing noise estimation in the related art may include the following methods:

The first method performs noise estimation with a time-recursive averaging algorithm. It cannot handle changes in the background noise during speech segments, so its noise estimate is not updated in time.

The second method performs noise estimation based on a histogram. Its statistics are computed within a fixed window length and must be recomputed over all frequency bands, so its computational cost is high.

The third method performs noise estimation with a minimum tracking algorithm, which obtains a rough estimate of the noise level by tracking the minimum power spectral band (i.e., the minimum of the spectral power) of each audio frame. It has low delay and a reasonable computational cost, and is a relatively ideal noise estimation method.

However, the minimum tracking method has a disadvantage that the degree of deviation of the noise estimation value from the true value of the noise is relatively large. Therefore, how to reduce the deviation degree when the minimum tracking method is used to estimate the noise is an urgent technical problem to be solved by those skilled in the art.

To solve the foregoing technical problem, an embodiment of the present application provides a method for obtaining the loudness of noise in an audio signal. The method may be executed by an obtaining device, which may be an electronic device, or a functional module and/or functional entity in the electronic device that is capable of implementing the method. This may be determined according to actual use requirements and is not limited in the embodiment of the present application. For clarity, the following method embodiments take the electronic device as the obtaining device, that is, as the execution subject of the obtaining method.

The electronic device in the embodiment of the present application may be a mobile phone, a tablet computer, a notebook computer, a palm computer, a vehicle-mounted electronic device, a wearable device, an ultra-mobile personal computer (UMPC), a netbook, or a Personal Digital Assistant (PDA), or may be another type of electronic device, which is not limited in the embodiment of the present application.

The following describes in detail a method for acquiring a loudness of noise in an audio signal according to an embodiment of the present application, by taking various embodiments as examples.

As shown in fig. 1, an embodiment of the present application provides an obtaining method of a loudness of noise in an audio signal, where the obtaining method includes the following steps S101 to S104:

S101, the obtaining device obtains N sub-band power spectrums of N audio frames in a target audio signal.

Optionally, N is an integer greater than or equal to 1.

Optionally, the voice call audio signal may be acquired through a microphone (Mic) of the electronic device, or through a main microphone and an auxiliary microphone of the electronic device. This may be determined according to actual use requirements and is not limited in the embodiment of the present application.

Alternatively, the target audio signal may be a voice audio signal transmitted by an electrical signal, and may also be an audio signal transmitted based on an internet protocol.

It is to be understood that the target audio signal is an audio signal in a voice call audio signal, and the target audio signal may include both a voice audio signal and a noise audio signal.

Alternatively, the target audio signal may be a part of the voice call audio signal or the entire audio signal. The method can be determined according to actual use requirements, and the embodiment of the application is not limited.

It is understood that the noise audio signal may include a noise audio signal generated due to a noisy surrounding environment during a voice call, and may also include a noise audio signal caused by an unsatisfactory communication transmission quality during transmission of the target audio signal.

Alternatively, the noise audio signal may be a steady-state noise audio signal. A steady-state noise audio signal may be a noise audio signal whose repetition frequency is greater than a preset frequency (e.g., 10 Hz), or a noise audio signal whose sound level fluctuation during measurement is not greater than a preset decibel value (e.g., 3 dB).

It is to be understood that the N audio frames may be N consecutive audio frames, or may be N non-consecutive audio frames.

Alternatively, the N audio frames may be a part of the audio frames or all of the audio frames in the target audio signal. The method can be determined according to actual use requirements, and the embodiment of the application is not limited.

It is to be understood that each of the N audio frames may correspond to one subband power spectrum, and the N audio frames may correspond to N subband power spectrums.

Alternatively, each of the N subband power spectrums may be a logarithmic value (i.e., log value) of the subband spectral power.

It is understood that each of the N sub-band power spectrums can be used to characterize how the power of the audio signal in each sub-band is distributed over frequency.
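
As an illustration of S101, the following sketch computes a logarithmic sub-band power spectrum for a single audio frame. The function name `subband_log_power`, the naive DFT, the equal-width grouping of frequency bins into sub-bands, and the use of 10·log10 as the log value are all assumptions made for illustration; the application does not specify them.

```python
import cmath
import math


def subband_log_power(frame, num_subbands, floor=1e-12):
    """Return the log power (in dB) of each sub-band for one audio frame.

    Hypothetical helper: the text only states that each sub-band power
    spectrum may be a log value of the sub-band spectral power.
    """
    n = len(frame)
    half = n // 2 + 1  # non-negative frequency bins of a real-valued frame
    if not 1 <= num_subbands <= half:
        raise ValueError("num_subbands must be between 1 and the number of bins")
    # Naive DFT, adequate for a sketch; a practical system would use an FFT.
    power = []
    for k in range(half):
        acc = sum(x * cmath.exp(-2j * math.pi * k * t / n) for t, x in enumerate(frame))
        power.append(abs(acc) ** 2 / n)
    # Assumed layout: contiguous, roughly equal-width sub-bands.
    edges = [i * half // num_subbands for i in range(num_subbands + 1)]
    bands = [sum(power[lo:hi]) for lo, hi in zip(edges, edges[1:])]
    # The floor avoids taking the logarithm of zero in silent sub-bands.
    return [10.0 * math.log10(max(p, floor)) for p in bands]


# A 16-sample sine at DFT bin 2 puts its energy in the lowest of 3 sub-bands.
frame = [math.sin(2 * math.pi * 2 * t / 16) for t in range(16)]
spectrum = subband_log_power(frame, num_subbands=3)
```

Repeating this for each of the N audio frames would yield the N sub-band power spectrums referred to in S101.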

S102, the obtaining device obtains noise power spectrum estimation corresponding to each audio frame in the N audio frames according to the M target power spectrums in each sub-band power spectrum in the N sub-band power spectrums.

Optionally, M is an integer greater than or equal to 2.

It can be understood that the M target power spectrums are power spectrums used for obtaining noise power spectrum estimates corresponding to each of the N audio frames.

Optionally, the specific number of the M target power spectrums may be determined according to the number of the N audio frames, and may also be determined according to a power value of each subband power spectrum in the N subband power spectrums. The method can be determined according to actual use requirements, and the embodiment of the application is not limited.

Alternatively, the noise power spectrum estimate may be derived from the M target power spectra based on the principle and algorithm of minimum tracking. Specifically, the principle of minimum tracking is: it is assumed that the noisy speech power of a single frequency band will typically be attenuated to the power level of the noise, even during speech activity. Accordingly, the electronic device of the embodiment of the application can obtain the noise power spectrum estimation corresponding to each audio frame in the N audio frames according to the M target power spectrums in each sub-band power spectrum in the N sub-band power spectrums.

Optionally, the M target power spectrums are power spectrums falling within a preset power spectrum interval. The M target power spectrums may be all or a part of the power spectrums falling within a preset power spectrum interval.

It can be understood that, based on the basic principle of minimum tracking, the preset power spectrum interval is a power spectrum interval with relatively small or relatively minimum power value in each sub-band power spectrum.

It is understood that the preset power spectrum interval includes at least two power spectrums.

Alternatively, the preset power spectrum interval may be the set of the first P% or the first Q power spectrums of each sub-band power spectrum when sorted in ascending order (from low to high), consistent with the interval being the one with relatively small power values.

Alternatively, the preset power spectrum interval may be the set of the last R% or the last S power spectrums of each sub-band power spectrum when sorted from high to low.

It is understood that P and R are positive numbers, and Q and S are positive integers.

For example, the preset power spectrum interval may be the set of the first 10% or 25% of the power spectrums of each sub-band power spectrum sorted in ascending order, or the set of the first 50 or 80 power spectrums of each sub-band power spectrum sorted in ascending order. This may be determined according to actual use requirements and is not limited in the embodiment of the present application.

For example, the preset power spectrum interval may be the set of the last 5% or 8% of the power spectrums of each sub-band power spectrum sorted from high to low, or the set of the last 30 or last 20 power spectrums of each sub-band power spectrum sorted from high to low. This may be determined according to actual use requirements and is not limited in the embodiment of the present application.
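
The selection of the M target power spectrums from the low-power interval can be sketched as follows. The function name `select_target_powers` and its `fraction` and `count` parameters are hypothetical; the sketch assumes the preset interval is simply the lowest share, or the lowest fixed number, of the values in one sub-band power spectrum.

```python
import math


def select_target_powers(subband_powers, fraction=None, count=None):
    """Return the M smallest power values of one sub-band power spectrum.

    Exactly one of `fraction` (e.g. 0.25 for the lowest 25%) or `count`
    (e.g. 20 for the 20 smallest values) must be given.
    """
    if (fraction is None) == (count is None):
        raise ValueError("give exactly one of fraction or count")
    ordered = sorted(subband_powers)  # ascending order: smallest values first
    if fraction is not None:
        count = math.ceil(fraction * len(ordered))
    # The method requires M >= 2, so at least two values are kept when available.
    m = max(2, min(count, len(ordered)))
    return ordered[:m]


powers = [5.0, 1.0, 4.0, 2.0, 3.0, 9.0, 8.0, 7.0]
lowest_quarter = select_target_powers(powers, fraction=0.25)
lowest_three = select_target_powers(powers, count=3)
```

Taking the last values of a high-to-low ordering gives the same set as taking the first values of a low-to-high ordering, so a single ascending sort covers both descriptions.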

Further optionally, as shown in fig. 2, S102 includes the following S1021 to S1022:

S1021, the obtaining device assigns a weight value to each of the M target power spectrums.

Optionally, the weight values assigned to the respective target power spectrums may be equal or unequal. This may be determined according to actual use requirements and is not limited in the embodiment of the present application.

S1022, the obtaining device obtains the noise power spectrum estimation according to the M target power spectrums and the weight values.

It can be understood that the electronic device obtains the noise power spectrum estimation from the M target power spectrums and the weight values by means of a weighted average formula.

The purpose of assigning a weight value to each of the M target power spectrums is to introduce a weight distribution mechanism into the process of obtaining the noise power spectrum estimation. On this basis, the electronic device obtains the noise power spectrum estimation according to the M target power spectrums and the weight values. By reasonably distributing the weight of each of the M target power spectrums, the electronic device can further reduce the deviation of the noise power spectrum estimation from the true value of the noise power spectrum.
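
A minimal sketch of S1021 to S1022, assuming a plain normalized weighted average; the function name `weighted_noise_estimate` and the example weights are hypothetical, and the application does not prescribe how the weights are chosen.

```python
def weighted_noise_estimate(target_powers, weights=None):
    """Weighted average of the M target power values.

    With `weights=None` every target power spectrum gets the same weight,
    which reduces to an ordinary mean.
    """
    if weights is None:
        weights = [1.0] * len(target_powers)
    if len(weights) != len(target_powers):
        raise ValueError("one weight is needed per target power value")
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    return sum(w * p for w, p in zip(weights, target_powers)) / total


targets = [1.0, 2.0, 3.0]
equal_estimate = weighted_noise_estimate(targets)
# Hypothetical unequal weights that favor the smallest values.
skewed_estimate = weighted_noise_estimate(targets, weights=[3.0, 2.0, 1.0])
```

Because the estimate averages several small values instead of taking the single minimum, it lies at or above the minimum, which is one way the downward bias of pure minimum tracking could be reduced.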

S103, the obtaining device performs smooth update processing on the noise power spectrum estimation.

Optionally, the electronic device performs the smooth update processing on the noise power spectrum estimation by using a smoothing factor.

S104, the obtaining device performs compensation and correction processing on the processed noise power spectrum estimation to obtain the noise loudness of the target audio signal.

Optionally, the electronic device performs the compensation and correction processing on the noise power spectrum estimation by using a compensation and correction factor.

It will be appreciated that the purpose of the smooth update processing and the compensation and correction processing is to smooth and compensate the noise power spectrum estimation so that it more closely approximates the true value of the noise power spectrum. The values of the smoothing factor and the compensation and correction factor may be determined according to actual use requirements and are not limited in the embodiment of the present application.
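
One possible reading of S103 and S104 is sketched below. It assumes first-order recursive smoothing with a factor `alpha` and a multiplicative compensation factor applied to linear-domain power values; neither form is stated in the application, and the names `smooth_update` and `compensate` are hypothetical.

```python
def smooth_update(previous, current, alpha):
    """Blend the previous noise estimate with the current one, band by band.

    Assumed form: alpha * previous + (1 - alpha) * current, so a larger
    alpha keeps more of the previous estimate and updates more slowly.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    return [alpha * p + (1.0 - alpha) * c for p, c in zip(previous, current)]


def compensate(noise_estimate, factor):
    """Scale the smoothed estimate by a compensation and correction factor."""
    return [factor * v for v in noise_estimate]


smoothed = smooth_update([1.0, 2.0], [3.0, 4.0], alpha=0.5)
noise_loudness = compensate(smoothed, factor=1.5)
```

If the sub-band power spectrums are kept as log values, the compensation would more naturally be an additive offset; that choice is left open here.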

The electronic device obtains the noise power spectrum estimation corresponding to each of the N audio frames according to the M target power spectrums in each of the N sub-band power spectrums, and then obtains the noise loudness of the target audio signal from the noise power spectrum estimation through smooth update processing and compensation and correction processing. It should be noted that the noise power spectrum estimation method (i.e., the noise loudness obtaining method) in the related art achieves a rough estimate of the noise level in a voice call by tracking the minimum power spectrum band of each audio frame. However, even when a compensation factor is added, the noise estimate obtained in this way still deviates considerably from the true value. To overcome this defect, the obtaining method provided by the embodiment of the present application performs statistics on two or more target power spectrums and obtains the noise power spectrum estimation from them. Therefore, compared with the related art, which estimates noise by tracking the minimum power spectrum band of each audio frame, the method provided by the embodiment of the application can effectively reduce the deviation between the noise estimate (i.e., the estimated noise loudness) and the true noise value, and thus improve the call quality of the electronic device.

Optionally, the target audio signal comprises long observation windows and short observation windows alternating with each other.

Illustratively, a long observation window among the plurality of long observation windows is adjacent to two short observation windows, one on each side. That is, the observation windows of the target audio signal alternate between long and short.

It will be appreciated that the length of the long observation window is greater than the length of the short observation window. The specific lengths of the long observation window and the short observation window may be determined according to actual use requirements and are not limited in the embodiment of the present application.
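
The alternating window layout can be sketched as a partition of frame indices. The function name `alternating_windows`, the choice to start with a long window, and the example lengths are assumptions for illustration only.

```python
def alternating_windows(num_frames, long_len, short_len):
    """Split frame indices [0, num_frames) into alternating long/short windows.

    Returns a list of (kind, start, end) tuples with `end` exclusive.
    The final window is truncated if the frames run out.
    """
    if not long_len > short_len > 0:
        raise ValueError("long_len must exceed short_len, and both must be positive")
    windows = []
    start = 0
    kind = "long"  # assumed starting window type
    while start < num_frames:
        length = long_len if kind == "long" else short_len
        end = min(start + length, num_frames)
        windows.append((kind, start, end))
        start = end
        kind = "short" if kind == "long" else "long"
    return windows


layout = alternating_windows(num_frames=10, long_len=4, short_len=2)
```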

For example, as shown in fig. 3, before S103, the obtaining method provided in the embodiment of the present application may further include the following step S105:

S105, the obtaining device determines the jump level of the noise in the target audio signal according to the effective value and the signal-to-noise ratio of each audio frame in the short observation window.

Alternatively, the effective value of each audio frame may be the root mean square (RMS) of the spectral values of that audio frame. In other words, the effective value of each audio frame may be the square root of the mean of the squares of the spectral values of that audio frame.

Alternatively, the signal-to-noise ratio (SNR) of each audio frame may be the ratio between the power of the speech signal and the power of the noise signal in that audio frame. The signal-to-noise ratio of each audio frame may be expressed in decibels. A higher signal-to-noise ratio indicates less noise in that audio frame.

The effective value and the signal-to-noise ratio of each audio frame can represent the noise condition in the audio frame, so that the change condition of the noise in the target audio signal (namely the jump level of the noise in the target audio signal) can be obtained according to the effective value and the signal-to-noise ratio of each audio frame.
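
The two per-frame quantities used in S105 can be computed as in the sketch below. The function names `frame_rms` and `snr_db` and the small `floor` guard are hypothetical; how the speech and noise powers of a frame are separated is outside the scope of this sketch.

```python
import math


def frame_rms(spectral_values):
    """Effective value of a frame: root mean square of its spectral values."""
    return math.sqrt(sum(abs(v) ** 2 for v in spectral_values) / len(spectral_values))


def snr_db(speech_power, noise_power, floor=1e-12):
    """Signal-to-noise ratio in decibels from speech and noise power.

    The floor avoids division by zero and the logarithm of zero.
    """
    return 10.0 * math.log10(max(speech_power, floor) / max(noise_power, floor))


rms = frame_rms([3.0, 4.0])   # sqrt((9 + 16) / 2)
snr = snr_db(100.0, 1.0)      # a power ratio of 100 corresponds to 20 dB
```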

It will be appreciated that the level of jump in the noise in the target audio signal can be used to measure the change in the loudness of the noise.

Alternatively, the jump level of the noise in the target audio signal may include one level or a plurality of levels.

Optionally, the jump level may include a first-level jump and a second-level jump, and may further include a third-level jump and a fourth-level jump.

Alternatively, the lower the jump level, the greater the degree of change in the noise.

It can be understood that the jump parameter value of the first-level jump is larger than that of the second-level jump.

It should be noted that the jump parameter value may be represented by a decibel change value or a decibel change rate of the audio signal. For example, when the decibel change value of the audio signal for the first-level jump is greater than that for the second-level jump, or the decibel change rate of the audio signal for the first-level jump is greater than that for the second-level jump, the jump parameter value of the first-level jump is greater than that of the second-level jump. It is understood that the jump parameter value may be expressed as a numerical value, a level, or a percentage. It is also understood that the audio signal here is the audio signal of an audio frame in which no speech exists, that is, an audio signal capable of representing how noisy the environment of the user of the electronic device is.

For example, a first-level jump may indicate that the user of the electronic device has entered a markedly noisier or markedly quieter environment.

For example, a first-level jump may also indicate that the call quality of the user of the electronic device has markedly improved or markedly deteriorated.
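
A rough sketch of how a jump level might be derived from the effective values and signal-to-noise ratios in a short observation window is given below. Everything specific here is an assumption: the function names, the rule that frames with an SNR below `snr_limit_db` are treated as speech-free, the comparison against a reference level, and the 10 dB and 3 dB thresholds. The application states only that the jump parameter may be a decibel change value and that a first-level jump corresponds to a larger change than a second-level jump.

```python
import math


def noise_level_db(frames, snr_limit_db=5.0, floor=1e-12):
    """Mean level in dB of the frames treated as containing no speech.

    `frames` is a list of (rms, snr_db) pairs; frames whose SNR is below the
    hypothetical `snr_limit_db` are assumed to be noise-only.
    """
    levels = [20.0 * math.log10(max(rms, floor)) for rms, snr in frames if snr < snr_limit_db]
    return sum(levels) / len(levels) if levels else None


def jump_level(reference_db, short_window_frames, first_db=10.0, second_db=3.0):
    """Return 1 (first-level jump), 2 (second-level jump), or 0 (no jump)."""
    level = noise_level_db(short_window_frames)
    if level is None:  # no speech-free frame was found in the window
        return 0
    change = abs(level - reference_db)  # decibel change value of the noise
    if change >= first_db:
        return 1
    if change >= second_db:
        return 2
    return 0


# Two noise-only frames at 0 dB and one frame treated as speech (SNR 20 dB).
frames = [(1.0, 0.0), (1.0, 2.0), (10.0, 20.0)]
```

With these frames, a reference level of -12 dB gives a 12 dB change, a reference of -4 dB gives a 4 dB change, and a reference of -1 dB gives a 1 dB change.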

It can be understood that the electronic device in the embodiment of the present application can accurately determine the jump level of the noise in the target audio signal according to the effective value and the signal-to-noise ratio of each audio frame in the short observation window. Furthermore, the electronic device may determine or select a smooth update processing manner adapted to the jump level, so as to further reduce the deviation between the noise estimate and the true noise value.

Alternatively, as shown in fig. 3, in the case where the jump level of the noise in the target audio signal is determined by S105, S103 includes the following S1031 to S1032:

S1031, in the case that the jump level is a first-level jump, the obtaining device performs first smooth update processing on the noise power spectrum estimation.

It is to be understood that the first smooth update processing may be the smooth update processing corresponding to the first-level jump.

Alternatively, in the case where the jump level is a first-level jump, it can be understood that the noise in the target audio signal has become markedly stronger or weaker.

Further optionally, the above S1031 includes the following S1031a to S1031b:

S1031a, the obtaining device obtains a first smoothing factor according to audio frame information of the audio frames in the short observation window.

Optionally, in this embodiment of the present application, the audio frame information includes: signal-to-noise ratio information and speech presence probability information.

It will be appreciated that the above signal-to-noise ratio information characterizes the ratio between the power of the speech signal and the power of the noise signal of the audio frame in the short observation window of the target audio signal.

It is to be understood that the above-mentioned speech existence probability information may be information characterizing the probability that speech is present in an audio frame in a short observation window of the target audio signal.

Optionally, in this embodiment of the present application, the speech existence probability information may be obtained by means of neural network voice activity detection (NNVAD).

S1031b, the obtaining device performs the first smooth update processing on the noise power spectrum estimation by using the first smoothing factor.

The reason why the obtaining device of the embodiment of the present application performs the first smooth update processing by S1031a to S1031b is as follows. When noise loudness is estimated by minimum tracking, the noise estimate is updated in both speech segments and non-speech segments. However, in speech segments the noise power spectrum estimation is easily affected by the high signal-to-noise ratio and is forced upward. In other words, because the signal-to-noise ratio of the audio signal in a speech segment is higher, the noise estimate for a speech segment deviates more from, and lies above, the true noise value. Therefore, the electronic device of the embodiment of the present application obtains the first smoothing factor in S1031a to S1031b by combining the speech existence probability information and the signal-to-noise ratio information of the audio signal, and performs the first smooth update processing with the obtained first smoothing factor. In this way, the electronic device can markedly reduce the deviation between the noise estimate and the true noise value.
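
The idea behind S1031a can be illustrated with a sketch in which the smoothing factor rises toward 1 (slower update) as the speech existence probability or the signal-to-noise ratio rises, so that speech does not pull the noise estimate upward. The mapping below, the function name `first_smoothing_factor`, and the constants `alpha_min`, `alpha_max`, and `snr_max_db` are assumptions; the application does not give a formula. The factor is meant for an update of the assumed form alpha * previous + (1 - alpha) * current.

```python
def first_smoothing_factor(snr_db, speech_prob, alpha_min=0.5, alpha_max=0.98, snr_max_db=20.0):
    """Map SNR and speech existence probability to a smoothing factor.

    Low SNR and low speech probability give alpha_min (fast tracking of a
    changed noise floor); high SNR or likely speech gives a value near
    alpha_max (the previous noise estimate is largely held).
    """
    snr_weight = min(max(snr_db / snr_max_db, 0.0), 1.0)   # clamp to [0, 1]
    speech_weight = min(max(speech_prob, 0.0), 1.0)        # clamp to [0, 1]
    hold = max(snr_weight, speech_weight)  # either cue is enough to slow the update
    return alpha_min + (alpha_max - alpha_min) * hold


quiet_alpha = first_smoothing_factor(snr_db=0.0, speech_prob=0.0)
speech_alpha = first_smoothing_factor(snr_db=30.0, speech_prob=0.1)
```

Taking the maximum of the two cues is a deliberately conservative choice in this sketch; a product or a weighted sum would be equally plausible readings.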

S1032, under the condition that the jump level is the second-stage jump, the obtaining device carries out second smooth updating processing on the noise power spectrum estimation.

It is to be understood that the second smooth update process may be a smooth update process corresponding to the secondary transition.

Optionally, in the embodiment of the present application, in the case that the hopping level is two-stage hopping, it may be understood that the noise in the target audio signal is strengthened or weakened to a smaller or less obvious degree

Further optionally, S1032 includes S1032a to S1032c described below:

S1032a, the obtaining device obtains an initial smoothing factor by fitting, according to the noise power spectrum estimate corresponding to the audio frames in the first long observation window and the noise power spectrum estimate corresponding to the audio frames in the first short observation window.

It is understood that the first long observation window and the first short observation window are two adjacent observation windows with alternating lengths.

S1032b, the obtaining device performs superposition fitting on the initial smoothing factor through the audio frame information of the audio frame in the first long observation window to obtain a second smoothing factor.

Optionally, the audio frame information includes: signal-to-noise ratio information and speech presence probability information.

It will be appreciated that the above signal-to-noise ratio characterizes the ratio between the power of the speech signal and the power of the noise signal of an audio frame in a short observation window of the target audio signal.

It is to be understood that the above-mentioned speech existence probability information may be information characterizing the probability of existence of speech in an audio frame in a short observation window of the target audio signal.

Optionally, the speech presence probability information may be obtained by means of Neural Network Voice Activity Detection (NNVAD).

S1032c, the obtaining device performs a second smoothing update process on the noise power spectrum estimation by using the second smoothing factor.

The acquisition apparatus of the embodiment of the present application performs the second smoothing update processing through S1032a to S1032c for the following reason. When the jump level is a second-level jump, the electronic device may assume that its user is entering a slightly changing noise field environment (i.e., the noise field environment around the user changes relatively little). In this case, the electronic device may first count the M target power spectrums (i.e., the power spectrums within the small value interval) of each audio frame within the long observation window, and thereby obtain a noise power spectrum estimate for the long observation window. Then, an initial smoothing factor is calculated by fitting, based on the noise power spectrum estimate of the current observation window and the noise power spectrum estimate of the previous observation window (i.e., the first long observation window and the first short observation window). Finally, the electronic device counts the effective value and the signal-to-noise ratio of each audio frame in the long observation window, calculates the signal-to-noise ratio of the signal in the long observation window, combines it with the speech presence probability information of the audio signal segment in the long observation window, obtains the second smoothing factor by superposition fitting, and updates the current noise power spectrum estimate. Therefore, when the user of the electronic device enters a slightly changing noise field environment, the electronic device of the embodiment of the present application can determine or select a smoothing factor suited to that environment for the update processing, thereby reducing the deviation between the noise estimate and the true noise value.
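As a non-limiting sketch, S1032a and S1032b might be realized as shown below. The embodiment does not disclose the fitting function or the superposition rule, so the exponential mapping from the gap between the two window estimates, the interpolation toward 1, and the names `initial_smoothing_factor`, `second_smoothing_factor`, `scale_db` and `snr_ref_db` are assumptions of this sketch.

```python
import math


def initial_smoothing_factor(long_est_db, short_est_db, scale_db=10.0):
    """Assumed fit: a larger gap between the two adjacent window estimates
    gives a smaller factor, i.e. faster tracking of the new noise level."""
    gap = sum(abs(a - b) for a, b in zip(long_est_db, short_est_db))
    gap /= len(long_est_db)
    return math.exp(-gap / scale_db)


def second_smoothing_factor(initial, window_snr_db, speech_prob,
                            snr_ref_db=20.0):
    """Assumed superposition: move the initial factor toward 1 in proportion
    to the speech evidence found in the long observation window."""
    snr_term = min(max(window_snr_db / snr_ref_db, 0.0), 1.0)
    speech_weight = max(snr_term, speech_prob)
    return initial + (1.0 - initial) * speech_weight
```

The resulting factor would then be used in S1032c, for example in a recursive update of the form `alpha * previous + (1 - alpha) * current` applied per subband, which is likewise an assumed form.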

Further exemplarily, as shown in fig. 4, the obtaining method of the loudness of the noise in the audio signal may be implemented by the following S201 to S220:

S201, the acquisition device calculates the sub-band power spectrum acquired by the main microphone.

Wherein the sub-band power spectrum is a log power spectrum.

S202, the obtaining device calculates the effective value, the signal-to-noise ratio, the small value interval and the weight value distribution condition of the current audio frame to obtain noise power spectrum estimation.

Wherein, the small value interval is the small value interval of the N sub-band power spectrums. Namely: the small value interval is M target power spectrums falling into a preset power spectrum interval.
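For illustration, obtaining a noise power spectrum estimate from the small value interval of one audio frame might be sketched as follows. The embodiment states only that M target power spectrums (M greater than or equal to 2) are given weight values and combined; the normalized weighted average and the name `noise_estimate_from_small_values` are assumptions of this sketch.

```python
def noise_estimate_from_small_values(subband_power_db, m, weights=None):
    """Weighted average of the m smallest sub-band log powers of one frame.

    The weighting rule is an assumption; equal weights are used by default.
    """
    if m < 2:
        raise ValueError("M must be an integer greater than or equal to 2")
    # Small value interval: the m lowest values of the sub-band power spectrum.
    smallest = sorted(subband_power_db)[:m]
    if weights is None:
        weights = [1.0] * len(smallest)
    if len(weights) != len(smallest):
        raise ValueError("one weight is required per selected power value")
    total = sum(weights)
    return sum(w * p for w, p in zip(weights, smallest)) / total
```

For example, with sub-band powers of -30, -60, -50 and -10 dB and M = 2, the estimate with equal weights is the mean of -60 dB and -50 dB, rather than the single minimum of -60 dB that a pure minimum-tracking method would return.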

S203, the obtaining device updates effective values, signal-to-noise ratios, noise power spectrum estimation and voice frame identifiers in the long observation window.

S204, the obtaining device updates effective values, signal-to-noise ratios, noise power spectrum estimation and voice frame identifiers in the short observation window.

Wherein the voice frame identifiers in S203 and S204 are used to indicate voice existence probability information.

S205, the acquisition device votes on each state variable in the short observation window, and determines the audio signal attribute of the current observation window and the jump level of the noise.

Wherein the jump level of the noise can be understood as the amplitude of the abrupt change of the background noise. A first-level jump indicates that the noise has increased or decreased considerably, for example by 10 dB or more. A second-level jump indicates that the noise has increased or decreased slightly, for example by less than 10 dB.

If the audio signal attribute of the current observation window changes, recording the audio signal attribute of the current observation window, and outputting an identifier of the steady-state noise state change of the audio signal segment.
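The jump level decision of S205 might be sketched as a simple majority vote, shown below for illustration. The embodiment does not specify which state variables are voted on or how the votes are counted; comparing each frame level with the previous noise estimate, the strict-majority rule, and the name `jump_level` are assumptions, while the 10 dB threshold follows the example given above.

```python
def jump_level(frame_levels_db, prev_noise_db, threshold_db=10.0):
    """Return 1 for a first-level jump and 2 for a second-level jump.

    Each frame in the short observation window votes for a first-level jump
    when its level differs from the previous noise estimate by at least
    threshold_db; a strict majority of such votes decides the result.
    """
    votes = sum(1 for level in frame_levels_db
                if abs(level - prev_noise_db) >= threshold_db)
    return 1 if votes * 2 > len(frame_levels_db) else 2
```

Voting over several frames, rather than reacting to a single frame, is one way to keep an isolated loud frame from being mistaken for a change of the noise field.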

S206, the obtaining device judges whether the jump level is one-level jump.

If the determination result is yes, step S207 is executed, and if the determination result is no, step S215 is executed.

Wherein the transition level is determined based on the noise state identifier.

S207, the acquisition device judges whether the background noise has increased.

If the determination result is yes, step S208 is executed, and if the determination result is no, step S210 is executed.

S208, the obtaining device counts the effective value and the signal-to-noise ratio in the short observation window, determines the range of the selected small value interval and the distribution weight coefficient according to the distribution rule of the effective value and the signal-to-noise ratio, and obtains the noise power spectrum estimation of the short observation window section signal.

Wherein the small value interval range can be understood as the range of M target power spectrums.

S209, the obtaining device counts the voice existence probability information and the signal-to-noise ratio information of each audio frame in the short observation window, and adaptively selects a smoothing factor of the steady-state noise power spectrum estimation according to the audio signal characteristics of the current observation window.

S210, the obtaining device judges whether the background noise reduction amplitude is smaller than 20 dB.

If yes, step S211 is executed, and if no, step S214 is executed.

S211, the obtaining device counts the effective value and the signal-to-noise ratio in the short observation window, determines the range of the selected frame small value interval and the distribution weight coefficient according to the distribution rule of the effective value and the signal-to-noise ratio, and obtains the noise power spectrum estimation of the short observation window section signal.

S212, the obtaining device counts the voice existence probability information and the signal-to-noise ratio information of each audio frame in the short observation window, and adaptively selects a smoothing factor of the steady-state noise power spectrum estimation according to the audio signal characteristics of the current observation window.

S213, the acquiring device updates the noise level of the current window by combining the noise power spectrum estimated value of the previous observation window.

S214, the obtaining device counts the effective value and the signal-to-noise ratio of each audio frame in the short observation window, selects the small value interval of partial audio frames, and distributes the weight coefficient to obtain the noise power spectrum estimation in the short observation window.

S215, the acquisition device counts the small value interval of each audio frame in the long observation window to obtain the noise power spectrum estimation in the long observation window.

S216, the obtaining device calculates the smoothing factor in a fitting mode according to the noise power spectrum estimation of the current observation window and the noise power spectrum estimation of the previous observation window.

S217, the obtaining device counts the effective value and the signal-to-noise ratio of each audio frame in the long observation window to obtain the signal-to-noise ratio of the signal segment in the observation window.

S218, the obtaining device determines the attribute of the voice signal segment by combining the voice existence probability information of the signal segment in the long observation window with the signal-to-noise ratio information of the voice signal segment, and generates an aggregated smoothing window coefficient by superposition.

S219, the obtaining device updates the noise level of the current observation window by combining the noise power spectrum estimation value of the previous observation window.

S220, the obtaining device outputs the current noise power spectrum estimation value through the compensation correction function.

It can be understood that if the noise of the background environment undergoes a first-level jump, the noise level is clearly stronger or weaker than the previous noise level. When a background noise field environment with suddenly strengthened noise is entered, the electronic device counts the distribution of the effective values and signal-to-noise ratios based on the state variables of the audio signal in the short observation window, selects the small value intervals of a part of the audio frames, and obtains an initial noise power spectrum estimate of the signal segment in the short window. It then adaptively selects a smoothing factor for the noise power spectrum estimate according to the signal-to-noise ratio distribution of the signal in the short observation window and the speech probability of each audio frame, and updates the noise power spectrum estimate of the current signal segment again. When a suddenly weakened background noise field environment is entered and the reduction is limited (for example, the background noise is reduced by less than 20 dB), the same estimation method is adopted. When the noise power in the short observation window is reduced by a larger amount (for example, by 20 dB or more), it is assumed that the user of the electronic device has entered a quieter environment; a new small value interval of a part of the audio frames is selected according to the distribution of the effective values and signal-to-noise ratios in the short observation window, and is then combined with the weight coefficients to obtain the noise power spectrum estimate in the short window.

It will be appreciated that if the noise in the background environment does not trigger the first-level jump identifier, the jump level may be considered a second-level jump (i.e., a relatively slight jump). When a noise field environment with a second-level jump is entered, the electronic device first counts the small value interval of each audio frame in the long observation window to obtain the noise power spectrum estimate in the long observation window. It then calculates a smoothing factor by fitting, according to the noise power spectrum estimate of the current observation window and the noise power spectrum estimate of the previous observation window. Next, it counts the effective value and the signal-to-noise ratio of each audio frame in the long observation window, calculates the signal-to-noise ratio of the audio signal in the long observation window, combines it with the speech presence probability information of the audio signal segment in the long observation window, superposes these on the fitted smoothing factor to generate an aggregated smoothing window coefficient, and finally updates the current noise power spectrum estimate.

Through the above steps S201 to S220, the electronic device may reduce the deviation between the noise power spectrum estimation and the true value of the noise power spectrum, so as to achieve accurate estimation of the loudness of noise in the audio signal, and thus improve the call quality of the electronic device.
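The branch decisions of S206, S207 and S210 described above can be summarized in a short sketch. Only the routing between the steps is shown; the function name `select_branch`, the branch labels, and the use of a signed noise change in dB are assumptions of this sketch, while the 20 dB limit is the example value given in S210.

```python
def select_branch(level, noise_change_db, drop_limit_db=20.0):
    """Return a label for the processing branch taken after S205.

    level is the jump level (1 or 2); noise_change_db is positive when the
    background noise has increased and negative when it has decreased.
    """
    if level != 1:
        # S206 answered "no": long-window path, S215 to S219.
        return "long_window_update"
    if noise_change_db > 0:
        # S207 answered "yes": short-window path, S208 to S209.
        return "short_window_adaptive"
    if -noise_change_db < drop_limit_db:
        # S210 answered "yes": short-window path, S211 to S213.
        return "short_window_smoothed"
    # S210 answered "no": quieter environment, S214.
    return "short_window_reset"
```

For instance, a first-level jump in which the noise drops by 25 dB is routed to S214, whereas a drop of 12 dB is routed to S211 to S213.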

The embodiment of the present application further provides an obtaining apparatus 200 for noise loudness in an audio signal, where the obtaining apparatus 200 may be an apparatus, or may be a component, an integrated circuit, or a chip in a terminal. The acquiring apparatus 200 may be a mobile electronic device or a non-mobile electronic device. By way of example, the mobile electronic device may be a mobile phone, a tablet computer, a notebook computer, a palm top computer, a vehicle-mounted electronic device, a wearable device, an ultra-mobile personal computer (UMPC), a netbook or a Personal Digital Assistant (PDA), and the like, and the non-mobile electronic device may be a server, a Network Attached Storage (NAS), a Personal Computer (PC), a Television (TV), a teller machine or a self-service machine, and the like, and the embodiments of the present application are not particularly limited.

The acquisition device 200 in the embodiment of the present application may be a device having an operating system. The operating system may be an Android operating system, an iOS operating system, or another possible operating system, and the embodiments of the present application are not specifically limited.

The obtaining apparatus 200 provided in the embodiment of the present application can implement each process implemented by the method embodiments of fig. 1 to fig. 4, and is not described herein again to avoid repetition.

As shown in fig. 5, the acquisition apparatus 200 includes an obtaining module 210, an estimating module 220, an updating module 230, and a correcting module 240. The obtaining module 210 is configured to obtain N subband power spectrums of N audio frames in the target audio signal. The estimating module 220 is configured to obtain, according to the M target power spectrums in each of the N subband power spectrums obtained by the obtaining module 210, a noise power spectrum estimate corresponding to each of the N audio frames. The updating module 230 is configured to perform smoothing update processing on the noise power spectrum estimate obtained by the estimating module 220. The correcting module 240 is configured to perform compensation correction processing on the noise power spectrum estimate processed by the updating module 230 to obtain the noise loudness of the target audio signal. Wherein N is an integer greater than or equal to 1, and M is an integer greater than or equal to 2.

Optionally, in this embodiment of the application, the M target power spectrums are power spectrums falling within a preset power spectrum interval. The M target power spectrums may be all or a part of the power spectrums falling within a preset power spectrum interval. The preset power spectrum interval is a set of the first P% or the first Q power spectrums arranged in the sub-band power spectrums from low to high, or the preset power spectrum interval is a set of the last R% or the last S power spectrums arranged in the sub-band power spectrums from high to low, wherein P and R are positive numbers respectively, and Q and S are positive integers respectively.
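As an illustrative sketch, selecting the preset power spectrum interval by percentage or by count might look as follows. Ordering from low to high and taking the first P% or the first Q values selects the same low-power values as taking the last R% or the last S values of a high-to-low ordering, so only the low-to-high form is shown. The function name `select_small_value_interval`, rounding the percentage up, and clamping to at least two values are assumptions of this sketch.

```python
import math


def select_small_value_interval(subband_power_db, percent=None, count=None):
    """Return the lowest sub-band powers, chosen by percentage or by count.

    Exactly one of percent (P) or count (Q) must be given.
    """
    if (percent is None) == (count is None):
        raise ValueError("give exactly one of percent or count")
    ordered = sorted(subband_power_db)  # low to high
    if percent is not None:
        # Rounding up is an assumption; the embodiment does not specify it.
        count = math.ceil(len(ordered) * percent / 100.0)
    # Keep at least two values (M >= 2) and at most all available values.
    count = max(2, min(count, len(ordered)))
    return ordered[:count]
```

The M target power spectrums may then be all of the returned values or a part of them, as stated above.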

Optionally, in this embodiment of the present application, the target audio signal includes a long observation window and a short observation window that alternate with each other. As shown in fig. 6, the obtaining apparatus 200 further includes a determining module 250. The determining module 250 is configured to determine, according to the effective value and the signal-to-noise ratio of each audio frame in the short observation window, the jump level of the noise in the target audio signal before the updating module 230 performs the smoothing update processing on the noise power spectrum estimate obtained by the estimating module 220. The updating module 230 is specifically configured to: in the case that the jump level is a first-level jump, perform first smoothing update processing on the noise power spectrum estimate obtained by the estimating module 220; and in the case that the jump level is a second-level jump, perform second smoothing update processing on the noise power spectrum estimate obtained by the estimating module 220. The jump parameter value of the first-level jump is larger than the jump parameter value of the second-level jump.

Optionally, in this embodiment of the application, the updating module 230 is specifically configured to: obtain a first smoothing factor according to the audio frame information of the audio frames in the short observation window; and perform first smoothing update processing on the noise power spectrum estimate obtained by the estimating module 220 by using the first smoothing factor. Wherein the audio frame information includes: signal-to-noise ratio information and speech presence probability information.

Optionally, in this embodiment of the application, the updating module 230 is specifically configured to: obtain an initial smoothing factor by fitting, according to the noise power spectrum estimate corresponding to the audio frames in the first long observation window and the noise power spectrum estimate corresponding to the audio frames in the first short observation window; perform superposition fitting on the initial smoothing factor by using the audio frame information of the audio frames in the first long observation window to obtain a second smoothing factor; and perform second smoothing update processing on the noise power spectrum estimate obtained by the estimating module 220 by using the second smoothing factor. Wherein the audio frame information includes signal-to-noise ratio information and speech presence probability information, and the first long observation window and the first short observation window are adjacent observation windows.

Optionally, in this embodiment of the application, the estimating module 220 is specifically configured to: assign a weight value to each of the M target power spectrums; and obtain the noise power spectrum estimate according to the M target power spectrums and the weight values.

The obtaining device 200 provided in the embodiment of the present application performs statistics on two or more target power spectrums, and thereby obtains a noise power spectrum estimate according to the two or more target power spectrums. Compared with the related art that noise estimation is performed by tracking the minimum power spectrum band per audio frame, the obtaining apparatus 200 provided by the embodiment of the present application can effectively reduce the deviation between the noise estimation (i.e., the noise loudness obtained by estimation) and the noise true value.

As shown in fig. 7, an embodiment of the present application further provides an electronic device 100, which includes a processor 110, a memory 109, and a program or an instruction stored in the memory 109 and executable on the processor 110. When executed by the processor 110, the program or the instruction implements each process of the obtaining method according to any embodiment of the present application and can achieve the same technical effect; to avoid repetition, details are not repeated here.

It should be noted that the electronic devices in the embodiments of the present application include the mobile electronic device and the non-mobile electronic device described above.

Fig. 8 is a schematic hardware structure diagram of an electronic device 100 implementing an embodiment of the present application.

The electronic device 100 includes, but is not limited to: a radio frequency unit 101, a network module 102, an audio output unit 103, an input unit 104, a sensor 105, a display unit 106, a user input unit 107, an interface unit 108, a memory 109, and a processor 110.

The audio output unit 103 serves as an obtaining module for obtaining N subband power spectrums of N audio frames in the target audio signal. The processor 110 serves as an estimation module, an update module, and a correction module, and is configured to obtain, according to M target power spectrums in each subband power spectrum of the N subband power spectrums obtained by the obtaining module, noise power spectrum estimates corresponding to each audio frame of the N audio frames, perform smooth update processing on the noise power spectrum estimates obtained by the estimation module, and perform compensation correction processing on the noise power spectrum estimates processed by the update module to obtain a noise loudness of the target audio signal. Wherein N is an integer greater than or equal to 1, and M is an integer greater than or equal to 2.

Optionally, in this embodiment of the present application, the target audio signal includes a long observation window and a short observation window that alternate with each other, and the processor 110 further serves as a determining module, configured to determine, according to the effective value and the signal-to-noise ratio of each audio frame in the short observation window, the jump level of the noise in the target audio signal before the updating module performs the smoothing update processing on the noise power spectrum estimate obtained by the estimation module. The updating module is specifically configured to: in the case that the jump level is a first-level jump, perform first smoothing update processing on the noise power spectrum estimate obtained by the estimation module; and in the case that the jump level is a second-level jump, perform second smoothing update processing on the noise power spectrum estimate obtained by the estimation module. The jump parameter value of the first-level jump is larger than the jump parameter value of the second-level jump.

Optionally, in this embodiment of the application, the processor 110 serves as the updating module, and is specifically configured to: obtain a first smoothing factor according to the audio frame information of the audio frames in the short observation window; and perform first smoothing update processing on the noise power spectrum estimate obtained by the estimation module by using the first smoothing factor. Wherein the audio frame information includes: signal-to-noise ratio information and speech presence probability information.

Optionally, in this embodiment of the application, the processor 110 is specifically configured to: obtain an initial smoothing factor by fitting, according to the noise power spectrum estimate corresponding to the audio frames in the first long observation window and the noise power spectrum estimate corresponding to the audio frames in the first short observation window; perform superposition fitting on the initial smoothing factor by using the audio frame information of the audio frames in the first long observation window to obtain a second smoothing factor; and perform second smoothing update processing on the noise power spectrum estimate obtained by the estimation module by using the second smoothing factor. Wherein the audio frame information includes signal-to-noise ratio information and speech presence probability information, and the first long observation window and the first short observation window are adjacent observation windows.

Optionally, in this embodiment of the present application, the processor 110 serves as the estimation module, and is specifically configured to: assign a weight value to each of the M target power spectrums; and obtain the noise power spectrum estimate according to the M target power spectrums and the weight values.

The electronic device 100 provided in the embodiment of the present application performs statistics on two or more target power spectrums, and thus obtains a noise power spectrum estimation according to the two or more target power spectrums. Compared with the related art of noise estimation by tracking the minimum power spectrum band of each audio frame, the electronic device 100 provided by the embodiment of the present application can effectively reduce the deviation degree between the noise estimation (i.e. the noise loudness obtained by estimation) and the noise true value.

Those skilled in the art will appreciate that the electronic device 100 may further comprise a power source (e.g., a battery) for supplying power to various components, and the power source may be logically connected to the processor 110 through a power management system, so as to implement functions of managing charging, discharging, and power consumption through the power management system. The electronic device structure shown in fig. 8 does not constitute a limitation of the electronic device, and the electronic device may include more or less components than those shown, or combine some components, or arrange different components, and thus, the description is omitted here.

It should be understood that, in the embodiment of the present application, the input Unit 104 may include a Graphics Processing Unit (GPU) 1041 and a microphone 1042, and the Graphics Processing Unit 1041 processes image data of a still picture or a video obtained by an image capturing device (such as a camera) in a video capturing mode or an image capturing mode. The display unit 106 may include a display panel 1061, and the display panel 1061 may be configured in the form of a liquid crystal display, an organic light emitting diode, or the like. The user input unit 107 includes a touch panel 1071 and other input devices 1072. The touch panel 1071 is also referred to as a touch screen. The touch panel 1071 may include two parts of a touch detection device and a touch controller. Other input devices 1072 may include, but are not limited to, a physical keyboard, function keys (e.g., volume control keys, switch keys, etc.), a trackball, a mouse, and a joystick, which are not described in detail herein. The memory 109 may be used to store software programs as well as various data including, but not limited to, application programs and an operating system. The processor 110 may integrate an application processor, which primarily handles operating systems, user interfaces, applications, etc., and a modem processor, which primarily handles wireless communications. It will be appreciated that the modem processor described above may not be integrated into the processor 110.

The embodiment of the present application further provides a readable storage medium, where a program or an instruction is stored on the readable storage medium, and when the program or the instruction is executed by a processor, the program or the instruction implements each process of the above-mentioned embodiment of the acquisition method, and can achieve the same technical effect, and in order to avoid repetition, details are not repeated here.

The processor is the processor in the electronic device described in the above embodiment. The readable storage medium includes a computer-readable storage medium, such as a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, or an optical disk.

An embodiment of the present application further provides a chip, where the chip includes a processor and a communication interface, the communication interface is coupled to the processor, and the processor is configured to execute a program or an instruction to implement each process of the foregoing method embodiments and can achieve the same technical effect; to avoid repetition, details are not repeated here.

It should be understood that the chip mentioned in the embodiments of the present application may also be referred to as a system-level chip, a system chip, a chip system, or a system-on-chip.

It should be noted that, in this document, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a ..." does not exclude the presence of other like elements in the process, method, article, or apparatus that comprises the element. Further, it should be noted that the scope of the methods and apparatus of the embodiments of the present application is not limited to performing the functions in the order illustrated or discussed, but may include performing the functions in a substantially simultaneous manner or in a reverse order based on the functions involved; for example, the methods described may be performed in an order different from that described, and various steps may be added, omitted, or combined. In addition, features described with reference to certain examples may be combined in other examples.

Through the above description of the embodiments, those skilled in the art will clearly understand that the method of the above embodiments can be implemented by software plus a necessary general hardware platform, and certainly can also be implemented by hardware, but in many cases, the former is a better implementation manner. Based on such understanding, the technical solutions of the present application may be embodied in the form of a computer software product, which is stored in a storage medium (such as ROM/RAM, magnetic disk, optical disk) and includes instructions for enabling a terminal (such as a mobile phone, a computer, a server, or a network device) to execute the method according to the embodiments of the present application.

While the present embodiments have been described with reference to the accompanying drawings, it is to be understood that the invention is not limited to the precise embodiments described above, which are meant to be illustrative and not restrictive, and that various changes may be made therein by those skilled in the art without departing from the spirit and scope of the invention as defined by the appended claims.

Claims

1. A method for obtaining the loudness of noise in an audio signal, comprising:

acquiring N subband power spectrums of N audio frames in a target audio signal;

obtaining a noise power spectrum estimation corresponding to each audio frame in the N audio frames according to M target power spectrums in each sub-band power spectrum in the N sub-band power spectrums;

carrying out smooth updating processing on the noise power spectrum estimation;

performing compensation correction processing on the processed noise power spectrum estimation to obtain the noise loudness of the target audio signal;

wherein N is an integer greater than or equal to 1, and M is an integer greater than or equal to 2.

2. The acquisition method according to claim 1, wherein the M target power spectra are power spectra that fall within a preset power spectrum interval; the preset power spectrum interval is a set of the first P% or the first Q power spectrums arranged in the sub-band power spectrums from low to high, or the preset power spectrum interval is a set of the last R% or the last S power spectrums arranged in the sub-band power spectrums from high to low, wherein P and R are positive numbers respectively, and Q and S are positive integers respectively.

3. The method according to claim 1, wherein the target audio signal comprises a long observation window and a short observation window which alternate with each other, and before the smooth update processing of the noise power spectrum estimation, the method further comprises:

determining the jump level of noise in the target audio signal according to the effective value and the signal-to-noise ratio of each audio frame in the short observation window;

the performing smooth update processing on the noise power spectrum estimation includes:

in a case where the jump level is a first-level jump, performing first smoothing update processing on the noise power spectrum estimation;

in a case where the jump level is a second-level jump, performing second smoothing update processing on the noise power spectrum estimation;

wherein the jump parameter value of the first-level jump is greater than the jump parameter value of the second-level jump.
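The dispatch between the two smoothing updates in claim 3 can be sketched as follows. The claim does not specify the form of the update, so this sketch assumes a common first-order recursive smoother; the function names, the integer encoding of the jump level, and the default smoothing factors are hypothetical. It also assumes that the larger (first-level) jump should track the new estimate faster, that is, use a smaller smoothing factor.

```python
def smooth_update(previous, current, alpha):
    """First-order recursive smoothing (an assumed form, not from the claims).

    Returns alpha * previous + (1 - alpha) * current, element by element.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    if len(previous) != len(current):
        raise ValueError("estimates must have the same length")
    return [alpha * p + (1.0 - alpha) * c for p, c in zip(previous, current)]


def update_for_jump_level(previous, current, jump_level,
                          alpha_first=0.5, alpha_second=0.75):
    """Apply the first or second smoothing update depending on the jump level.

    jump_level: 1 for a first-level jump, 2 for a second-level jump
    (hypothetical encoding). The default factors are placeholders only.
    """
    if jump_level == 1:
        return smooth_update(previous, current, alpha_first)
    if jump_level == 2:
        return smooth_update(previous, current, alpha_second)
    raise ValueError("jump_level must be 1 or 2")
```

In claims 4 and 5 the smoothing factors are derived from audio frame information and from fitting over adjacent observation windows; the fixed defaults above merely stand in for those values.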

4. The method according to claim 3, wherein the performing a first smoothing update process on the noise power spectrum estimation comprises:

obtaining a first smoothing factor according to the audio frame information of the audio frame in the short observation window;

performing the first smoothing update processing on the noise power spectrum estimation by using the first smoothing factor;

wherein the audio frame information includes: signal-to-noise ratio information and speech presence probability information.

5. The method according to claim 3, wherein the performing second smoothing update processing on the noise power spectrum estimation comprises:

fitting to obtain an initial smoothing factor according to the noise power spectrum estimation corresponding to the audio frame in the first long observation window and the noise power spectrum estimation corresponding to the audio frame in the first short observation window;

performing superposition fitting on the initial smoothing factor through the audio frame information of the audio frame in the first long observation window to obtain a second smoothing factor;

performing the second smoothing update processing on the noise power spectrum estimation by adopting the second smoothing factor;

wherein the audio frame information includes: the first long observation window and the first short observation window are adjacent observation windows.

6. The method according to any one of claims 1 to 5, wherein the obtaining, according to the M target power spectrums in each of the N subband power spectrums, the noise power spectrum estimation corresponding to each of the N audio frames includes:

respectively assigning a weight value to each target power spectrum in the M target power spectrums;

and obtaining the noise power spectrum estimation according to the M target power spectrums and the weight values.
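The weighting step of claim 6 can be sketched as follows. The claim states only that each target power spectrum is assigned a weight value; combining them as a normalized weighted average is an assumption of this sketch, and the function name `weighted_noise_estimate` is hypothetical.

```python
def weighted_noise_estimate(target_spectra, weights):
    """Combine M target power spectrums into one noise power estimate.

    Assumption: the estimate is the weighted average of the M values,
    normalized by the sum of the weights.
    """
    if len(target_spectra) != len(weights):
        raise ValueError("one weight is required per target power spectrum")
    if len(target_spectra) < 2:
        raise ValueError("claim 1 requires M >= 2")

    total_weight = sum(weights)
    if total_weight <= 0:
        raise ValueError("weights must have a positive sum")

    weighted_sum = sum(p * w for p, w in zip(target_spectra, weights))
    return weighted_sum / total_weight
```

With equal weights this reduces to the plain mean of the selected values; unequal weights let some of the selected power spectrums contribute more to the estimate than others.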

7. An apparatus for obtaining the loudness of noise in an audio signal, comprising:

an acquisition module, configured to acquire N sub-band power spectrums of N audio frames in a target audio signal;

an estimation module, configured to obtain, according to M target power spectrums in each of the N subband power spectrums acquired by the acquisition module, noise power spectrum estimates corresponding to each of the N audio frames;

an updating module, configured to perform smooth update processing on the noise power spectrum estimation obtained by the estimation module;

a correction module, configured to perform compensation correction processing on the noise power spectrum estimation processed by the updating module to obtain the noise loudness of the target audio signal;

8. The apparatus according to claim 7, wherein the M target power spectrums are power spectrums that fall within a preset power spectrum interval; the preset power spectrum interval is the set of the first P% or the first Q power spectrums when the power spectrums in the sub-band power spectrum are arranged from low to high, or the preset power spectrum interval is the set of the last R% or the last S power spectrums when the power spectrums in the sub-band power spectrum are arranged from high to low, wherein P and R are positive numbers, and Q and S are positive integers.

9. The apparatus according to claim 7, wherein the target audio signal includes long observation windows and short observation windows alternating with each other, the apparatus further comprising:

a determining module, configured to determine, before the updating module performs smooth updating processing on the noise power spectrum estimation obtained by the estimating module, a jump level of noise in the target audio signal according to an effective value and a signal-to-noise ratio of each audio frame in the short observation window;

the update module is specifically configured to:

in a case where the jump level is a first-level jump, perform first smoothing update processing on the noise power spectrum estimation obtained by the estimation module;

in a case where the jump level is a second-level jump, perform second smoothing update processing on the noise power spectrum estimation obtained by the estimation module;

10. The apparatus according to claim 9, wherein the updating module is specifically configured to:

performing the first smoothing update processing on the noise power spectrum estimation obtained by the estimation module by using the first smoothing factor;

11. The apparatus according to claim 9, wherein the updating module is specifically configured to:

performing the second smoothing update processing on the noise power spectrum estimation obtained by the estimation module by using the second smoothing factor;

12. The apparatus according to any one of claims 7 to 11, wherein the estimation module is specifically configured to:

13. An electronic device comprising a processor, a memory, and a program or instructions stored on the memory and executable on the processor, the program or instructions, when executed by the processor, implementing the steps of the acquisition method according to any one of claims 1 to 6.

14. A readable storage medium, characterized in that it stores thereon a program or instructions which, when executed by a processor, implement the steps of the acquisition method according to any one of claims 1 to 6.