WO2023098103A9 - Audio processing method and audio processing apparatus

Info

Publication number: WO2023098103A9
Application number: PCT/CN2022/107039
Authority: WO
Inventors: 李楠; 张晨
Original assignee: 北京达佳互联信息技术有限公司
Priority date: 2021-12-03
Filing date: 2022-07-21
Publication date: 2023-08-03
Also published as: CN114157254A; WO2023098103A1

Abstract

An audio processing method and an audio processing apparatus. The audio processing method comprises the following steps: acquiring the current audio frame to be processed; determining the energy and the type of the current audio frame, wherein the type comprises one of a voice frame and a non-voice frame; on the basis of the energy and the type of the current audio frame, obtaining voice energy distribution data for the current audio frame, wherein the voice energy distribution data is used for compiling statistics on proportions occupied by voice frames of different energy intervals; according to the voice energy distribution data for the current audio frame, determining a first gain for the current audio frame; and applying the first gain to the current audio frame to obtain a first audio frame.

Description

Audio processing method and audio processing device

Cross-reference to related applications

This disclosure is based on a Chinese patent application with application number 202111465600.1 and a filing date of December 3, 2021, and claims the priority of this Chinese patent application. The entire content of this Chinese patent application is hereby incorporated by reference into this application.

Technical field

The present disclosure relates to the field of audio technology, and in particular, to an audio processing method and an audio processing device for automatic gain control.

Background

Automatic gain control (AGC) is a key technology in audio processing and is widely used in real-time communication and other fields. Its basic purpose is to apply different levels of gain to an audio signal according to the volume of the input, so that the volume of the output audio signal stays within a certain range. This prevents the sound from being too loud or too quiet because speakers differ in voice volume or in distance from the device. AGC places high demands on both the ability to control the volume of the output audio and the sound quality of the processed audio. Volume control capability is mainly reflected in the gain convergence time (that is, the time required to arrive at a reasonable gain for a segment of audio with stable volume) and the gain control range (that is, the range over which the gain can vary). Audio quality is mainly reflected in objective evaluation metrics such as Perceptual Evaluation of Speech Quality (PESQ) and Perceptual Objective Listening Quality Analysis (POLQA). However, it is difficult for existing AGC techniques to balance volume control capability against the quality of the processed audio.

Summary

The disclosure provides an audio processing method and an audio processing device.

According to a first aspect of the embodiments of the present disclosure, there is provided an audio processing method, which may include: acquiring a current audio frame to be processed; determining the energy and type of the current audio frame, the type being one of a speech frame and a non-speech frame; obtaining speech energy distribution data for the current audio frame based on the energy and type of the current audio frame, wherein the speech energy distribution data is used to count the proportions of speech frames in different energy intervals; determining a first gain for the current audio frame according to the speech energy distribution data for the current audio frame; and applying the first gain to the current audio frame to obtain a first audio frame.

In some embodiments, obtaining the speech energy distribution data for the current audio frame based on the energy and type of the current audio frame may include: when the energy of the current audio frame is less than a preset noise threshold or the current audio frame is a non-speech frame, using the speech energy distribution data of the previous audio frame of the current audio frame as the speech energy distribution data of the current audio frame; and when the energy of the current audio frame is greater than or equal to the preset noise threshold and the current audio frame is a speech frame, updating the speech energy distribution data of the previous audio frame based on the energy of the current audio frame, wherein, when the current audio frame is the first frame, initial speech energy distribution data is updated based on the energy of the current audio frame, the speech frame proportions of the initial speech energy distribution data being uniformly distributed over the energy intervals.

In some embodiments, updating the speech energy distribution data of the previous audio frame based on the energy of the current audio frame may include: determining the energy interval of the speech energy distribution data to which the energy of the current audio frame belongs; increasing the speech frame proportion of the energy interval corresponding to the determined energy interval in the speech energy distribution data of the previous audio frame; and reducing the speech frame proportions of the energy intervals not corresponding to the determined energy interval in the speech energy distribution data of the previous audio frame.

In some embodiments, updating the speech energy distribution data of the previous audio frame based on the energy of the current audio frame may include: calculating the sum of the speech frame proportions of the energy intervals in the updated speech energy distribution data; determining a residual probability by comparing the sum of the speech frame proportions with a preset value; and distributing the residual probability among the energy intervals of the updated speech energy distribution data until the sum of the speech frame proportions of the energy intervals of the updated speech energy distribution data equals the preset value.

In some embodiments, determining the first gain for the current audio frame according to the speech energy distribution data for the current audio frame may include: accumulating the speech frame proportions of the energy intervals in sequence, starting from the first energy interval of the speech energy distribution data for the current audio frame, until the accumulated sum is equal to or greater than a preset threshold; when the accumulated sum is equal to the preset threshold, taking the upper limit of the energy interval at which the accumulated sum becomes equal to the preset threshold as a first energy limit; when the accumulated sum is greater than the preset threshold, taking the lower limit of the energy interval at which the accumulated sum becomes greater than the preset threshold as the first energy limit; and determining the first gain according to a target energy of the current audio frame and the first energy limit.

In some embodiments, determining the first gain according to the target energy of the current audio frame and the first energy limit may include: determining an initial first gain according to the target energy of the current audio frame and the first energy limit; determining a frame number corresponding to the current audio frame according to the type of the current audio frame; and adjusting the initial first gain by comparing the frame number corresponding to the current audio frame with a preset frame number, and using the adjusted initial first gain as the first gain.

In some embodiments, the audio processing method may further include: determining a second energy limit based on the first gain and the energy of the current audio frame; determining an initial second gain according to the target energy of the current audio frame and the second energy limit; obtaining a second gain vector based on the audio sampling points in the current audio frame and the initial second gain; and applying the second gain to the first audio frame to obtain a second audio frame.

In some embodiments, obtaining the second gain vector based on the audio sampling points in the current audio frame and the initial second gain may include: calculating a gain for each audio sampling point in the current audio frame based on the gain of the last audio sampling point in the previous audio frame of the current audio frame and the initial second gain, so as to generate the second gain vector.

In some embodiments, applying the second gain to the first audio frame to obtain a second audio frame may include: applying each gain in the second gain vector to the corresponding audio sampling point of the first audio frame to obtain the second audio frame; and limiting the amplitude of the second audio frame.

According to a second aspect of the embodiments of the present disclosure, there is provided an audio processing device, which may include: an acquisition module configured to acquire a current audio frame to be processed; a determination module configured to determine the energy and type of the current audio frame, the type being one of a speech frame and a non-speech frame, and to obtain speech energy distribution data for the current audio frame based on the energy and type of the current audio frame, wherein the speech energy distribution data is used to count the proportions of speech frames in different energy intervals; and a first gain module configured to determine a first gain for the current audio frame according to the speech energy distribution data for the current audio frame, and to apply the first gain to the current audio frame to obtain a first audio frame.

In some embodiments, the determination module may be configured to: when the energy of the current audio frame is less than a preset noise threshold or the current audio frame is a non-speech frame, use the speech energy distribution data of the previous audio frame of the current audio frame as the speech energy distribution data of the current audio frame; and when the energy of the current audio frame is greater than or equal to the preset noise threshold and the current audio frame is a speech frame, update the speech energy distribution data of the previous audio frame based on the energy of the current audio frame, wherein, when the current audio frame is the first frame, initial speech energy distribution data is updated based on the energy of the current audio frame, the speech frame proportions of the initial speech energy distribution data being uniformly distributed over the energy intervals.

In some embodiments, the determination module may be configured to: determine the energy interval of the speech energy distribution data to which the energy of the current audio frame belongs; increase the speech frame proportion of the energy interval corresponding to the determined energy interval in the speech energy distribution data of the previous audio frame; and reduce the speech frame proportions of the energy intervals not corresponding to the determined energy interval in the speech energy distribution data of the previous audio frame.

In some embodiments, the determination module may be configured to: calculate the sum of the speech frame proportions of the energy intervals in the updated speech energy distribution data; determine a residual probability by comparing the sum of the speech frame proportions with a preset value; and distribute the residual probability among the energy intervals of the updated speech energy distribution data until the sum of the speech frame proportions of the energy intervals of the updated speech energy distribution data equals the preset value.

In some embodiments, the first gain module may be configured to: accumulate the speech frame proportions of the energy intervals in sequence, starting from the first energy interval of the speech energy distribution data for the current audio frame, until the accumulated sum is equal to or greater than a preset threshold; when the accumulated sum is equal to the preset threshold, take the upper limit of the energy interval at which the accumulated sum becomes equal to the preset threshold as a first energy limit; when the accumulated sum is greater than the preset threshold, take the lower limit of the energy interval at which the accumulated sum becomes greater than the preset threshold as the first energy limit; and determine the first gain according to a target energy of the current audio frame and the first energy limit.

In some embodiments, the first gain module may be configured to: determine an initial first gain according to the target energy of the current audio frame and the first energy limit; determine a frame number corresponding to the current audio frame according to the type of the current audio frame; and adjust the initial first gain by comparing the frame number corresponding to the current audio frame with a preset frame number, and use the adjusted initial first gain as the first gain.

In some embodiments, the audio processing device may further include a second gain module configured to: determine a second energy limit based on the first gain and the energy of the current audio frame; determine an initial second gain based on the target energy of the current audio frame and the second energy limit; obtain a second gain vector based on the audio sampling points in the current audio frame and the initial second gain; and apply the second gain to the first audio frame to obtain a second audio frame.

In some embodiments, the second gain module may be configured to: calculate a gain for each audio sampling point in the current audio frame based on the gain of the last audio sampling point in the previous audio frame of the current audio frame and the initial second gain, so as to generate the second gain vector.

In some embodiments, the second gain module may be configured to: apply each gain in the second gain vector to the corresponding audio sampling point of the first audio frame to obtain a second audio frame; and limit the amplitude of the second audio frame.

According to a third aspect of the embodiments of the present disclosure, there is provided an electronic device, which may include: at least one processor; and at least one memory storing computer-executable instructions, wherein the computer-executable instructions, when executed by the at least one processor, cause the at least one processor to execute the audio processing method as described above.

According to a fourth aspect of the embodiments of the present disclosure, there is provided a computer-readable storage medium storing instructions which, when executed by at least one processor, cause the at least one processor to execute the audio processing method as described above.

According to a fifth aspect of the embodiments of the present disclosure, there is provided a computer program product, where instructions in the computer program product are executed by at least one processor in an electronic device to execute the audio processing method as described above.

The technical solutions provided by the embodiments of the present disclosure bring at least the following beneficial effects:

The voice volume can be controlled better, with a shorter gain convergence time and a larger gain control range. Good volume control is achieved together with a relatively stable gain and higher audio quality. In addition, the dynamic preset gain is determined more accurately by using the energy distribution data of the input speech, thereby yielding speech with higher sound quality.

It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the present disclosure.

Brief description of the drawings

The accompanying drawings here are incorporated into the specification and constitute a part of the specification, show embodiments consistent with the disclosure, and are used together with the description to explain the principle of the disclosure, and do not constitute an improper limitation of the disclosure.

FIG. 1 is a flowchart of an audio processing method according to an embodiment of the present disclosure;

FIG. 2 is a schematic flow diagram of an audio processing method according to an embodiment of the present disclosure;

FIG. 3 is a block diagram of an audio processing device according to an embodiment of the present disclosure;

FIG. 4 is a schematic structural diagram of an audio processing device according to an embodiment of the present disclosure;

FIG. 5 is a block diagram of an electronic device according to an embodiment of the present disclosure.

Throughout the drawings, it should be noted that like reference numbers are used to represent the same or similar elements, features, and structures.

Detailed description

In order to enable ordinary persons in the art to better understand the technical solutions of the present disclosure, the technical solutions in the embodiments of the present disclosure will be clearly and completely described below in conjunction with the accompanying drawings.

The following description with reference to the accompanying drawings is provided to assist in a comprehensive understanding of embodiments of the present disclosure as defined by the claims and their equivalents. Various specific details are included to aid in understanding but are to be regarded as exemplary only. Accordingly, those of ordinary skill in the art will recognize that various changes and modifications of the embodiments described herein can be made without departing from the scope and spirit of the disclosure. Also, descriptions of well-known functions and constructions are omitted for clarity and conciseness.

The terms and words used in the following description and claims are not limited to the bibliographical meanings, but, are merely used by the inventor to enable a clear and consistent understanding of the present disclosure. Accordingly, it should be apparent to those skilled in the art that the following descriptions of various embodiments of the present disclosure are provided for illustration purpose only and not for the purpose of limiting the present disclosure as defined by the claims and their equivalents.

It should be noted that the terms "first" and "second" in the specification and claims of the present disclosure and in the above drawings are used to distinguish similar objects, and are not necessarily used to describe a specific order or sequence. It is to be understood that the data so used are interchangeable under appropriate circumstances, such that the embodiments of the disclosure described herein can be practiced in sequences other than those illustrated or described herein. The implementations described in the following exemplary examples do not represent all implementations consistent with the present disclosure. Rather, they are merely examples of apparatuses and methods consistent with aspects of the present disclosure as recited in the appended claims.

An existing AGC algorithm may preset a fixed, relatively high gain for the input audio and protect the audio with amplitude limiting. However, when a high gain is applied to loud audio, the amplitude limiting module alters the audio waveform and produces severe distortion, so high sound quality is difficult to guarantee. Alternatively, the audio energy over a certain period of time may be used to calculate a reasonable gain to apply to the current audio. However, because the input audio volume varies over both the short term and the long term, this solution usually causes the gain to change too much or the volume response to be insensitive, so that a long time is needed to reach a reasonable gain for a segment of audio. In either case, it is difficult to balance volume control capability against the quality of the processed audio.

Unlike common AGC schemes, the present disclosure proposes an AGC method that combines a dynamic preset gain with short-term energy gain control. The method preserves sound quality while giving the algorithm strong gain control capability, and avoids problems such as slow gain convergence or gain fluctuation.

Hereinafter, according to various embodiments of the present disclosure, the method, device, and system of the present disclosure will be described in detail with reference to the accompanying drawings.

FIG. 1 is a flowchart of an audio processing method according to an embodiment of the present disclosure. The audio processing method according to the present disclosure can realize automatic gain control with high sound quality.

The audio processing method according to the present disclosure can be executed by any electronic device having an audio processing function. The electronic device may be at least one of a smart phone, a tablet computer, a laptop computer, a desktop computer, and the like. The electronic device may be installed with a target application for automatic gain control of incoming audio.

Referring to FIG. 1, in step S101, the current audio frame to be processed is acquired. For the input audio to be processed, frame division processing can be performed on the input audio, and then the operation described later is performed for each audio frame. Here, each audio frame may include several audio sample points. For example, an audio frame may contain signal samples over a period of 10-25 ms.
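The frame division described above can be sketched in Python as follows. This is a minimal illustration, not the disclosed implementation: the function name `split_into_frames`, the 16 kHz sample rate, the 10 ms frame length, and the choice to drop a trailing partial frame are all assumptions made for the example.

```python
def split_into_frames(samples, sample_rate=16000, frame_ms=10):
    """Split a sequence of samples into consecutive, non-overlapping frames.

    Assumption: trailing samples that do not fill a whole frame are dropped.
    """
    frame_len = sample_rate * frame_ms // 1000  # samples per frame
    n_frames = len(samples) // frame_len
    return [samples[i * frame_len:(i + 1) * frame_len] for i in range(n_frames)]
```

With these assumed values, each frame holds 160 samples, and each frame would then be passed through the steps described below.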

In step S102, the energy and type of the current audio frame are determined, where the type may include one of a speech frame and a non-speech frame. For example, a speech activity detection algorithm may be used to detect whether the current audio frame is a speech frame or a non-speech frame (non-speech frames include noise or silence, etc.).

In step S103, the speech energy distribution data for the current audio frame is obtained based on the energy and type of the current audio frame, wherein the speech energy distribution data can be used to count the proportions of speech frames in different energy intervals. The speech energy distribution data may be expressed in the form of a histogram. For example, the speech energy histogram may indicate the proportion of speech frames in different energy intervals.

In some embodiments, when the energy of the current audio frame is less than the preset noise threshold or the current audio frame is a non-speech frame, the speech energy distribution data of the previous audio frame of the current audio frame can be used as the speech energy distribution data of the current audio frame . When the energy of the current audio frame is greater than or equal to the preset noise threshold and the current audio frame is a speech frame, the speech energy distribution data of the previous audio frame of the current audio frame may be updated based on the energy of the current audio frame. Here, the preset noise threshold can be set differently according to actual conditions.

Here, when the current audio frame is the first frame of the input audio, the initial speech energy distribution data may be updated based on the energy of the current audio frame, where the speech frame proportions of the initial speech energy distribution data are uniformly distributed over the energy intervals. For example, the initial speech energy distribution data can be divided into several energy intervals, with the initial probability of each energy interval set to the same value and the initial probabilities summing to 1.

When updating the speech energy distribution data for the current audio frame, first determine the energy interval of the speech energy distribution data to which the energy of the current audio frame belongs. Then increase the speech frame proportion of the corresponding energy interval in the speech energy distribution data of the previous audio frame, and reduce the speech frame proportions of the other energy intervals in the speech energy distribution data of the previous audio frame.

After the speech frame proportions of the corresponding energy intervals are increased or decreased as described above, the sum of all probabilities in the speech energy distribution data must still be 1. Therefore, the sum of the speech frame proportions of the energy intervals in the updated speech energy distribution data is calculated, and a residual probability is determined by comparing this sum with a preset value. The residual probability is then distributed among the energy intervals of the updated speech energy distribution data until the sum of the speech frame proportions of the energy intervals equals the preset value. For example, the preset value may be 1, but the disclosure is not limited thereto.

In step S104, a first gain for the current audio frame is determined according to the speech energy distribution data for the current audio frame. In this disclosure, the first gain may also be referred to as a primary gain. As an example, starting from the first energy interval of the speech energy distribution data for the current audio frame, the speech frame proportions of the energy intervals are accumulated in sequence until the accumulated sum is equal to or greater than a preset threshold. When the accumulated sum is equal to the preset threshold, the upper limit of the energy interval at which the accumulated sum becomes equal to the preset threshold is taken as the first energy limit. When the accumulated sum is greater than the preset threshold, the lower limit of the energy interval at which the accumulated sum becomes greater than the preset threshold is taken as the first energy limit.
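The update and renormalization steps can be sketched in Python as follows. The text here does not state by how much the proportions are raised or lowered, how the intervals map to dBFS values, or how the residual is shared out. The step size `step`, the layout in `bin_index` (1 dB intervals starting at -100 dBFS), and the proportional sharing of the residual are therefore assumptions made for illustration only.

```python
def bin_index(energy_dbfs, lowest_db=-100.0, bin_width_db=1.0, num_bins=100):
    """Map an energy in dBFS to an interval index (assumed layout)."""
    idx = int((energy_dbfs - lowest_db) // bin_width_db)
    return max(0, min(num_bins - 1, idx))  # clamp to the valid index range


def update_histogram(prev_hist, energy_dbfs, step=0.01, preset_sum=1.0):
    """Raise the hit interval, lower the others, then restore the preset sum.

    Assumes at least two intervals. `step` is a hypothetical update size.
    """
    hist = list(prev_hist)
    hit = bin_index(energy_dbfs, num_bins=len(hist))
    decrement = step / (len(hist) - 1)
    for i in range(len(hist)):
        if i == hit:
            hist[i] += step
        else:
            hist[i] = max(0.0, hist[i] - decrement)  # never go negative
    total = sum(hist)
    residual = preset_sum - total
    # Assumed rule: share the residual in proportion to each interval's value,
    # so the proportions stay non-negative and sum to preset_sum.
    return [p + residual * p / total for p in hist]
```

Clamping at zero is what can make the sum drift away from the preset value, which is why the residual step is needed in this sketch.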
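The accumulation rule for the first energy limit can be sketched in Python as follows. The function name, the interval layout (`lowest_db`, `bin_width_db`), and the tolerance used for the equality test are assumptions. The sketch also returns the top of the last interval if the threshold is never reached, a case the text does not address.

```python
import math


def first_energy_limit(hist, threshold, lowest_db=-100.0, bin_width_db=1.0):
    """Accumulate proportions from the first interval and pick a limit in dBFS."""
    cumulative = 0.0
    for i, proportion in enumerate(hist):
        cumulative += proportion
        lower = lowest_db + i * bin_width_db
        upper = lower + bin_width_db
        # Sum equals the threshold: take the upper limit of this interval.
        if math.isclose(cumulative, threshold, rel_tol=0.0, abs_tol=1e-12):
            return upper
        # Sum exceeds the threshold: take the lower limit of this interval.
        if cumulative > threshold:
            return lower
    # Assumption: fall back to the top edge if the threshold is never reached.
    return lowest_db + len(hist) * bin_width_db
```

For a three-interval histogram `[0.25, 0.25, 0.5]` starting at -3 dBFS, a threshold of 0.25 is met exactly in the first interval and yields its upper limit, whereas a threshold of 0.2 is exceeded in the first interval and yields its lower limit.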

Next, an initial first gain is determined according to the target energy of the current audio frame and the first energy limit. Determine the frame number corresponding to the current audio frame according to the type of the current audio frame, adjust the initial first gain by comparing the frame number corresponding to the current audio frame with the preset frame number, and use the adjusted initial first gain as the first gain . Here, the preset threshold and the target energy may be set differently according to actual conditions.

In addition, before the initial first gain is adjusted to obtain the first gain, a gain range control may be first performed on the initial first gain so that the initial first gain meets actual requirements.
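One way to picture the initial first gain and its range control is the Python sketch below. It assumes that the gain is expressed in dB as the difference between the target energy and the first energy limit, which the text here does not state explicitly. The bounds `min_gain_db` and `max_gain_db` are hypothetical, and the frame-number-based adjustment is omitted because its rule is not given here.

```python
def initial_first_gain_db(target_db, first_energy_limit_db,
                          min_gain_db=-10.0, max_gain_db=30.0):
    """Initial first gain in dB, clamped to an assumed allowed range."""
    gain_db = target_db - first_energy_limit_db  # assumed relationship
    return max(min_gain_db, min(max_gain_db, gain_db))
```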

In step S105, the first gain is applied to the current audio frame to obtain a first audio frame. That is, once the first gain has been obtained, it can be applied to the original current audio frame to obtain the first audio frame.
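Applying a gain to a frame can be sketched in Python as follows, assuming the gain is given in dB and the samples are linear amplitudes, so that the amplitude scale factor is 10^(gain/20). The function name `apply_gain_db` is hypothetical.

```python
def apply_gain_db(frame, gain_db):
    """Scale every sample of a frame by a gain expressed in dB."""
    scale = 10.0 ** (gain_db / 20.0)  # dB to linear amplitude factor
    return [s * scale for s in frame]
```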

According to an embodiment of the present disclosure, after the first gain is applied to the original current audio frame, the second gain can be applied to the audio frame to which the first gain has been applied, that is, a two-stage gain fusion method is adopted to obtain the final audio signal . In this disclosure, the second gain may also be referred to as a secondary gain.

As an example, the second gain for the current audio frame may be determined based on the first gain and the energy of the current audio frame, and then the second gain may be applied to the first audio frame to obtain the second audio frame.

For example, the second energy limit may be determined based on the first gain and the energy of the current audio frame, and the initial second gain may be determined according to the target energy of the current audio frame and the second energy limit. In addition, the second energy limit may first be smoothed, and the initial second gain may then be determined using the smoothed second energy limit and the target energy of the current audio frame. The initial second gain is expanded into a second gain vector according to the audio sampling points in the current audio frame. Here, the gain for each audio sampling point in the current audio frame may be calculated based on the gain of the last audio sampling point in the previous audio frame of the current audio frame and the initial second gain, so as to generate the second gain vector. In addition, before the second gain vector is generated, gain range control may first be performed on the initial second gain so that the initial second gain meets actual requirements.

Next, each gain in the second gain vector can be applied to the corresponding audio sampling point of the first audio frame to obtain the second audio frame, and the amplitude of the second audio frame can then be limited to output the final audio signal. The audio processing method according to an embodiment of the present disclosure will be described in more detail below with reference to FIG. 2.
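The per-sample second gain vector can be sketched in Python as follows. The text specifies only that each sample's gain depends on the gain of the last sample of the previous frame and on the initial second gain. The linear ramp used here, the linear (not dB) gain domain, and the function name are assumptions for illustration.

```python
def second_gain_vector(prev_last_gain, initial_second_gain, num_samples):
    """Ramp linearly from the previous frame's last gain to the new gain."""
    step = (initial_second_gain - prev_last_gain) / num_samples
    # The last sample of the frame reaches initial_second_gain.
    return [prev_last_gain + step * (i + 1) for i in range(num_samples)]
```

A ramp of this kind avoids an abrupt gain jump at the frame boundary, which is one plausible reason for computing a gain per sample instead of per frame.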
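Applying the second gain vector and limiting the amplitude can be sketched in Python as follows. Hard clipping to a peak of 1.0 stands in for the limiter, whose actual design is not described here; the function name and the `peak` parameter are hypothetical.

```python
def apply_gain_vector_and_limit(frame, gains, peak=1.0):
    """Multiply each sample by its own gain, then limit to [-peak, peak]."""
    if len(frame) != len(gains):
        raise ValueError("frame and gain vector must have the same length")
    out = []
    for sample, gain in zip(frame, gains):
        value = sample * gain
        out.append(max(-peak, min(peak, value)))  # simple hard limit
    return out
```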

FIG. 2 is a schematic flowchart of an audio processing method according to an embodiment of the present disclosure. According to an embodiment of the present disclosure, the input audio can be divided into frames, and the audio processing method shown in FIG. 2 can then be applied to each audio frame of the input audio.

In the audio processing flow of the present disclosure, a speech energy calculation module, a voice activity detection module, a speech energy histogram statistics module, a dynamic preset gain (first-stage gain) calculation module, a second-stage gain calculation module, and a limiter module can be used to implement the audio processing method of the present disclosure.

For example, the speech energy calculation module calculates the energy of the currently input audio. The voice activity detection module determines whether the audio at the current time is in a speech stage or a non-speech stage (such as noise or silence). The speech energy histogram statistics module compiles statistics on the speech energy distribution over a past period of time. The dynamic preset gain (first-stage gain) calculation module calculates, from the speech energy distribution data and the voice activity detection result, the dynamic preset gain (that is, the first-stage gain) to be applied to the current input audio, and applies that gain to the currently input audio. The second-stage gain calculation module further adjusts the audio gain on the audio to which the first-stage gain has been applied. The limiter module protects the audio from clipping distortion in extreme cases.

Referring to FIG. 2, the input audio is divided into frames, and the current audio frame (assumed to be the nth frame) is denoted x(n), where n ∈ N. Each audio frame may contain the signal sampling points within a period of 10-25 ms; that is, x(n) is a vector of audio sampling points.

Input x(n) into the speech energy calculation module, and the energy of the nth frame of audio can be calculated according to equation (1):

Here, M is the number of audio sampling points contained in x(n), the energy is expressed in dBFS, and the result lies in the range (-∞, 0].
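Equation (1) is not reproduced in this text. One computation consistent with the stated unit and range is the mean-square energy in decibels, sketched below in Python. It is an assumption, not necessarily the formula of equation (1): it presumes samples normalized to [-1.0, 1.0], and the `floor_db` value substituted for -∞ on digital silence is hypothetical.

```python
import math


def frame_energy_dbfs(frame, floor_db=-100.0):
    """Mean-square energy of a frame in dBFS (samples assumed in [-1.0, 1.0])."""
    mean_square = sum(s * s for s in frame) / len(frame)
    if mean_square <= 0.0:
        return floor_db  # stand-in for -infinity on an all-zero frame
    return 10.0 * math.log10(mean_square)
```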

Input x(n) to the voice activity detection module, it can be judged whether the audio frequency of the current nth frame is in the speech stage or the non-speech stage (noise and silence, etc.), and the two states can be represented by the following equation (2) respectively:

Here, vad(n)=1 means that the current frame is a speech frame (speech active), and vad(n)=0 means that the current frame is a non-speech frame (speech inactive). The VAD algorithm is not restricted in any way.

Next, energyraw(n) and vad(n) are input into the speech energy histogram statistics module, which collects statistics on the speech energy over a past period of time (the length of which can be set according to the actual situation). The abscissa of the speech energy histogram is the set of energy intervals, each of which can be 1 dB wide, and the ordinate is the proportion of speech frames falling in each energy interval over that period. The speech energy distribution data of the current nth frame of audio can be represented by HistogramEnergy(n), and the statistical method is as follows.

First, HistogramEnergy(n) can be divided into several energy intervals. Here, 100 energy intervals, each 1 dB wide, are taken as an example; the present disclosure is not limited to this, and the values can be adjusted according to actual needs. HistogramEnergy(n) divided into energy intervals in this way can be expressed as equation (3):

HistogramEnergy(n)＝[e_1(n), e_2(n), ..., e_100(n)] (3)

The energy corresponding to each energy interval subscript increases in turn, and the corresponding relationship is as follows:

The initial probability (that is, the initial ratio) of each energy interval of the speech energy distribution data can be set to a uniform distribution; taking 100 energy intervals as an example, each interval starts with a probability of 1/100.

When vad(n)=0 or energyraw(n)<noisefloor, the current frame is a non-speech frame or its audio energy is below the noise threshold noisefloor (this value can be set to -50 dBFS, but is not limited to this). In that case the energy of the current nth frame need not take part in the statistics of the speech energy distribution data; for example, the speech energy distribution data of the previous audio frame can be used as the speech energy distribution data of the current audio frame, namely HistogramEnergy(n)=HistogramEnergy(n-1).

When vad(n)=1 and energyraw(n)≥noisefloor, the current frame is a speech frame whose energy is at or above the noise threshold, and HistogramEnergy(n) can be updated. For example, the energy interval to which the energy of the current audio frame belongs can be determined; in the speech energy distribution data of the previous audio frame, the speech frame proportion of that interval is then increased, and the speech frame proportions of all other intervals are reduced.

For example, first determine the index (subscript) in equation (3) of the energy interval that contains the energy energyraw(n) of the current audio frame, denoted as ex(n). The speech energy distribution data can then be updated as expressed in equation (4):

Here, histSmooth is the smoothing factor used for the speech energy distribution statistics. The smoothing factor can be set to 0.95, but it can be adjusted according to the demand and actual situation, or different parameter values can be selected for different energy intervals according to energyraw(n), among other approaches. The above examples are only exemplary, and the present disclosure is not limited thereto.
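Equation (4) and the interval-to-energy correspondence are not reproduced in this text. The sketch below illustrates one update that matches the verbal description (raise the matching interval, shrink all others, skip non-speech or low-energy frames). The exponential-smoothing form, the interval layout of 100 intervals spanning [-100, 0] dBFS, and the names bin_index and update_histogram are assumptions, not the patent's confirmed formula.

```python
NUM_BINS = 100        # number of 1 dB energy intervals (example from the text)
MIN_ENERGY = -100.0   # ASSUMED lower edge of the first interval, in dBFS
NOISE_FLOOR = -50.0   # noisefloor example value from the text
HIST_SMOOTH = 0.95    # histSmooth example value from the text


def bin_index(energy_dbfs):
    """0-based index ex(n) of the interval containing the energy (assumed layout)."""
    clamped = min(max(energy_dbfs, MIN_ENERGY), 0.0)
    k = int(clamped - MIN_ENERGY)      # truncation equals floor here (value >= 0)
    return min(k, NUM_BINS - 1)        # 0 dBFS falls into the last interval


def update_histogram(prev_hist, energy_dbfs, vad):
    """Return HistogramEnergy(n) given HistogramEnergy(n-1)."""
    if vad == 0 or energy_dbfs < NOISE_FLOOR:
        return list(prev_hist)                   # frame does not enter the statistics
    ex = bin_index(energy_dbfs)
    hist = [p * HIST_SMOOTH for p in prev_hist]  # shrink every interval
    hist[ex] += 1.0 - HIST_SMOOTH                # raise the matching interval
    return hist
```

With this form the proportions still sum to 1 in exact arithmetic, so any residual would come only from rounding or from a different, unconfirmed form of equation (4).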

In addition, since the sum of all probabilities in HistogramEnergy(n) must be 1, the sum of the speech energy distribution probabilities obtained in the above steps is calculated, the difference between 1 and that sum (that is, the residual probability) is determined, and the difference is distributed over the entire speech energy distribution data. In some embodiments, the residual probability residualPro(n) may be calculated according to equation (5) below:

After obtaining the residual probability, the residual probability can be assigned according to the following equation (6):

The above steps of allocating residual probability are repeated until residualPro(n)=0, at which point the updating of HistogramEnergy(n) ends.
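Equations (5) and (6) are not reproduced in this text. As a rough illustration only, the sketch below assumes that the residual is 1 minus the current sum and that it is shared equally among the intervals, with negative results clipped to zero, which is one reason the allocation might need repeating. The function name, the tolerance and the iteration cap are hypothetical; a floating-point implementation cannot rely on the residual reaching exactly 0.

```python
def normalize_histogram(hist, tol=1e-12, max_rounds=100):
    """Spread the residual probability over all intervals until the sum is ~1."""
    hist = list(hist)
    for _ in range(max_rounds):
        residual = 1.0 - sum(hist)       # ASSUMED form of residualPro(n)
        if abs(residual) <= tol:
            break                        # treated as residualPro(n) = 0
        share = residual / len(hist)     # ASSUMED: equal share per interval
        # Clip at zero so that no proportion becomes negative.
        hist = [max(0.0, p + share) for p in hist]
    return hist
```

A tolerance is used instead of an exact comparison with zero because rounding error would otherwise keep the loop running.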

According to the embodiment of the present disclosure, the speech energy distribution data is updated continuously from the first frame onward, and each audio frame of the input audio can update the corresponding speech energy distribution data according to the above equation (6).

The updated HistogramEnergy(n), vad(n) and x(n) are input to the dynamic preset gain (first-stage gain) module, which combines the energy distribution information over a period of time with the silence detection information to calculate the gain gainPre(n) for the current audio frame, and thereby obtains the audio xGainPre(n) with the dynamic preset gain applied. The calculation method is as follows.

The state of the current audio frame is determined from the vad(n) information, as shown in the following equation (7):

Here, the status of the current audio frame may refer to the current frame number corresponding to the current audio frame.
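Equation (7) is not reproduced in this text. One form that would be consistent with the later comparison of silenceState(n) against a frame-count threshold silThre is a counter of consecutive non-speech frames. This is offered only as an assumption about the missing equation:

```latex
\mathrm{silenceState}(n) =
\begin{cases}
0, & \mathrm{vad}(n) = 1 \\
\mathrm{silenceState}(n-1) + 1, & \mathrm{vad}(n) = 0
\end{cases}
```

Under this reading, silenceState(n) reaches silThre only after a continuous run of non-speech frames lasting roughly the 1 to 2 seconds mentioned for silThre.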

Next, starting from the first energy interval of the speech energy distribution data for the current audio frame, the speech frame proportions of the energy intervals are accumulated in sequence until the accumulated sum is equal to or greater than a preset threshold. When the accumulated sum equals the preset threshold, the upper limit of the energy interval at which the sum reaches the threshold is used as the first energy limit; when the accumulated sum exceeds the preset threshold, the lower limit of the energy interval at which the sum exceeds the threshold is used as the first energy limit.

As an example, the energy limit energyLevel (that is, the first energy limit) is computed from HistogramEnergy(n) as the level below which a fraction probThre (that is, the preset threshold) of the speech energy lies; in other words, over the statistics period, a fraction probThre of the speech energy in the distribution data is below energyLevel. Here, probThre can be set to 95%, but is not limited to this.

The calculation can proceed as follows: (i) set probSum=0; (ii) accumulate the energy probabilities in HistogramEnergy(n) in sequence, that is, probSum=probSum+e_i(n) for i=1,2,...,100; (iii) each time an energy probability is accumulated, compare probSum with probThre: if probSum<probThre, continue with the next energy probability; if probSum=probThre, energyLevel is the upper limit of the energy interval of e_i(n), and the calculation stops; if probSum>probThre, energyLevel is the lower limit of the energy interval of e_i(n), and the calculation stops.
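Steps (i) to (iii) can be sketched as follows. The interval layout (first interval starting at -100 dBFS, 1 dB wide) is an assumption, since the interval-to-energy correspondence is not reproduced in this text; the function name energy_level and the use of a small tolerance for the equality test are also illustrative choices.

```python
import math

MIN_ENERGY = -100.0   # ASSUMED lower edge of the first interval, in dBFS
BIN_WIDTH = 1.0       # interval width in dB (example from the text)


def energy_level(hist, prob_thre=0.95):
    """Return energyLevel for the distribution `hist` (list of proportions)."""
    prob_sum = 0.0                                    # step (i)
    for i, p in enumerate(hist):                      # step (ii), i is 0-based
        prob_sum += p
        lower = MIN_ENERGY + i * BIN_WIDTH
        upper = lower + BIN_WIDTH
        # Step (iii): equality is tested with a tolerance because of rounding.
        if math.isclose(prob_sum, prob_thre, rel_tol=0.0, abs_tol=1e-12):
            return upper
        if prob_sum > prob_thre:
            return lower
    # Threshold never reached (e.g. proportions sum to less than prob_thre).
    return MIN_ENERGY + len(hist) * BIN_WIDTH
```

For example, if all speech frames fall into the interval starting at -20 dBFS, the sum overshoots the threshold in that interval and its lower limit, -20 dBFS, is returned.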

Calculate the initial dynamic preset gain (ie, the initial first gain) gainPreRaw based on the energyLevel calculated above, which can be expressed as equation (8):

gainPreRaw＝EnergyTarget-energyLevel (8)

Among them, EnergyTarget is the energy expected to be achieved by the current audio frame, here, it can be set to -18dB, and can also be adjusted according to requirements. The gainPreRaw obtained at this time requires a certain gain range control. According to the actual situation, the gain range is generally [-6dB, 12dB], and it can also be adjusted according to the needs. The gainPreRaw can be adjusted according to the following equation (9):
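Equation (9) is not reproduced in this text; given the stated gain range, it is presumably a clamp of gainPreRaw to that range. A minimal sketch under that assumption, using the example values from the text and the hypothetical function name initial_first_gain:

```python
ENERGY_TARGET = -18.0             # EnergyTarget example value from the text
GAIN_MIN, GAIN_MAX = -6.0, 12.0   # example gain range from the text, in dB


def initial_first_gain(energy_level_db):
    """gainPreRaw after range control."""
    raw = ENERGY_TARGET - energy_level_db        # equation (8)
    return max(GAIN_MIN, min(GAIN_MAX, raw))     # ASSUMED form of equation (9)
```

With these values, an energyLevel of -20 dBFS yields 2 dB, while -40 dBFS would ask for 22 dB and is limited to 12 dB.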

The dynamic preset gain gainPre(n) (that is, the first gain) to be applied to the current audio is calculated according to the above-calculated initial dynamic preset gain gainPreRaw and silenceState(n).

The initial first gain may be adjusted by comparing the frame number corresponding to the current audio frame with the preset frame number, and the adjusted initial first gain may be used as the first gain. The method can be as follows.

If silenceState(n)≥silThre (where silThre is the number of audio frames corresponding to a period of time, which can be 1 second to 2 seconds), then gainPre(n)=gainPreRaw;

If silenceState(n)<silThre and gainPreRaw≥gainPre(n-1), then gainPre(n)=gainPreRaw×(1-sAtt)+gainPre(n-1)×sAtt, where sAtt is the attack smoothing factor, which is generally set to 0.9999 and can also be set according to the actual situation;

If silenceState(n)<silThre and gainPreRaw<gainPre(n-1), then gainPre(n)=gainPreRaw×(1-sRel)+gainPre(n-1)×sRel, where sRel is the release smoothing factor, which is generally set to 0.99 and can also be set according to the actual situation.
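The three cases above can be sketched as a single function. The smoothing values are the examples given in the text; the function name first_gain and its argument names are hypothetical, and sil_thre must be supplied by the caller (for instance about 100 frames for 1 second of 10 ms frames).

```python
S_ATT = 0.9999   # sAtt example value from the text (used when the gain rises)
S_REL = 0.99     # sRel example value from the text (used when the gain falls)


def first_gain(gain_pre_raw, prev_gain_pre, silence_state, sil_thre):
    """Return gainPre(n) from gainPreRaw, gainPre(n-1) and silenceState(n)."""
    if silence_state >= sil_thre:
        return gain_pre_raw                   # jump straight to the new gain
    # Otherwise move smoothly toward gainPreRaw with a one-pole filter.
    s = S_ATT if gain_pre_raw >= prev_gain_pre else S_REL
    return gain_pre_raw * (1.0 - s) + prev_gain_pre * s
```

Because sAtt is much closer to 1 than sRel, the gain rises far more slowly than it falls under these example values.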

After obtaining the first gain gainPre(n), apply it to the input original audio x(n), as shown in equation (10) below:

An audio signal to which a dynamic preset gain (first-stage gain) is applied can be obtained.
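Equation (10) is not reproduced in this text. Assuming gainPre(n) is expressed in decibels and the samples are amplitudes, applying it amounts to multiplying every sample by 10^(gainPre(n)/20). A short sketch, with the hypothetical function name apply_gain_db:

```python
def apply_gain_db(frame, gain_db):
    """Scale every sample of x(n) by a gain given in dB (amplitude convention)."""
    linear = 10.0 ** (gain_db / 20.0)   # dB to linear amplitude factor
    return [s * linear for s in frame]
```

A gain of 20 dB multiplies amplitudes by 10, and a gain of -6 dB multiplies them by about 0.5.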

Input gainPre(n), xGainPre(n), energyraw(n) and vad(n) to the secondary gain calculation module to further calculate the second gain gainPost(n) that needs to be applied to the current audio frame at this time. Considering that the gain corresponding to each sampling point in the current audio frame is different, the above gain can be regarded as a gain vector.

In some embodiments, the audio energy (that is, the second energy limit) after the dynamic preset gain is applied can be obtained according to gainPre(n) and energyraw(n), as shown in the following equation (11):

energyGainPreRaw＝gainPre(n)+energyraw(n) (11)

This audio energy is smoothed as shown in equation (12) below:

energyGainPreSmooth(n)＝energyGainPreSmooth(n-1)×smoothEnergy+energyGainPreRaw×(1-smoothEnergy) (12)

Among them, smoothEnergy represents the smoothing factor, energyGainPreSmooth(n) represents the smoothed audio energy of the current audio frame, and energyGainPreSmooth(n-1) represents the smoothed audio energy of the previous audio frame. In case the current audio frame is the first frame, energyGainPreSmooth(n-1) may be set to zero.

According to energyGainPreSmooth(n) and EnergyTarget, the expected value of the secondary gain (ie, the initial second gain) of the current audio frame can be calculated, as shown in the following equation (13):

gainPostRaw＝EnergyTarget-energyGainPreSmooth(n) (13)
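Equations (11) to (13) can be sketched together as below. The value of smoothEnergy is not given in the text, so the 0.9 used here is a placeholder; the function name initial_second_gain is hypothetical, and the sketch assumes energyraw(n) is finite (a digitally silent frame would need separate handling).

```python
ENERGY_TARGET = -18.0   # EnergyTarget example value from the text
SMOOTH_ENERGY = 0.9     # PLACEHOLDER: the text gives no value for smoothEnergy


def initial_second_gain(gain_pre, energy_raw, prev_smooth):
    """Return (gainPostRaw, energyGainPreSmooth(n)) before range limiting."""
    energy_gain_pre_raw = gain_pre + energy_raw                    # equation (11)
    smooth = (prev_smooth * SMOOTH_ENERGY
              + energy_gain_pre_raw * (1.0 - SMOOTH_ENERGY))       # equation (12)
    gain_post_raw = ENERGY_TARGET - smooth                         # equation (13)
    return gain_post_raw, smooth
```

The smoothed energy is returned alongside the gain so that the caller can pass it back in as prev_smooth for the next frame.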

As with gainPreRaw, gainPostRaw also needs to be limited to a gain range:

The gain vector gainPost(n) of the current audio frame can be obtained from the range-limited gain above. The gain vector and the audio frame can be expressed in the following form:

gainPost(n)＝[gainPost_1(n), gainPost_2(n), ..., gainPost_M(n)]^T

xGainPre(n)＝[xGainPre_1(n), xGainPre_2(n), ..., xGainPre_M(n)]^T

Each element in the current secondary gain vector can be calculated according to the following equation (14):

Wherein, i represents the i-th element in the current secondary gain vector, and gainPost_M(n-1) represents the gain of the Mth sampling point of the previous audio frame (that is, the last sampling point of the previous audio frame).
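Equation (14) is not reproduced in this text. The description states only that each element depends on the last gain of the previous frame and on the initial second gain. One simple scheme with that property is a linear ramp in dB across the frame, shown below purely as an assumption; the function name second_gain_vector is hypothetical.

```python
def second_gain_vector(gain_post_raw, prev_last_gain, m):
    """Per-sample gains in dB, ramping from gainPost_M(n-1) to gainPostRaw.

    ASSUMED linear interpolation; the patent's actual equation (14) may differ.
    """
    step = (gain_post_raw - prev_last_gain) / m
    # Element i (1-based) moves i steps away from the previous frame's last gain.
    return [prev_last_gain + step * i for i in range(1, m + 1)]
```

Ramping the gain sample by sample avoids an abrupt gain step at the frame boundary, which is a likely motivation for using a gain vector rather than a single value per frame.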

The secondary gain vector gainPost(n) obtained above is multiplied element by element with xGainPre(n) (note that the gain must first be converted from decibels to a linear gain), yielding the audio signal xGainPost(n) with the secondary gain applied, as shown in equation (15) below:

The output audio xGainPost(n) obtained above is input to the limiter module to ensure that the audio will not be clipped and distorted, as shown in equation (16) below:

y(n)＝Limiter[xGainPost(n)] (16)

Among them, Limiter[*] indicates that the amplitude of the input signal is limited and protected, and y(n) is the final output of a frame of audio signal after AGC processing.
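The text does not specify the limiter algorithm, and equation (15) is not reproduced. The sketch below shows element-wise application of a dB gain vector followed by a very simple stand-in limiter that rescales a frame whose peak exceeds full scale. Both function names are hypothetical, and a practical limiter would normally smooth its gain reduction over time rather than act frame by frame.

```python
def apply_gain_vector(frame, gains_db):
    """Element-wise product of xGainPre(n) with gainPost(n) converted to linear."""
    return [s * 10.0 ** (g / 20.0) for s, g in zip(frame, gains_db)]


def frame_limiter(frame, ceiling=1.0):
    """Stand-in for Limiter[*]: scale the frame down if its peak exceeds `ceiling`."""
    peak = max((abs(s) for s in frame), default=0.0)
    if peak <= ceiling:
        return list(frame)              # already within range, leave untouched
    scale = ceiling / peak              # bring the largest sample to the ceiling
    return [s * scale for s in frame]
```

Scaling the whole frame, rather than truncating individual samples, keeps the waveform shape intact, which fits the stated goal of avoiding clipping distortion.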

FIG. 3 is a block diagram of an audio processing device according to an embodiment of the present disclosure.

Referring to FIG. 3 , the audio processing device 300 may include an acquisition module 301 , a determination module 302 , a first gain module 303 , and a second gain module 304 . Each module in the audio processing device 300 may be implemented by one or more modules, and the names of the corresponding modules may vary according to the type of the module. In various embodiments, some modules in the audio processing device 300 may be omitted, or additional modules may also be included. Also, modules/elements according to various embodiments of the present disclosure may be combined to form a single entity, and thus may equivalently perform the functions of the corresponding modules/elements before combination.

The acquiring module 301 can acquire the current audio frame to be processed.

The determining module 302 can determine the energy and type of the current audio frame, the type including one of a speech frame and a non-speech frame.

The determining module 302 can obtain speech energy distribution data for the current audio frame based on the energy and type of the current audio frame, wherein the speech energy distribution data can be used to count the proportions of speech frames in different energy intervals.

The first gain module 303 can determine the first gain for the current audio frame according to the speech energy distribution data for the current audio frame, and apply the first gain to the current audio frame to obtain the first audio frame.

The second gain module 304 may determine a second gain for the current audio frame based on the first gain and the energy of the current audio frame, and apply the second gain to the first audio frame to obtain the second audio frame.

When the energy of the current audio frame is less than the preset noise threshold or the current audio frame is a non-speech frame, the determination module 302 may use the speech energy distribution data of the previous audio frame of the current audio frame as the speech energy distribution data of the current audio frame. When the energy of the current audio frame is greater than or equal to the preset noise threshold and the current audio frame is a speech frame, the determination module 302 may update the speech energy distribution data of the previous audio frame based on the energy of the current audio frame. When the current audio frame is the first frame, the determination module 302 may update the initial speech energy distribution data based on the energy of the current audio frame, where the speech frame proportions are uniformly distributed across the energy intervals of the initial speech energy distribution data.

The determination module 302 may determine the energy interval to which the energy of the current audio frame belongs in the speech energy distribution data, increase the speech frame proportion of that energy interval in the speech energy distribution data of the previous audio frame, and decrease the speech frame proportions of the other energy intervals in the speech energy distribution data of the previous audio frame.

The determination module 302 can calculate the sum of the speech frame proportions of the energy intervals in the updated speech energy distribution data, determine the residual probability by comparing that sum with a preset value, and distribute the residual probability over the energy intervals of the updated speech energy distribution data until the sum of the speech frame proportions equals the preset value. For example, the preset value can be set to 1.

The first gain module 303 may sequentially accumulate speech frame proportions of each energy interval starting from the first energy interval of the speech energy distribution data of the current audio frame until the accumulated sum is equal to or greater than a preset threshold. When the accumulated sum is equal to the preset threshold, the first gain module 303 may use the upper limit of the energy range accumulated to satisfy the accumulated sum equal to the preset threshold as the first energy limit. When the accumulated sum is greater than the preset threshold, the first gain module 303 may use the lower limit of the energy range that is accumulated to meet the requirement that the accumulated sum is greater than the preset threshold as the first energy limit. Then the first gain module 303 can determine the first gain according to the target energy of the current audio frame and the first energy limit.

The first gain module 303 can determine the initial first gain according to the target energy of the current audio frame and the first energy limit, determine the frame number corresponding to the current audio frame according to the type of the current audio frame, adjust the initial first gain by comparing the frame number corresponding to the current audio frame with the preset frame number, and use the adjusted initial first gain as the first gain.

The second gain module 304 may determine the second energy limit based on the first gain and the energy of the current audio frame, determine the initial second gain according to the target energy of the current audio frame and the second energy limit, and convert the initial second gain into a second gain vector according to the audio sampling points in the current audio frame.

The second gain module 304 can calculate the gain for each audio sampling point in the current audio frame based on the gain of the last audio sampling point in the previous audio frame of the current audio frame and the initial second gain, so as to generate a second gain vector.

The second gain module 304 can apply each gain in the second gain vector to corresponding audio sampling points of the first audio frame to obtain a second audio frame, and perform clipping processing on the amplitude of the second audio frame.

The automatic gain control process according to the embodiment of the present disclosure has been described in detail above with reference to FIG. 1 and FIG. 2 , and will not be described here again.

Fig. 4 is a schematic structural diagram of an audio processing device in a hardware operating environment according to an embodiment of the present disclosure.

As shown in FIG. 4, the audio processing device 400 may include a processing component 401, a communication bus 402, a network interface 403, an input and output interface 404, a memory 405 and a power supply component 404. The communication bus 402 is used to realize connection and communication between these components. The input and output interface 404 may include a video display (such as a liquid crystal display), a microphone and a speaker, and a user interaction interface (such as a keyboard, mouse, touch input device, etc.); in some embodiments, the input and output interface 404 may also include a standard wired interface and a wireless interface. The network interface 403 may include a standard wired interface and a wireless interface (such as a Wi-Fi interface). The memory 405 can be a high-speed random access memory or a stable non-volatile memory. The memory 405 may also be a storage device independent of the aforementioned processing component 401.

Those skilled in the art can understand that the structure shown in FIG. 4 does not constitute a limitation on the audio processing device 400, which may include more or fewer components than shown in the figure, combine some components, or arrange the components differently.

As shown in FIG. 4 , memory 405 as a storage medium may include an operating system (such as MAC operating system), a data storage module, a network communication module, a user interface module, an audio processing program, and a database.

In the audio processing device 400 shown in FIG. 4, the network interface 403 is mainly used for data communication with external electronic devices/terminals, and the input and output interface 404 is mainly used for data interaction with the user. The processing component 401 and the memory 405 can be provided in the audio processing device 400, and the audio processing device 400 calls, through the processing component 401, the audio processing program, materials and various APIs stored in the memory 405 to execute the audio processing method provided by the embodiments of the present disclosure.

The processing component 401 may include at least one processor, and a set of computer-executable instructions is stored in the memory 405. When the set of computer-executable instructions is executed by the at least one processor, the audio processing method according to the embodiment of the present disclosure is executed. However, the above-mentioned examples are only exemplary, and the present disclosure is not limited thereto.

The processing component 401 can obtain the current audio frame to be processed, determine the energy and type of the current audio frame, obtain the speech energy distribution data for the current audio frame based on that energy and type, determine a first gain for the current audio frame according to the speech energy distribution data, and apply the first gain to the current audio frame to obtain the first audio frame. It can then determine a second gain for the current audio frame based on the first gain and the energy of the current audio frame, and apply the second gain to the first audio frame to obtain a second audio frame.

The processing component 401 can realize control of components included in the audio processing device 400 by executing a program.

The audio processing device 400 may receive or output video and/or audio via the input-output interface 404 . For example, the audio processing device 400 can output the audio signal after the gain is applied via the input and output interface 404 .

As an example, the audio processing device 400 may be a PC computer, a tablet device, a personal digital assistant, a smart phone, or other devices capable of executing the above-mentioned set of instructions. Here, the audio processing device 400 does not have to be a single electronic device, but can also be any assembly of devices or circuits capable of individually or jointly executing the above-mentioned instructions (or instruction sets). The audio processing device 400 may also be part of an integrated control system or system manager, or may be configured as a portable electronic device that interfaces locally or remotely (eg, via wireless transmission).

In the audio processing device 400, the processing component 401 may include a central processing unit (CPU), a graphics processing unit (GPU), a programmable logic device, a special purpose processor system, a microcontroller or a microprocessor. By way of example and not limitation, the processing component 401 may also include analog processors, digital processors, microprocessors, multi-core processors, processor arrays, network processors, and the like.

Processing component 401 may execute instructions or codes stored in memory, where memory 405 may also store data. Instructions and data can also be sent and received via the network via the network interface 403, wherein the network interface 403 can adopt any known transmission protocol.

The memory 405 may be integrated with the processing component 401, for example, by placing RAM or flash memory within an integrated circuit microprocessor or the like. Additionally, the memory 405 may comprise a separate device, such as an external disk drive, storage array, or any other storage device usable by the database system. The memory 405 and the processing component 401 may be operatively coupled, or may communicate with each other, e.g., through I/O ports, network connections, etc., such that the processing component 401 can read data stored in the memory 405.

According to an embodiment of the present disclosure, an electronic device may be provided. FIG. 5 is a block diagram of an electronic device according to an embodiment of the present disclosure. The electronic device 500 may include at least one memory 502 and at least one processor 501. The at least one memory 502 stores a set of computer-executable instructions; when the set of computer-executable instructions is executed by the at least one processor 501, the audio processing method according to the embodiment of the present disclosure is executed.

Processor 501 may include a central processing unit (CPU), a graphics processing unit (GPU), a programmable logic device, a special purpose processor system, a microcontroller, or a microprocessor. By way of example and not limitation, the processor 501 may also include an analog processor, a digital processor, a microprocessor, a multi-core processor, a processor array, a network processor, and the like.

The memory 502 as a storage medium may include an operating system (eg, MAC operating system), a data storage module, a network communication module, a user interface module, an audio processing program, and a database.

The memory 502 can be integrated with the processor 501, for example, RAM or flash memory can be arranged in an integrated circuit microprocessor or the like. Additionally, memory 502 may comprise a separate device, such as an external disk drive, storage array, or any other storage device usable by the database system. Memory 502 and processor 501 may be operatively coupled, or may communicate with each other, such as through an I/O port, network connection, etc., such that processor 501 can read files stored in memory 502 .

In addition, the electronic device 500 may further include a video display (such as a liquid crystal display) and a user interaction interface (such as a keyboard, mouse, touch input device, etc.). All components of the electronic device 500 may be connected to each other via a bus and/or a network.

Those skilled in the art can understand that the structure shown in FIG. 5 does not constitute a limitation, and the electronic device may include more or fewer components than shown in the figure, combine some components, or arrange the components differently.

According to an embodiment of the present disclosure, there may also be provided a computer-readable storage medium storing instructions, wherein, when the instructions are executed by at least one processor, the at least one processor is caused to execute the audio processing method according to the present disclosure. Examples of computer-readable storage media herein include: read-only memory (ROM), programmable read-only memory (PROM), electrically erasable programmable read-only memory (EEPROM), random access memory (RAM), dynamic random access memory (DRAM), static random access memory (SRAM), flash memory, non-volatile memory, CD-ROM, CD-R, CD+R, CD-RW, CD+RW, DVD-ROM, DVD-R, DVD+R, DVD-RW, DVD+RW, DVD-RAM, BD-ROM, BD-R, BD-R LTH, BD-RE, Blu-ray or optical disc storage, hard disk drives (HDD), solid-state drives (SSD), card-type memory (such as MultiMediaCards, Secure Digital (SD) cards or eXtreme Digital (xD) cards), magnetic tapes, floppy disks, magneto-optical data storage devices, optical data storage devices, hard disks, solid-state disks, and any other device configured to store a computer program and any associated data, data files and data structures in a non-transitory manner and to provide them to a processor or computer so that the processor or computer can execute the computer program. The computer program in the above computer-readable storage medium can run in an environment deployed in computer equipment such as a client, a host, an agent device or a server. In addition, in one example, the computer program and any associated data, data files and data structures are distributed over network-connected computer systems, so that they are stored, accessed and executed in a distributed fashion by one or more processors or computers.

According to an embodiment of the present disclosure, a computer program product may also be provided, and instructions in the computer program product may be executed by a processor of a computer device to complete the above audio processing method.

Other embodiments of the present disclosure will be readily apparent to those skilled in the art from consideration of the specification and practice of the disclosure herein. This application is intended to cover any modification, use or adaptation of the present disclosure, and these modifications, uses or adaptations follow the general principles of the present disclosure and include common knowledge or conventional technical means in the technical field not disclosed in the present disclosure . The specification and examples are to be considered exemplary only, with a true scope and spirit of the disclosure being indicated by the following claims.

It should be understood that the present disclosure is not limited to the precise constructions which have been described above and shown in the drawings, and various modifications and changes may be made without departing from the scope thereof. The scope of the present disclosure is limited only by the appended claims.

Claims

An audio processing method comprising:

acquiring a current audio frame to be processed;

determining the energy and type of the current audio frame, the type comprising one of a speech frame and a non-speech frame;

obtaining speech energy distribution data for the current audio frame based on the energy and type of the current audio frame, wherein the speech energy distribution data is used to count the proportions of speech frames in different energy intervals;

determining a first gain for the current audio frame based on the speech energy distribution data for the current audio frame;

applying the first gain to the current audio frame to obtain a first audio frame.
The audio processing method according to claim 1, wherein obtaining the speech energy distribution data for the current audio frame based on the energy and type of the current audio frame comprises:

in response to the energy of the current audio frame being less than a preset noise threshold or the current audio frame being a non-speech frame, using the speech energy distribution data of the previous audio frame of the current audio frame as the speech energy distribution data of the current audio frame;

in response to the energy of the current audio frame being greater than or equal to the preset noise threshold and the current audio frame being a speech frame, updating the speech energy distribution data of the previous audio frame based on the energy of the current audio frame,

wherein, in response to the current audio frame being the first frame, the initial speech energy distribution data is updated based on the energy of the current audio frame, and the speech frame proportions are uniformly distributed across the energy intervals of the initial speech energy distribution data.
The audio processing method according to claim 2, wherein updating the speech energy distribution data of the previous audio frame based on the energy of the current audio frame comprises:

determining the energy interval to which the energy of the current audio frame belongs in the speech energy distribution data;

increasing the speech frame ratio of the energy interval corresponding to the determined energy interval in the speech energy distribution data of the previous audio frame;

reducing the speech frame proportion of the energy interval not corresponding to the determined energy interval in the speech energy distribution data of the previous audio frame.
The audio processing method according to claim 2 or 3, wherein updating the speech energy distribution data of the previous audio frame based on the energy of the current audio frame further comprises:

calculating the sum of the speech frame proportions of the energy intervals in the updated speech energy distribution data;

determining a residual probability by comparing the sum of the speech frame proportions with a preset value;

allocating the residual probability to the energy intervals of the updated speech energy distribution data until the sum of the speech frame proportions of the energy intervals of the updated speech energy distribution data reaches the preset value.
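The residual allocation can be illustrated with a minimal sketch. It assumes the preset value is 1.0 and that the residual is shared equally among all intervals in a single pass; the claim does not state the allocation rule, so both choices are assumptions.

```python
from typing import List


def renormalize(dist: List[float], preset_total: float = 1.0) -> List[float]:
    """Spread the residual (preset total minus actual sum) equally over all intervals."""
    residual = preset_total - sum(dist)
    share = residual / len(dist)  # assumed: equal share per interval
    return [p + share for p in dist]
```

A step like this keeps accumulated rounding or update error from letting the proportions drift away from the preset total over many frames.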
The audio processing method according to claim 1, wherein determining the first gain for the current audio frame according to the speech energy distribution data for the current audio frame comprises:

starting from the first energy interval of the speech energy distribution data for the current audio frame, sequentially accumulating the speech frame proportions of the energy intervals until the accumulated sum is equal to or greater than a preset threshold;

in response to the accumulated sum being equal to the preset threshold, taking the upper limit of the energy interval at which the accumulated sum becomes equal to the preset threshold as a first energy limit;

in response to the accumulated sum being greater than the preset threshold, taking the lower limit of the energy interval at which the accumulated sum becomes greater than the preset threshold as the first energy limit;

determining the first gain according to the target energy of the current audio frame and the first energy limit.
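The accumulation described in this claim can be sketched as below. Several details are assumptions, not taken from the claim: the intervals are taken to be fixed-width dB bins ordered from lowest to highest energy, equality is tested with a small tolerance `tol` because the proportions are floating-point values, and the gain is computed as a plain dB difference between the target energy and the first energy limit.

```python
from typing import List


def first_energy_limit(dist: List[float], threshold: float, min_db: float = -60.0,
                       step_db: float = 1.0, tol: float = 1e-9) -> float:
    """Accumulate proportions until the threshold is met, then pick an interval edge."""
    acc = 0.0
    for i, p in enumerate(dist):
        acc += p
        lower = min_db + i * step_db
        if abs(acc - threshold) <= tol:
            return lower + step_db  # sum equals threshold: upper limit of the interval
        if acc > threshold:
            return lower            # sum exceeds threshold: lower limit of the interval
    return min_db + len(dist) * step_db  # threshold never reached (assumed fallback)


def first_gain_db(target_db: float, limit_db: float) -> float:
    """Assumed form: gain in dB that would move the energy limit to the target energy."""
    return target_db - limit_db
```

For example, with four 10 dB intervals starting at -40 dB and proportions `[0.1, 0.2, 0.3, 0.4]`, a threshold of 0.7 is first exceeded in the last interval, so the lower limit of that interval is used.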
The audio processing method according to claim 5, wherein determining the first gain according to the target energy of the current audio frame and the first energy limit comprises:

determining an initial first gain according to the target energy of the current audio frame and the first energy limit;

determining the number of frames corresponding to the current audio frame according to the type of the current audio frame;

adjusting the initial first gain by comparing the number of frames corresponding to the current audio frame with a preset number of frames, and using the adjusted initial first gain as the first gain.
The audio processing method according to claim 1, further comprising:

determining a second energy limit based on the first gain and the energy of the current audio frame;

determining an initial second gain based on the target energy of the current audio frame and the second energy limit;

obtaining a second gain vector based on the audio sampling points in the current audio frame and the initial second gain;

Applying the second gain to the first audio frame to obtain a second audio frame.
The audio processing method according to claim 7, wherein obtaining the second gain vector based on the audio sampling points in the current audio frame and the initial second gain comprises:

based on the gain of the last audio sampling point in the previous audio frame of the current audio frame and the initial second gain, calculating a gain for each audio sampling point in the current audio frame to generate the second gain vector.
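A per-sample gain vector of this kind is often produced by ramping from the previous frame's final gain to the new gain, which avoids an audible jump at the frame boundary. The sketch below assumes a linear ramp in the linear-amplitude domain; the claim does not state the interpolation rule, and the function and parameter names are illustrative.

```python
from typing import List


def second_gain_vector(prev_last_gain: float, initial_second_gain: float,
                       num_samples: int) -> List[float]:
    """Linearly ramp from the previous frame's last gain to the new gain."""
    step = (initial_second_gain - prev_last_gain) / num_samples
    # The last sample of the frame lands exactly on the new gain.
    return [prev_last_gain + step * (n + 1) for n in range(num_samples)]
```

The last element of the returned vector would then serve as `prev_last_gain` for the next frame.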
The audio processing method according to claim 7, wherein applying the second gain to the first audio frame to obtain a second audio frame comprises:

applying each gain in the second gain vector to corresponding audio samples of the first audio frame to obtain a second audio frame; and

performing clipping processing on the amplitude of the second audio frame.
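Applying the gain vector and clipping can be sketched as follows. The sketch assumes floating-point samples normalized to the range [-1.0, 1.0] and hard clipping at a hypothetical `peak` value; the claim does not specify the sample format or the kind of clipping.

```python
from typing import List


def apply_gain_and_clip(frame: List[float], gains: List[float],
                        peak: float = 1.0) -> List[float]:
    """Multiply each sample by its own gain, then hard-clip to [-peak, peak]."""
    if len(frame) != len(gains):
        raise ValueError("frame and gain vector must have the same length")
    return [max(-peak, min(peak, s * g)) for s, g in zip(frame, gains)]
```

For instance, a sample of -0.75 with a gain of 2.0 would reach -1.5 and is clipped to -1.0.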
An audio processing device, comprising:

An acquisition module configured to acquire the current audio frame to be processed;

A determination module configured to determine the energy and the type of the current audio frame, the type including one of a speech frame and a non-speech frame; and obtain speech energy distribution data for the current audio frame based on the energy and type of the current audio frame, wherein the speech energy distribution data is used to count the proportions of speech frames in different energy intervals;

A first gain module configured to determine a first gain for the current audio frame according to the speech energy distribution data for the current audio frame; and apply the first gain to the current audio frame to obtain a first audio frame.
The audio processing device according to claim 10, wherein the determining module is configured to:

in response to the energy of the current audio frame being less than a preset noise threshold or the current audio frame being a non-speech frame, use the speech energy distribution data of the previous audio frame of the current audio frame as the speech energy distribution data of the current audio frame;

in response to the energy of the current audio frame being greater than or equal to the preset noise threshold and the current audio frame being a speech frame, update the speech energy distribution data of the previous audio frame based on the energy of the current audio frame,

wherein, in response to the current audio frame being the first frame, initial speech energy distribution data is updated based on the energy of the current audio frame, and the speech frame proportions are uniformly distributed across the energy intervals of the initial speech energy distribution data.
The audio processing device according to claim 11, wherein the determining module is configured to:

determine the energy interval to which the energy of the current audio frame belongs in the speech energy distribution data;

increase the speech frame proportion of the determined energy interval in the speech energy distribution data of the previous audio frame;

reduce the speech frame proportions of the energy intervals other than the determined energy interval in the speech energy distribution data of the previous audio frame.
The audio processing device according to claim 11 or 12, wherein the determining module is further configured to:

calculate the sum of the speech frame proportions of the energy intervals in the updated speech energy distribution data;

determine a residual probability by comparing the sum of the speech frame proportions with a preset value;

allocate the residual probability to the energy intervals of the updated speech energy distribution data until the sum of the speech frame proportions of the energy intervals of the updated speech energy distribution data reaches the preset value.
The audio processing device according to claim 10, wherein the first gain module is configured to:

starting from the first energy interval of the speech energy distribution data for the current audio frame, sequentially accumulate the speech frame proportions of the energy intervals until the accumulated sum is equal to or greater than a preset threshold;

in response to the accumulated sum being equal to the preset threshold, take the upper limit of the energy interval at which the accumulated sum becomes equal to the preset threshold as a first energy limit;

in response to the accumulated sum being greater than the preset threshold, take the lower limit of the energy interval at which the accumulated sum becomes greater than the preset threshold as the first energy limit;

determine the first gain according to the target energy of the current audio frame and the first energy limit.
The audio processing device according to claim 14, wherein the first gain module is configured to:

determine an initial first gain according to the target energy of the current audio frame and the first energy limit;

determine the number of frames corresponding to the current audio frame according to the type of the current audio frame;

adjust the initial first gain by comparing the number of frames corresponding to the current audio frame with a preset number of frames, and use the adjusted initial first gain as the first gain.
The audio processing device according to claim 10, further comprising a second gain module configured to:

determine a second energy limit based on the first gain and the energy of the current audio frame;

determine an initial second gain based on the target energy of the current audio frame and the second energy limit;

obtain a second gain vector based on the audio sampling points in the current audio frame and the initial second gain;

apply the second gain to the first audio frame to obtain a second audio frame.
The audio processing device according to claim 16, wherein the second gain module is configured to:

based on the gain of the last audio sampling point in the previous audio frame of the current audio frame and the initial second gain, calculate a gain for each audio sampling point in the current audio frame to generate the second gain vector.
The audio processing device according to claim 16, wherein the second gain module is configured to:

apply each gain in the second gain vector to the corresponding audio sampling point of the first audio frame to obtain a second audio frame; and

perform clipping processing on the amplitude of the second audio frame.
An electronic device, characterized in that it comprises:

at least one processor;

at least one memory storing computer-executable instructions,

Wherein, the computer-executable instructions, when executed by the at least one processor, cause the at least one processor to perform the following steps:

acquiring the current audio frame to be processed;

determining the energy and type of the current audio frame, the type comprising one of a speech frame and a non-speech frame;

Acquiring speech energy distribution data for the current audio frame based on the energy and type of the current audio frame, wherein the speech energy distribution data is used to count the proportions of speech frames in different energy intervals;

determining a first gain for the current audio frame based on speech energy distribution data for the current audio frame;

Applying the first gain to the current audio frame to obtain a first audio frame.
A computer-readable storage medium storing instructions, wherein the instructions, when executed by at least one processor, cause the at least one processor to perform the following steps:

acquiring the current audio frame to be processed;

determining the energy and type of the current audio frame, the type comprising one of a speech frame and a non-speech frame;

Acquiring speech energy distribution data for the current audio frame based on the energy and type of the current audio frame, wherein the speech energy distribution data is used to count the proportions of speech frames in different energy intervals;

determining a first gain for the current audio frame based on speech energy distribution data for the current audio frame;

Applying the first gain to the current audio frame to obtain a first audio frame.
A computer program product comprising instructions, wherein the instructions, when executed by at least one processor in an electronic device, cause the at least one processor to perform the following steps:

acquiring the current audio frame to be processed;

determining the energy and type of the current audio frame, the type comprising one of a speech frame and a non-speech frame;

Acquiring speech energy distribution data for the current audio frame based on the energy and type of the current audio frame, wherein the speech energy distribution data is used to count the proportions of speech frames in different energy intervals;

determining a first gain for the current audio frame based on speech energy distribution data for the current audio frame;

Applying the first gain to the current audio frame to obtain a first audio frame.