WO2023098103A9 - Audio processing method and audio processing apparatus - Google Patents

Publication number
WO2023098103A9
Authority
WO
WIPO (PCT)
Prior art keywords
audio frame
energy
speech
gain
frame
Prior art date
Application number
PCT/CN2022/107039
Other languages
French (fr)
Chinese (zh)
Other versions
WO2023098103A1 (en)
Inventor
李楠
张晨
Original Assignee
北京达佳互联信息技术有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 北京达佳互联信息技术有限公司 filed Critical 北京达佳互联信息技术有限公司
Publication of WO2023098103A1 publication Critical patent/WO2023098103A1/en
Publication of WO2023098103A9 publication Critical patent/WO2023098103A9/en

Classifications

    • H: ELECTRICITY
    • H03: ELECTRONIC CIRCUITRY
    • H03G: CONTROL OF AMPLIFICATION
    • H03G3/00: Gain control in amplifiers or frequency changers
    • H03G3/20: Automatic control
    • H03G3/30: Automatic control in amplifiers having semiconductor devices
    • H03G3/3005: Automatic control in amplifiers having semiconductor devices in amplifiers suitable for low-frequencies, e.g. audio amplifiers

Definitions

  • the present disclosure relates to the field of audio technology, and in particular, to an audio processing method and an audio processing device for automatic gain control.
  • for automatic gain control (AGC), the volume control ability is mainly reflected in the gain convergence time (that is, the time required to reach a reasonable gain for a stretch of audio with stable volume) and the gain control range (that is, the range over which the gain can change), while the audio quality is mainly reflected in objective evaluation indicators such as Perceptual Evaluation of Speech Quality (PESQ) and Perceptual Objective Listening Quality Analysis (POLQA).
  • the disclosure provides an audio processing method and an audio processing device.
  • an audio processing method, which may include: acquiring a current audio frame to be processed; determining the energy and type of the current audio frame, the type being one of a speech frame and a non-speech frame; obtaining speech energy distribution data for the current audio frame based on the energy and type of the current audio frame, wherein the speech energy distribution data records the proportion of speech frames in different energy intervals; determining a first gain for the current audio frame according to the speech energy distribution data for the current audio frame; and applying the first gain to the current audio frame to obtain a first audio frame.
  • obtaining the speech energy distribution data for the current audio frame based on the energy and type of the current audio frame may include: when the energy of the current audio frame is less than a preset noise threshold or the current audio frame is a non-speech frame, using the speech energy distribution data of the previous audio frame as the speech energy distribution data of the current audio frame; and when the energy of the current audio frame is greater than or equal to the preset noise threshold and the current audio frame is a speech frame, updating the speech energy distribution data of the previous audio frame based on the energy of the current audio frame, wherein, when the current audio frame is the first frame, the initial speech energy distribution data is updated based on the energy of the current audio frame, and the speech frame proportions of the energy intervals of the initial speech energy distribution data are uniformly distributed.
  • updating the speech energy distribution data of the previous audio frame based on the energy of the current audio frame may include: determining the energy interval to which the energy of the current audio frame belongs in the speech energy distribution data; increasing the speech frame proportion of the energy interval, in the speech energy distribution data of the previous audio frame, that corresponds to the determined energy interval; and reducing the speech frame proportions of the energy intervals, in the speech energy distribution data of the previous audio frame, that do not correspond to the determined energy interval.
  • updating the speech energy distribution data of the previous audio frame based on the energy of the current audio frame may further include: calculating the sum of the speech frame proportions of the energy intervals in the updated speech energy distribution data; determining a residual probability by comparing that sum with a preset value; and distributing the residual probability across the energy intervals of the updated speech energy distribution data until the sum of the speech frame proportions of the energy intervals equals the preset value.
  • determining the first gain for the current audio frame according to the speech energy distribution data for the current audio frame may include: starting from the first energy interval of the speech energy distribution data for the current audio frame, accumulating the speech frame proportions of the energy intervals in turn until the accumulated sum is equal to or greater than a preset threshold; when the accumulated sum is equal to the preset threshold, taking the upper limit of the energy interval at which the accumulated sum equals the preset threshold as a first energy limit; when the accumulated sum is greater than the preset threshold, taking the lower limit of the energy interval at which the accumulated sum exceeds the preset threshold as the first energy limit; and determining the first gain according to the target energy of the current audio frame and the first energy limit.
  • determining the first gain according to the target energy of the current audio frame and the first energy limit may include: determining an initial first gain according to the target energy of the current audio frame and the first energy limit; determining the frame number corresponding to the current audio frame according to the type of the current audio frame; and adjusting the initial first gain by comparing the frame number corresponding to the current audio frame with a preset frame number, and using the adjusted initial first gain as the first gain.
  • the audio processing method may further include: determining a second energy limit based on the first gain and the energy of the current audio frame; determining an initial second gain according to the target energy of the current audio frame and the second energy limit; obtaining a second gain vector based on the audio sampling points in the current audio frame and the initial second gain; and applying the second gain vector to the first audio frame to obtain a second audio frame.
  • obtaining the second gain vector based on the audio sampling points in the current audio frame and the initial second gain may include: calculating the gain for each audio sampling point in the current audio frame based on the gain of the last audio sampling point in the previous audio frame and the initial second gain, so as to generate the second gain vector.
  • applying the second gain vector to the first audio frame to obtain a second audio frame may include: applying each gain in the second gain vector to the corresponding audio sampling point of the first audio frame to obtain the second audio frame; and limiting the amplitude of the second audio frame.
  • an audio processing device, which may include: an acquisition module configured to acquire a current audio frame to be processed; a determination module configured to determine the energy and type of the current audio frame, the type being one of a speech frame and a non-speech frame, and to obtain speech energy distribution data for the current audio frame based on the energy and type of the current audio frame, wherein the speech energy distribution data records the proportion of speech frames in different energy intervals; and a first gain module configured to determine a first gain for the current audio frame according to the speech energy distribution data for the current audio frame, and to apply the first gain to the current audio frame to obtain a first audio frame.
  • the determination module may be configured to: when the energy of the current audio frame is less than a preset noise threshold or the current audio frame is a non-speech frame, use the speech energy distribution data of the previous audio frame as the speech energy distribution data of the current audio frame; and when the energy of the current audio frame is greater than or equal to the preset noise threshold and the current audio frame is a speech frame, update the speech energy distribution data of the previous audio frame based on the energy of the current audio frame, wherein, when the current audio frame is the first frame, the initial speech energy distribution data is updated based on the energy of the current audio frame, and the speech frame proportions of the energy intervals of the initial speech energy distribution data are uniformly distributed.
  • the determination module may be configured to: determine the energy interval to which the energy of the current audio frame belongs in the speech energy distribution data; increase the speech frame proportion of the energy interval, in the speech energy distribution data of the previous audio frame, that corresponds to the determined energy interval; and reduce the speech frame proportions of the energy intervals that do not correspond to the determined energy interval.
  • the determination module may be configured to: calculate the sum of the speech frame proportions of the energy intervals in the updated speech energy distribution data; determine a residual probability by comparing that sum with a preset value; and distribute the residual probability across the energy intervals of the updated speech energy distribution data until the sum of the speech frame proportions of the energy intervals equals the preset value.
  • the first gain module may be configured to: starting from the first energy interval of the speech energy distribution data for the current audio frame, accumulate the speech frame proportions of the energy intervals in turn until the accumulated sum is equal to or greater than a preset threshold; when the accumulated sum is equal to the preset threshold, take the upper limit of the energy interval at which the accumulated sum equals the preset threshold as a first energy limit; when the accumulated sum is greater than the preset threshold, take the lower limit of the energy interval at which the accumulated sum exceeds the preset threshold as the first energy limit; and determine the first gain according to the target energy of the current audio frame and the first energy limit.
  • the first gain module may be configured to: determine an initial first gain according to the target energy of the current audio frame and the first energy limit; determine the frame number corresponding to the current audio frame according to the type of the current audio frame; and adjust the initial first gain by comparing the frame number corresponding to the current audio frame with a preset frame number, and use the adjusted initial first gain as the first gain.
  • the audio processing device may further include a second gain module configured to: determine a second energy limit based on the first gain and the energy of the current audio frame; determine an initial second gain according to the target energy of the current audio frame and the second energy limit; obtain a second gain vector based on the audio sampling points in the current audio frame and the initial second gain; and apply the second gain vector to the first audio frame to obtain a second audio frame.
  • the second gain module may be configured to: calculate the gain for each audio sampling point in the current audio frame based on the gain of the last audio sampling point in the previous audio frame and the initial second gain, so as to generate the second gain vector.
  • the second gain module may be configured to: apply each gain in the second gain vector to the corresponding audio sampling point of the first audio frame to obtain a second audio frame; and limit the amplitude of the second audio frame.
  • an electronic device, which may include: at least one processor; and at least one memory storing computer-executable instructions, wherein the computer-executable instructions, when executed by the at least one processor, cause the at least one processor to execute the audio processing method described above.
  • a computer-readable storage medium storing instructions, wherein the instructions, when executed by at least one processor, cause the at least one processor to execute the audio processing method described above.
  • a computer program product, wherein instructions in the computer program product are executed by at least one processor in an electronic device to execute the audio processing method described above.
  • the disclosed method can better control the voice volume, achieving a shorter gain convergence time and a larger gain control range. While ensuring good volume control, it keeps the gain relatively stable and preserves high audio quality.
  • the dynamic preset gain is determined more accurately by using the energy distribution of the input speech, thereby obtaining speech with higher sound quality.
  • FIG. 1 is a flowchart of an audio processing method according to an embodiment of the present disclosure.
  • FIG. 2 is a schematic flowchart of an audio processing method according to an embodiment of the present disclosure.
  • FIG. 3 is a block diagram of an audio processing device according to an embodiment of the present disclosure.
  • FIG. 4 is a schematic structural diagram of an audio processing device according to an embodiment of the present disclosure.
  • FIG. 5 is a block diagram of an electronic device according to an embodiment of the present disclosure.
  • an existing AGC algorithm can preset a fixed, relatively high gain for the input audio and protect the audio with amplitude limiting.
  • however, when a high gain is applied at a large volume, the amplitude limiting module distorts the audio waveform heavily, which makes high sound quality difficult to guarantee. Alternatively, the algorithm can refer to the audio energy over a certain period of time and calculate the gain that currently needs to be applied to the audio.
  • that solution usually causes the gain to change too much or the volume response to be insensitive, so it takes a long time to reach a reasonable gain for a piece of audio, and it is difficult to balance the volume control ability against the quality of the processed audio.
  • the present disclosure therefore proposes an AGC method that combines a dynamic preset gain with short-term energy gain control, which preserves sound quality while giving the algorithm strong gain control ability and avoiding problems such as slow gain convergence or gain fluctuation.
  • FIG. 1 is a flowchart of an audio processing method according to an embodiment of the present disclosure.
  • the audio processing method according to the present disclosure can realize automatic gain control with high sound quality.
  • the audio processing method according to the present disclosure can be executed by any electronic device having an audio processing function.
  • the electronic device may be at least one of a smart phone, a tablet computer, a laptop computer, a desktop computer, and the like.
  • the electronic device may be installed with a target application for automatic gain control of incoming audio.
  • in step S101, the current audio frame to be processed is acquired.
  • frame division processing can be performed on the input audio, and then the operation described later is performed for each audio frame.
  • each audio frame may include several audio sample points.
  • an audio frame may contain signal samples over a period of 10-25 ms.
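The frame division described above can be sketched as follows. This is a minimal illustration rather than the disclosed implementation: the function name `split_into_frames`, the 20 ms default and the choice to drop an incomplete trailing frame are all assumptions made for the example.

```python
def split_into_frames(samples, sample_rate, frame_ms=20):
    """Split a sample sequence into non-overlapping frames of frame_ms milliseconds.

    Trailing samples that do not fill a whole frame are dropped (an assumption;
    padding the last frame would be an equally valid choice).
    """
    frame_len = sample_rate * frame_ms // 1000
    if frame_len <= 0:
        raise ValueError("frame length must be at least one sample")
    n_frames = len(samples) // frame_len
    return [samples[i * frame_len:(i + 1) * frame_len] for i in range(n_frames)]


# One second of silence at 16 kHz, cut into 20 ms frames.
frames = split_into_frames([0.0] * 16000, sample_rate=16000, frame_ms=20)
```

Each element of `frames` plays the role of one audio frame x(n) in the later steps.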
  • in step S102, the energy and type of the current audio frame are determined, where the type may be one of a speech frame and a non-speech frame.
  • a speech activity detection algorithm may be used to detect whether the current audio frame is a speech frame or a non-speech frame (non-speech frames include noise or silence, etc.).
  • the speech energy distribution data for the current audio frame is obtained based on the energy and type of the current audio frame, wherein the speech energy distribution data can be used to count the proportions of speech frames in different energy intervals.
  • the speech energy distribution data may be expressed in the form of a histogram.
  • the speech energy histogram may indicate the proportion of speech frames in different energy intervals.
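  • when the energy of the current audio frame is less than a preset noise threshold, or the current audio frame is a non-speech frame, the speech energy distribution data of the previous audio frame can be used as the speech energy distribution data of the current audio frame.
  • when the energy of the current audio frame is greater than or equal to the preset noise threshold and the current audio frame is a speech frame, the speech energy distribution data of the previous audio frame may be updated based on the energy of the current audio frame.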
  • the preset noise threshold can be set differently according to actual conditions.
  • when the current audio frame is the first frame, the initial speech energy distribution data may be updated based on the energy of the current audio frame, where the speech frame proportions of the energy intervals of the initial speech energy distribution data are uniformly distributed.
  • the initial speech energy distribution data can be divided into several energy intervals, the initial probability of each energy interval is set to a uniform distribution, and the sum of the initial probabilities of each energy interval is 1.
  • when updating the speech energy distribution data for the current audio frame, first determine the energy interval to which the energy of the current audio frame belongs, then increase the speech frame proportion of that energy interval in the speech energy distribution data of the previous audio frame and reduce the speech frame proportions of all the other energy intervals.
  • the preset value may be 1, but the disclosure is not limited thereto.
  • a first gain for the current audio frame is determined according to the speech energy distribution data for the current audio frame.
  • the first gain may also be referred to as a primary gain.
  • starting from the first energy interval of the speech energy distribution data, the speech frame proportions of the energy intervals are accumulated in turn until the accumulated sum is equal to or greater than the preset threshold. When the accumulated sum equals the preset threshold, the upper limit of the energy interval at which this happens is taken as the first energy limit; when the accumulated sum exceeds the preset threshold, the lower limit of that energy interval is taken as the first energy limit.
  • an initial first gain is determined according to the target energy of the current audio frame and the first energy limit. The frame number corresponding to the current audio frame is determined according to the type of the current audio frame, the initial first gain is adjusted by comparing that frame number with the preset frame number, and the adjusted initial first gain is used as the first gain.
  • the preset threshold and the target energy may be set differently according to actual conditions.
  • a gain range control may be first performed on the initial first gain so that the initial first gain meets actual requirements.
  • in step S105, the first gain is applied to the current audio frame to obtain a first audio frame. That is, once the first gain has been obtained, it is applied to the original current audio frame.
  • a second gain can then be applied to the audio frame to which the first gain has been applied; that is, a two-stage gain fusion method is adopted to obtain the final audio signal.
  • the second gain may also be referred to as a secondary gain.
  • the second gain for the current audio frame may be determined based on the first gain and the energy of the current audio frame, and then the second gain may be applied to the first audio frame to obtain the second audio frame.
  • the second energy limit may be determined based on the first gain and the energy of the current audio frame, and the initial second gain may be determined according to the target energy of the current audio frame and the second energy limit.
  • smoothing may be performed on the second energy boundary first, and then the initial second gain may be determined using the smoothed second energy boundary and the target energy of the current audio frame.
  • the initial second gain is changed into a second gain vector according to the audio sample point in the current audio frame.
  • the gain for each audio sample point in the current audio frame may be calculated based on the gain of the last audio sample point in the previous audio frame of the current audio frame and the initial second gain, so as to generate the second gain vector.
  • a gain range control may be first performed on the initial second gain, so that the initial second gain meets actual requirements.
  • each gain in the second gain vector can be applied to the corresponding audio sample of the first audio frame to obtain the second audio frame, and then the amplitude of the second audio frame can be limited to output the final audio signal.
  • the audio processing method according to an embodiment of the present disclosure will be described in more detail below with reference to FIG. 2 .
  • Fig. 2 is a schematic flowchart of an audio processing method according to an embodiment of the present disclosure.
  • the input audio can be processed by frame division, and then the audio processing method shown in FIG. 2 can be applied to each audio frame of the input audio.
  • a voice energy calculation module, a voice activity detection module, a speech energy histogram statistics module, a dynamic preset gain (first-stage gain) calculation module, a secondary gain calculation module and a limiter module can be used to implement the audio processing method of the present disclosure.
  • the voice energy calculation module calculates the energy of the current input audio.
  • the voice activity detection module judges whether the audio at the current time is in a speech stage or a non-speech stage (such as noise or silence).
  • the speech energy histogram statistics module collects statistics on the voice energy distribution over a past period of time.
  • the dynamic preset gain (first-stage gain) calculation module calculates the dynamic preset gain (that is, the first-stage gain) that currently needs to be applied to the input audio according to the voice energy distribution data and the voice activity detection result, and applies that gain to the currently input audio.
  • the secondary gain calculation module further adjusts the audio gain on the basis of the audio to which the first-stage gain has been applied.
  • the limiter module protects the audio from clipping distortion in some extreme cases.
  • the input audio is divided into frames, and the current audio frame (assumed to be the nth frame of audio) is denoted x(n), where n ∈ N. Each audio frame can contain the signal sampling points within a 10-25 ms window; that is, x(n) is a vector of audio sampling points.
  • M is the number of audio sampling points contained in x(n), the energy unit is dBFS, and the value range of the calculation result can be (-∞, 0].
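The energy formula itself is not reproduced in this text, so the following is only a sketch of one common way to obtain a per-frame energy in dBFS with a range of (-∞, 0]. It assumes samples normalised to [-1, 1] and uses the mean square of the M samples; the function name `frame_energy_dbfs` and the finite `floor_db` used for silent frames are assumptions made for the example.

```python
import math


def frame_energy_dbfs(frame, floor_db=-100.0):
    """Mean-square energy of one frame in dBFS, assuming samples lie in [-1, 1].

    floor_db is a hypothetical finite stand-in for -infinity on silent frames.
    """
    if not frame:
        raise ValueError("frame must contain at least one sample")
    mean_square = sum(s * s for s in frame) / len(frame)  # average over the M samples
    if mean_square <= 0.0:
        return floor_db
    # Clamp into [floor_db, 0] so the result stays in the stated range.
    return min(0.0, max(floor_db, 10.0 * math.log10(mean_square)))
```

A full-scale frame gives 0 dBFS, and a frame whose samples all have amplitude 0.1 gives about -20 dBFS.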
  • when x(n) is input to the voice activity detection module, it can be judged whether the current nth frame of audio is in the speech stage or the non-speech stage (noise, silence, etc.), and the two states can be represented by the following equation (2):
  • when vad(n) is 1, the current frame is a speech frame (speech active); when vad(n) is 0, the current frame is a non-speech frame (speech inactive). The VAD algorithm is not restricted in any way here.
  • the abscissa of the speech energy histogram is the set of energy intervals, each of which can be 1 dB wide, and its ordinate is the proportion of speech frames falling in each energy interval over a period of time.
  • the speech energy distribution data of the current nth frame of audio can be represented by HistogramEnergy(n), and the statistical method is as follows.
  • HistogramEnergy(n) can be divided into several energy intervals.
  • the following takes 100 energy intervals, each 1 dB wide, as an example; these values can be adjusted according to actual needs, and the disclosure is not limited to them.
  • with the energy intervals divided in this way, HistogramEnergy(n) can be expressed as equation (3):
  • HistogramEnergy(n) = [e_1(n), e_2(n), ..., e_100(n)]   (3)
  • the initial probability (that is, the initial ratio) of each energy interval of the speech energy distribution data can be set to a uniform distribution, taking 100 energy intervals as an example:
  • histSmooth is the smoothing factor used for speech energy distribution data statistics.
  • the smoothing factor can be set to 0.95, but it can be adjusted according to demand and the actual situation, for example by selecting different parameters for different energy intervals according to energyraw(n). The above examples are only exemplary, and the present disclosure is not limited thereto.
  • the residual probability residualPro(n) may be calculated according to equation (5) below:
  • the residual probability can be assigned according to the following equation (6):
  • the speech energy distribution data is updated continuously from the first frame onward, and each audio frame of the input audio can update the corresponding speech energy distribution data according to the above equation (6).
  • the dynamic preset gain (first-stage gain) module can combine the energy distribution information and the silence detection information over a period of time to calculate the gain gainPre(n) to be applied to the current audio frame, so as to obtain the audio xGainPre(n) with the dynamic preset gain applied. The calculation method is as follows.
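The histogram update can be sketched as below. The update equations are not reproduced in this text, so the exact form used here is an assumption: every interval is scaled by histSmooth, the interval containing the current frame's energy receives the remaining 1 - histSmooth, and any residual probability (here only rounding error) is spread evenly over all intervals. The names, the assumed interval layout of [-100, 0) dBFS and the -60 dBFS noise threshold are hypothetical.

```python
NUM_BINS = 100        # number of energy intervals, as in the example above
BIN_WIDTH_DB = 1.0    # width of each energy interval
MIN_DB = -100.0       # assumed lower edge of the first interval


def initial_histogram():
    """Uniform initial distribution whose proportions sum to 1."""
    return [1.0 / NUM_BINS] * NUM_BINS


def bin_index(energy_db):
    """Index of the energy interval containing energy_db, clamped to a valid bin."""
    idx = int((energy_db - MIN_DB) // BIN_WIDTH_DB)
    return min(NUM_BINS - 1, max(0, idx))


def update_histogram(prev_hist, energy_db, is_speech,
                     noise_threshold_db=-60.0, hist_smooth=0.95):
    """Return the speech energy distribution for the current frame."""
    # Non-speech or below-threshold frames reuse the previous distribution.
    if not is_speech or energy_db < noise_threshold_db:
        return list(prev_hist)
    k = bin_index(energy_db)
    hist = [p * hist_smooth for p in prev_hist]   # shrink every interval
    hist[k] += 1.0 - hist_smooth                  # grow the matching interval
    residual = 1.0 - sum(hist)                    # residual probability vs. preset value 1
    share = residual / NUM_BINS
    return [p + share for p in hist]


hist = update_histogram(initial_histogram(), energy_db=-18.5, is_speech=True)
```

With this form the proportions keep summing to the preset value of 1 after every update, and repeated speech frames at a similar level concentrate the distribution around that level.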
  • the status of the current audio frame may refer to the current frame number corresponding to the current audio frame.
  • starting from the first energy interval, the speech frame proportions of the energy intervals are accumulated in turn until the accumulated sum is equal to or greater than a preset threshold.
  • when the accumulated sum equals the preset threshold, the upper limit of the energy interval at which the sum is reached is used as the first energy limit; when the accumulated sum exceeds the preset threshold, the lower limit of that energy interval is used as the first energy limit.
  • in other words, the limit energyLevel (the first energy limit) below which the energy falls with probability probThre (the preset threshold) is derived from HistogramEnergy(n); that is, a proportion probThre of the speech energy observed over a period of time lies below energyLevel. Here probThre can be set to 95%, but it is not limited to this value.
  • gainPreRaw = EnergyTarget - energyLevel   (8)
  • EnergyTarget is the energy expected to be achieved by the current audio frame, here, it can be set to -18dB, and can also be adjusted according to requirements.
  • the gainPreRaw obtained at this point requires gain range control. In practice the gain range is generally [-6 dB, 12 dB], and it can also be adjusted as needed.
  • gainPreRaw can be limited to this range according to equation (9).
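The accumulation rule, equation (8) and the range control can be sketched together as follows. This assumes the "first energy interval" is the lowest-energy one, that the intervals cover [-100, 0) dBFS in 1 dB steps, and that range control is a simple clamp; the function names and the floating-point tolerance are choices made for the example.

```python
import math


def first_energy_limit(hist, prob_thre=0.95, min_db=-100.0, bin_width_db=1.0):
    """Accumulate proportions from the first interval until prob_thre is reached."""
    cumulative = 0.0
    for i, proportion in enumerate(hist):
        cumulative += proportion
        lower = min_db + i * bin_width_db
        if math.isclose(cumulative, prob_thre, rel_tol=1e-9, abs_tol=1e-12):
            return lower + bin_width_db   # sum equals threshold: upper limit
        if cumulative > prob_thre:
            return lower                  # sum exceeds threshold: lower limit
    return min_db + len(hist) * bin_width_db  # threshold never reached


def preset_gain_raw(hist, energy_target_db=-18.0,
                    gain_min_db=-6.0, gain_max_db=12.0):
    """gainPreRaw = EnergyTarget - energyLevel, clamped to the allowed gain range."""
    energy_level = first_energy_limit(hist)
    raw = energy_target_db - energy_level
    return min(gain_max_db, max(gain_min_db, raw))


# All speech energy concentrated in the interval [-30, -29) dBFS.
quiet_hist = [0.0] * 100
quiet_hist[70] = 1.0
quiet_gain = preset_gain_raw(quiet_hist)
```

For `quiet_hist` the energy limit is -30 dBFS, so the raw gain of 12 dB sits exactly at the upper end of the example range.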
  • the dynamic preset gain gainPre(n) (that is, the first gain) to be applied to the current audio is calculated according to the above-calculated initial dynamic preset gain gainPreRaw and silenceState(n).
  • the initial first gain may be adjusted by comparing the frame number corresponding to the current audio frame with the preset frame number, and the adjusted initial first gain may be used as the first gain.
  • the method can be as follows.
  • silenceState(n) is the frame number state of the current audio frame, and silThre is the number of audio frames corresponding to a period of time, which can be 1 second to 2 seconds
  • gainPre(n) = gainPreRaw
  • gainPre(n) = gainPreRaw × (1 - sAtt) + gainPre(n-1) × sAtt, where sAtt is the follow-up smoothing factor, generally set to 0.9999, and it can also be set according to the actual situation;
  • gainPre(n) = gainPreRaw × (1 - sRel) + gainPre(n-1) × sRel, where sRel is the release smoothing factor, generally set to 0.99, and it can also be set according to the actual situation.
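Both smoothing rules above share the same one-pole form, which can be sketched as below. The conditions that select between sAtt and sRel are not reproduced in this text, so the sketch only implements the recurrence itself and compares how quickly each factor converges; `smooth_gain` and `frames_to_converge` are hypothetical helper names.

```python
def smooth_gain(gain_raw_db, prev_gain_db, factor):
    """gainPre(n) = gainPreRaw * (1 - factor) + gainPre(n-1) * factor."""
    return gain_raw_db * (1.0 - factor) + prev_gain_db * factor


def frames_to_converge(factor, start_db=0.0, target_db=6.0,
                       tol_db=0.5, max_frames=1_000_000):
    """Count frames until the smoothed gain is within tol_db of a constant target."""
    gain = start_db
    frames = 0
    while abs(target_db - gain) > tol_db and frames < max_frames:
        gain = smooth_gain(target_db, gain, factor)
        frames += 1
    return frames


slow = frames_to_converge(0.9999)  # sAtt-style factor: very slow tracking
fast = frames_to_converge(0.99)    # sRel-style factor: much faster tracking
```

A factor closer to 1 holds the previous gain longer, so the 0.9999 setting needs far more frames than 0.99 to approach the same target.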
  • An audio signal to which a dynamic preset gain (first-stage gain) is applied can be obtained.
  • Input gainPre(n), xGainPre(n), energyraw(n) and vad(n) to the secondary gain calculation module to further calculate the second gain gainPost(n) that needs to be applied to the current audio frame at this time.
  • the above gain can be regarded as a gain vector.
  • the audio energy (that is, the second energy limit) after the dynamic preset gain is applied can be obtained according to gainPre(n) and energyraw(n), as shown in the following equation (11):
  • energyGainPreSmooth(n) = energyGainPreSmooth(n-1) × smoothEnergy + energyGainPreRaw × (1 - smoothEnergy)   (12)
  • smoothEnergy represents the smoothing factor
  • energyGainPreSmooth(n) represents the smoothed audio energy of the current audio frame
  • energyGainPreSmooth(n-1) represents the smoothed audio energy of the previous audio frame.
  • energyGainPreSmooth(n-1) may be set to zero.
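The smoothing in equation (12) can be sketched as follows. Equation (11) is not reproduced in this text, so the sketch assumes that, with both quantities in decibels, the energy after the preset gain is the sum of energyraw(n) and gainPre(n). The function name and the smoothEnergy value of 0.9 are illustrative assumptions, since no value is given here.

```python
def smoothed_energy_after_pregain(energy_raw_db, gain_pre_db,
                                  prev_smooth_db, smooth_energy=0.9):
    """Second energy limit after smoothing, following the form of equation (12)."""
    # Assumed form of equation (11): dB-domain energy plus dB gain.
    energy_gain_pre_raw = energy_raw_db + gain_pre_db
    return (prev_smooth_db * smooth_energy
            + energy_gain_pre_raw * (1.0 - smooth_energy))


# A -30 dBFS frame with 6 dB of preset gain, previous smoothed value -24 dBFS.
steady = smoothed_energy_after_pregain(-30.0, 6.0, prev_smooth_db=-24.0)
```

When the gained energy matches the previous smoothed value, the output stays at that value; otherwise it moves a fraction 1 - smoothEnergy of the way toward the new energy each frame.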
  • the expected value of the secondary gain (ie, the initial second gain) of the current audio frame can be calculated, as shown in the following equation (13):
  • similar to the calculation of gainPreRaw, gainPostRaw also needs to be limited to the gain range:
  • the gain vector gainPost(n) of the current audio frame can be obtained from the above gain after gain control processing, and can be expressed in the following form:
  • gainPost(n) = [gainPost_1(n), gainPost_2(n), ..., gainPost_M(n)]^T
  • xGainPre(n) = [xGainPre_1(n), xGainPre_2(n), ..., xGainPre_M(n)]^T
  • i represents the i-th element in the current secondary gain vector
  • gainPost_M(n−1) represents the gain of the M-th sampling point of the previous audio frame (i.e., the last sampling point of the previous audio frame).
  • the secondary gain vector gainPost(n) obtained above is multiplied element by element with xGainPre(n) (note that the gain must first be converted from dBFS to linear gain), and the audio signal xGainPost(n) with the secondary gain applied can be obtained, as shown in equation (15) below:
  • Limiter[*] indicates that the amplitude of the input signal is limited and protected, and y(n) is the final output of a frame of audio signal after AGC processing.
  • FIG. 3 is a block diagram of an audio processing device according to an embodiment of the present disclosure.
  • the audio processing device 300 may include an acquisition module 301 , a determination module 302 , a first gain module 303 , and a second gain module 304 .
  • Each module in the audio processing device 300 may be implemented by one or more modules, and the names of the corresponding modules may vary according to the type of the module. In various embodiments, some modules in the audio processing device 300 may be omitted, or additional modules may also be included. Also, modules/elements according to various embodiments of the present disclosure may be combined to form a single entity, and thus may equivalently perform the functions of the corresponding modules/elements before combination.
  • the acquiring module 301 can acquire the current audio frame to be processed.
  • the determining module 302 can determine the energy and type of the current audio frame, the type including one of a speech frame and a non-speech frame.
  • the determining module 302 can obtain speech energy distribution data for the current audio frame based on the energy and type of the current audio frame, wherein the speech energy distribution data can be used to count the proportions of speech frames in different energy intervals.
  • the first gain module 303 can determine the first gain for the current audio frame according to the speech energy distribution data for the current audio frame, and apply the first gain to the current audio frame to obtain the first audio frame.
  • the second gain module 304 may determine a second gain for the current audio frame based on the first gain and the energy of the current audio frame, and apply the second gain to the first audio frame to obtain the second audio frame.
  • the determination module 302 may use the speech energy distribution data of the previous audio frame of the current audio frame as the speech energy distribution data of the current audio frame.
  • the determination module 302 may update the speech energy distribution data of the previous audio frame based on the energy of the current audio frame, wherein, when the current audio frame is the first frame, the determination module 302 may update the initial speech energy distribution data based on the energy of the current audio frame, the speech frame proportion being uniformly distributed across the energy intervals of the initial speech energy distribution data.
  • the determination module 302 may determine the energy interval to which the energy of the current audio frame belongs in the speech energy distribution data, increase the speech frame proportion of the corresponding energy interval in the speech energy distribution data of the previous audio frame, and decrease the speech frame proportions of the other energy intervals in the speech energy distribution data of the previous audio frame.
  • the determination module 302 can calculate the sum of the speech frame proportions of the energy intervals in the updated speech energy distribution data, determine the residual probability by comparing that sum with the preset value, and distribute the residual probability across the energy intervals of the updated speech energy distribution data until the sum of the speech frame proportions of the energy intervals equals the preset value.
  • the preset value can be set to 1.
  • the first gain module 303 may accumulate the speech frame proportions of the energy intervals in order, starting from the first energy interval of the speech energy distribution data of the current audio frame, until the accumulated sum is equal to or greater than a preset threshold. When the accumulated sum equals the preset threshold, the first gain module 303 may use the upper limit of the energy interval at which the accumulated sum reached the preset threshold as the first energy limit. When the accumulated sum is greater than the preset threshold, the first gain module 303 may use the lower limit of the energy interval at which the accumulated sum exceeded the preset threshold as the first energy limit. The first gain module 303 can then determine the first gain according to the target energy of the current audio frame and the first energy limit.
  • the first gain module 303 can determine the initial first gain according to the target energy of the current audio frame and the first energy limit, determine the frame count corresponding to the current audio frame according to the type of the current audio frame, adjust the initial first gain by comparing the frame count corresponding to the current audio frame with the preset frame count, and use the adjusted initial first gain as the first gain.
  • the second gain module 304 may determine the second energy limit based on the first gain and the energy of the current audio frame, determine the initial second gain according to the target energy of the current audio frame and the second energy limit, and obtain the second gain vector based on the audio sampling points in the current audio frame and the initial second gain.
  • the second gain module 304 can calculate the gain for each audio sampling point in the current audio frame based on the gain of the last audio sampling point in the previous audio frame of the current audio frame and the initial second gain, so as to generate a second gain vector.
  • the second gain module 304 can apply each gain in the second gain vector to corresponding audio sampling points of the first audio frame to obtain a second audio frame, and perform clipping processing on the amplitude of the second audio frame.
  • Fig. 4 is a schematic structural diagram of an audio processing device in a hardware operating environment according to an embodiment of the present disclosure.
  • the audio processing device 400 may include: a processing component 401 , a communication bus 402 , a network interface 403 , an input and output interface 404 , a memory 405 and a power supply component 404 .
  • the communication bus 402 is used to realize connection and communication between these components.
  • the input and output interface 404 may include a video display (such as a liquid crystal display), a microphone and a speaker, and a user interaction interface (such as a keyboard, mouse, touch input device, etc.), and in some embodiments, the input and output interface 404 may also include a standard wired interface and a wireless interface.
  • the network interface 403 may include a standard wired interface and a wireless interface (such as a Wi-Fi interface).
  • the memory 405 can be a high-speed random access memory, or a stable non-volatile memory.
  • the memory 405 may also be a storage device independent of the aforementioned processing component 401.
  • the structure shown in FIG. 4 does not constitute a limitation on the audio processing device 400, which may include more or fewer components than shown in the figure, combine some components, or arrange the components differently.
  • memory 405 as a storage medium may include an operating system (such as MAC operating system), a data storage module, a network communication module, a user interface module, an audio processing program, and a database.
  • the network interface 403 is mainly used for data communication with external electronic devices/terminals; the input and output interface 404 is mainly used for data interaction with the user; the processing component 401 and the memory 405 can be provided in the audio processing device 400, and the audio processing device 400 calls, through the processing component 401, the audio processing program, materials and various APIs stored in the memory 405, and executes the audio processing method provided by the embodiment of the present disclosure.
  • the processing component 401 may include at least one processor, and a set of computer-executable instructions is stored in the memory 405.
  • when the set of computer-executable instructions is executed by the at least one processor, the audio processing method according to the embodiment of the present disclosure is executed.
  • the above-mentioned examples are only exemplary, and the present disclosure is not limited thereto.
  • the processing component 401 can obtain the current audio frame to be processed, determine the energy and type of the current audio frame, obtain the speech energy distribution data for the current audio frame based on the energy and type of the current audio frame, determine a first gain for the current audio frame according to that speech energy distribution data, apply the first gain to the current audio frame to obtain the first audio frame, then determine a second gain for the current audio frame based on the first gain and the energy of the current audio frame, and apply the second gain to the first audio frame to obtain a second audio frame.
  • the processing component 401 can realize control of components included in the audio processing device 400 by executing a program.
  • the audio processing device 400 may receive or output video and/or audio via the input-output interface 404 .
  • the audio processing device 400 can output the audio signal after the gain is applied via the input and output interface 404 .
  • the audio processing device 400 may be a PC computer, a tablet device, a personal digital assistant, a smart phone, or other devices capable of executing the above-mentioned set of instructions.
  • the audio processing device 400 does not have to be a single electronic device, but can also be any assembly of devices or circuits capable of individually or jointly executing the above-mentioned instructions (or instruction sets).
  • the audio processing device 400 may also be part of an integrated control system or system manager, or may be configured as a portable electronic device that interfaces locally or remotely (eg, via wireless transmission).
  • the processing component 401 may include a central processing unit (CPU), a graphics processing unit (GPU), a programmable logic device, a special purpose processor system, a microcontroller or a microprocessor.
  • the processing component 401 may also include analog processors, digital processors, microprocessors, multi-core processors, processor arrays, network processors, and the like.
  • Processing component 401 may execute instructions or codes stored in memory, where memory 405 may also store data. Instructions and data can also be sent and received via the network via the network interface 403, wherein the network interface 403 can adopt any known transmission protocol.
  • the memory 405 may be integrated with the processing component 401, for example, by placing RAM or flash memory within an integrated circuit microprocessor or the like. Additionally, the memory 405 may comprise a separate device, such as an external disk drive, storage array, or any other storage device usable by the database system.
  • the memory 405 and the processing component 401 may be operatively coupled, or may communicate with each other, e.g., through I/O ports, network connections, etc., such that the processing component 401 can read data stored in the memory 405.
  • an electronic device may include at least one memory 502 and at least one processor 501.
  • the at least one memory 502 stores a set of computer-executable instructions.
  • when the set of computer-executable instructions is executed by the at least one processor 501, the audio processing method according to the embodiment of the present disclosure is executed.
  • Processor 501 may include a central processing unit (CPU), a graphics processing unit (GPU), a programmable logic device, a special purpose processor system, a microcontroller, or a microprocessor.
  • the processor 501 may also include an analog processor, a digital processor, a microprocessor, a multi-core processor, a processor array, a network processor, and the like.
  • the memory 502 as a storage medium may include an operating system (eg, MAC operating system), a data storage module, a network communication module, a user interface module, an audio processing program, and a database.
  • the memory 502 can be integrated with the processor 501, for example, RAM or flash memory can be arranged in an integrated circuit microprocessor or the like. Additionally, memory 502 may comprise a separate device, such as an external disk drive, storage array, or any other storage device usable by the database system. Memory 502 and processor 501 may be operatively coupled, or may communicate with each other, such as through an I/O port, network connection, etc., such that processor 501 can read files stored in memory 502 .
  • the electronic device 500 may further include a video display (such as a liquid crystal display) and a user interaction interface (such as a keyboard, mouse, touch input device, etc.). All components of the electronic device 500 may be connected to each other via a bus and/or a network.
  • the structure shown in FIG. 5 does not constitute a limitation, and the electronic device 500 may include more or fewer components than shown in the figure, combine some components, or arrange the components differently.
  • a computer-readable storage medium storing instructions, wherein, when the instructions are executed by at least one processor, at least one processor is prompted to execute the audio processing method according to the present disclosure.
  • Examples of computer readable storage media herein include: Read Only Memory (ROM), Random Access Programmable Read Only Memory (PROM), Electrically Erasable Programmable Read Only Memory (EEPROM), Random Access Memory (RAM) , Dynamic Random Access Memory (DRAM), Static Random Access Memory (SRAM), Flash Memory, Non-volatile Memory, CD-ROM, CD-R, CD+R, CD-RW, CD+RW, DVD-ROM , DVD-R, DVD+R, DVD-RW, DVD+RW, DVD-RAM, BD-ROM, BD-R, BD-R LTH, BD-RE, Blu-ray or Optical Memory, Hard Disk Drive (HDD), Solid State Hard disks (SSD), memory cards (such as MultiMediaCards, Secure Digital (SD) or Extreme Digital (
  • the computer program in the above-mentioned computer-readable storage medium can run in an environment deployed in computer equipment such as a client, a host, an agent device, a server, etc.
  • the computer program and any associated data and data files and data structures are distributed over network-connected computer systems so that the computer programs and any associated data, data files and data structures are stored, accessed and executed in a distributed fashion by one or more processors or computers.
  • a computer program product may also be provided, and instructions in the computer program product may be executed by a processor of a computer device to complete the above audio processing method.

Landscapes

  • Engineering & Computer Science (AREA)
  • Multimedia (AREA)
  • Circuit For Audible Band Transducer (AREA)

Abstract

An audio processing method and an audio processing apparatus. The audio processing method comprises the following steps: acquiring the current audio frame to be processed; determining the energy and the type of the current audio frame, wherein the type comprises one of a voice frame and a non-voice frame; on the basis of the energy and the type of the current audio frame, obtaining voice energy distribution data for the current audio frame, wherein the voice energy distribution data is used for compiling statistics on proportions occupied by voice frames of different energy intervals; according to the voice energy distribution data for the current audio frame, determining a first gain for the current audio frame; and applying the first gain to the current audio frame to obtain a first audio frame.

Description

Audio processing method and audio processing device
This disclosure is based on, and claims priority to, Chinese patent application No. 202111465600.1, filed on December 3, 2021, the entire content of which is incorporated herein by reference.
Technical field
The present disclosure relates to the field of audio technology, and in particular to an audio processing method and an audio processing device for automatic gain control.
Background
Automatic Gain Control (AGC) is a key technology in audio processing and is widely used in real-time communication and other fields. Its basic purpose is to apply gains of different magnitudes to an audio signal according to the volume of the input signal, so that the volume of the output signal stays within a certain range and the sound does not become too loud or too quiet because of volume differences between speakers or varying distance from the device. AGC places high demands on both the volume control capability for the output audio and the quality of the processed audio. Volume control capability is mainly reflected in the gain convergence time (the time needed to arrive at a reasonable output volume for a segment of audio with stable volume) and the gain control range (the range over which the gain varies). Audio quality is mainly reflected in scores on objective evaluation metrics such as Perceptual Evaluation of Speech Quality (PESQ) and Perceptual Objective Listening Quality Analysis (POLQA). However, existing AGC techniques have difficulty balancing volume control capability against the quality of the processed audio.
Summary
The present disclosure provides an audio processing method and an audio processing device.
According to a first aspect of embodiments of the present disclosure, an audio processing method is provided, which may include: acquiring a current audio frame to be processed; determining the energy and type of the current audio frame, the type being one of a speech frame and a non-speech frame; obtaining speech energy distribution data for the current audio frame based on the energy and type of the current audio frame, wherein the speech energy distribution data is used to record the proportions of speech frames in different energy intervals; determining a first gain for the current audio frame according to the speech energy distribution data for the current audio frame; and applying the first gain to the current audio frame to obtain a first audio frame.
In some embodiments, obtaining the speech energy distribution data for the current audio frame based on the energy and type of the current audio frame may include: when the energy of the current audio frame is less than a preset noise threshold or the current audio frame is a non-speech frame, using the speech energy distribution data of the previous audio frame as the speech energy distribution data of the current audio frame; and when the energy of the current audio frame is greater than or equal to the preset noise threshold and the current audio frame is a speech frame, updating the speech energy distribution data of the previous audio frame based on the energy of the current audio frame, wherein, when the current audio frame is the first frame, initial speech energy distribution data is updated based on the energy of the current audio frame, the speech frame proportion being uniformly distributed across the energy intervals of the initial speech energy distribution data.
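The gating rule in this embodiment can be sketched in Python. This is a minimal illustration under stated assumptions, not the disclosed implementation: the names `initial_distribution` and `should_update`, the count of 8 energy intervals, and the -60 dBFS noise threshold are all hypothetical values chosen for the example.

```python
NUM_BINS = 8            # hypothetical number of energy intervals
NOISE_GATE_DB = -60.0   # hypothetical preset noise threshold, in dBFS


def initial_distribution(num_bins: int = NUM_BINS) -> list:
    """Uniform starting distribution: every energy interval gets the same share."""
    return [1.0 / num_bins] * num_bins


def should_update(energy_db: float, is_speech: bool,
                  noise_gate_db: float = NOISE_GATE_DB) -> bool:
    """True only for speech frames whose energy reaches the noise threshold.

    When this returns False, the previous frame's distribution is reused as is.
    """
    return is_speech and energy_db >= noise_gate_db
```

A caller would keep the previous distribution whenever `should_update` returns `False`, which matches the "non-speech frame or energy below the threshold" branch of the paragraph above.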
In some embodiments, updating the speech energy distribution data of the previous audio frame based on the energy of the current audio frame may include: determining the energy interval to which the energy of the current audio frame belongs in the speech energy distribution data; increasing the speech frame proportion of the corresponding energy interval in the speech energy distribution data of the previous audio frame; and decreasing the speech frame proportions of the other energy intervals in the speech energy distribution data of the previous audio frame.
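One way to realize "increase the matching interval, decrease the others" is an exponential-forgetting update, sketched below. The text does not state the update formula, so the forgetting factor `forget`, the equal-width intervals between `low_db` and `high_db`, and both function names are assumptions for illustration only.

```python
def find_interval(energy_db: float, low_db: float, high_db: float,
                  num_bins: int) -> int:
    """Index of the equal-width energy interval containing energy_db (clamped)."""
    width = (high_db - low_db) / num_bins
    index = int((energy_db - low_db) // width)
    return min(max(index, 0), num_bins - 1)


def update_distribution(dist, energy_db, low_db=-60.0, high_db=0.0, forget=0.99):
    """Raise the share of the interval hit by this frame and lower all others.

    Every share is scaled by `forget`; the hit interval additionally
    receives the freed mass (1 - forget).
    """
    hit = find_interval(energy_db, low_db, high_db, len(dist))
    return [forget * p + (1.0 - forget) if i == hit else forget * p
            for i, p in enumerate(dist)]
```

With this particular rule the shares keep summing to 1 whenever the input does; other update rules may not have that property, which is where a renormalization step becomes necessary.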
In some embodiments, updating the speech energy distribution data of the previous audio frame based on the energy of the current audio frame may include: calculating the sum of the speech frame proportions of the energy intervals in the updated speech energy distribution data; determining a residual probability by comparing that sum with a preset value; and distributing the residual probability across the energy intervals of the updated speech energy distribution data until the sum of the speech frame proportions of the energy intervals equals the preset value.
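The residual step can be illustrated with a short sketch. It assumes the residual is split equally among the intervals, which the text does not specify; the function name `redistribute_residual` is hypothetical.

```python
def redistribute_residual(dist, preset=1.0):
    """Spread the gap between `preset` and the current sum equally over all intervals."""
    residual = preset - sum(dist)   # positive if shares sum to less than preset
    share = residual / len(dist)
    return [p + share for p in dist]
```

With an equal split, a single pass brings the sum to the preset value up to floating-point rounding. A production version would probably also guard against shares going negative when the residual is negative.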
In some embodiments, determining the first gain for the current audio frame according to the speech energy distribution data for the current audio frame may include: accumulating the speech frame proportions of the energy intervals in order, starting from the first energy interval of the speech energy distribution data for the current audio frame, until the accumulated sum is equal to or greater than a preset threshold; when the accumulated sum equals the preset threshold, taking the upper limit of the energy interval at which the accumulated sum reached the preset threshold as a first energy limit; when the accumulated sum is greater than the preset threshold, taking the lower limit of the energy interval at which the accumulated sum exceeded the preset threshold as the first energy limit; and determining the first gain according to a target energy of the current audio frame and the first energy limit.
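The cumulative search for the first energy limit can be sketched as follows. Several details are assumptions: intervals are described by an ascending list of edges `edges_db`, equality is tested with a floating-point tolerance, the gain is computed in decibels as target minus limit, and the clamp bounds are hypothetical.

```python
import math


def first_energy_limit(dist, edges_db, threshold):
    """Walk the intervals in order; interval k spans edges_db[k]..edges_db[k + 1]."""
    total = 0.0
    for k, share in enumerate(dist):
        total += share
        if math.isclose(total, threshold):
            return edges_db[k + 1]   # sum equals threshold: upper limit
        if total > threshold:
            return edges_db[k]       # sum exceeds threshold: lower limit
    return edges_db[-1]              # threshold never reached


def initial_first_gain(target_db, limit_db, min_gain_db=-12.0, max_gain_db=30.0):
    """Gain in dB that moves the energy limit to the target, clamped to a range."""
    return min(max(target_db - limit_db, min_gain_db), max_gain_db)
```

For example, with shares `[0.1, 0.2, 0.3, 0.4]` over edges `[-60, -45, -30, -15, 0]`, a threshold of 0.5 is first exceeded in the third interval, so its lower limit of -30 is used.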
In some embodiments, determining the first gain according to the target energy of the current audio frame and the first energy limit may include: determining an initial first gain according to the target energy of the current audio frame and the first energy limit; determining a frame count corresponding to the current audio frame according to the type of the current audio frame; and adjusting the initial first gain by comparing the frame count corresponding to the current audio frame with a preset frame count, and using the adjusted initial first gain as the first gain.
In some embodiments, the audio processing method may further include: determining a second energy limit based on the first gain and the energy of the current audio frame; determining an initial second gain according to the target energy of the current audio frame and the second energy limit; obtaining a second gain vector based on the audio sampling points in the current audio frame and the initial second gain; and applying the second gain to the first audio frame to obtain a second audio frame.
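A rough sketch of the second-stage computation follows. Only the smoothing recursion for energyGainPreSmooth(n) is taken from the text; the other steps are assumptions made so the example is complete: energies and gains are in decibels, the energy after the first gain is `energy_raw_db + gain_pre_db`, the initial second gain is target minus smoothed energy, and the clamp bounds and smoothing factor are hypothetical.

```python
def second_stage_initial_gain(energy_raw_db, gain_pre_db, prev_smooth_db,
                              target_db, smooth_energy=0.9,
                              min_gain_db=-12.0, max_gain_db=12.0):
    """Return (smoothed energy after first gain, clamped initial second gain)."""
    # Assumed form: energy after applying the first-stage gain, in dB.
    energy_gain_pre_raw = energy_raw_db + gain_pre_db
    # One-pole smoothing across frames, matching the recursion in the text.
    energy_smooth = (prev_smooth_db * smooth_energy
                     + energy_gain_pre_raw * (1.0 - smooth_energy))
    # Assumed form: gain needed to reach the target, limited to a range.
    gain_post_raw = target_db - energy_smooth
    gain_post_raw = min(max(gain_post_raw, min_gain_db), max_gain_db)
    return energy_smooth, gain_post_raw
```

The returned smoothed energy would be fed back as `prev_smooth_db` on the next frame.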
In some embodiments, obtaining the second gain vector based on the audio sampling points in the current audio frame and the initial second gain may include: calculating a gain for each audio sampling point in the current audio frame based on the gain of the last audio sampling point in the previous audio frame of the current audio frame and the initial second gain, so as to generate the second gain vector.
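One plausible way to build a per-sample gain vector from the previous frame's last gain and the initial second gain is a linear ramp, sketched here. The interpolation rule is an assumption, since the text only names the two inputs; `second_gain_vector` is a hypothetical name.

```python
def second_gain_vector(prev_last_gain_db, initial_gain_db, num_samples):
    """Ramp linearly from the previous frame's last gain to the new initial gain.

    Element i (1-based) moves i/num_samples of the way, so the final
    sample lands on initial_gain_db and there is no jump at the frame boundary.
    """
    step = (initial_gain_db - prev_last_gain_db) / num_samples
    return [prev_last_gain_db + step * i for i in range(1, num_samples + 1)]
```

Ramping inside the frame avoids an audible step when the gain changes between consecutive frames, which is the usual motivation for carrying the last sample's gain forward.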
In some embodiments, applying the second gain to the first audio frame to obtain the second audio frame may include: applying each gain in the second gain vector to the corresponding audio sampling point of the first audio frame to obtain the second audio frame; and performing amplitude limiting on the second audio frame.
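Applying the gain vector and limiting the result can be sketched as below. It assumes gains are in decibels, samples are floats normalized to [-1, 1], and a hard clip stands in for the limiter, whose actual design the text does not describe.

```python
def apply_gain_and_limit(frame, gains_db, ceiling=1.0):
    """Scale each sample by its own gain (dB converted to linear), then hard-clip."""
    if len(frame) != len(gains_db):
        raise ValueError("frame and gain vector must have the same length")
    out = []
    for sample, gain_db in zip(frame, gains_db):
        scaled = sample * 10.0 ** (gain_db / 20.0)   # dB -> linear amplitude
        out.append(max(-ceiling, min(ceiling, scaled)))
    return out
```

A real limiter would typically use look-ahead and smooth gain reduction rather than hard clipping, which introduces distortion on loud peaks.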
According to a second aspect of embodiments of the present disclosure, an audio processing device is provided, which may include: an acquisition module configured to acquire a current audio frame to be processed; a determination module configured to determine the energy and type of the current audio frame, the type being one of a speech frame and a non-speech frame, and to obtain speech energy distribution data for the current audio frame based on the energy and type of the current audio frame, wherein the speech energy distribution data is used to record the proportions of speech frames in different energy intervals; and a first gain module configured to determine a first gain for the current audio frame according to the speech energy distribution data for the current audio frame, and to apply the first gain to the current audio frame to obtain a first audio frame.
In some embodiments, the determination module may be configured to: when the energy of the current audio frame is less than a preset noise threshold or the current audio frame is a non-speech frame, use the speech energy distribution data of the previous audio frame as the speech energy distribution data of the current audio frame; and when the energy of the current audio frame is greater than or equal to the preset noise threshold and the current audio frame is a speech frame, update the speech energy distribution data of the previous audio frame based on the energy of the current audio frame, wherein, when the current audio frame is the first frame, initial speech energy distribution data is updated based on the energy of the current audio frame, the speech frame proportion being uniformly distributed across the energy intervals of the initial speech energy distribution data.
In some embodiments, the determination module may be configured to: determine the energy interval to which the energy of the current audio frame belongs in the speech energy distribution data; increase the speech frame proportion of the corresponding energy interval in the speech energy distribution data of the previous audio frame; and decrease the speech frame proportions of the other energy intervals in the speech energy distribution data of the previous audio frame.
In some embodiments, the determination module may be configured to: calculate the sum of the speech frame proportions of the energy intervals in the updated speech energy distribution data; determine a residual probability by comparing that sum with a preset value; and distribute the residual probability across the energy intervals of the updated speech energy distribution data until the sum of the speech frame proportions of the energy intervals equals the preset value.
In some embodiments, the first gain module may be configured to: accumulate the speech frame proportions of the energy intervals in order, starting from the first energy interval of the speech energy distribution data for the current audio frame, until the accumulated sum is equal to or greater than a preset threshold; when the accumulated sum equals the preset threshold, take the upper limit of the energy interval at which the accumulated sum reached the preset threshold as a first energy limit; when the accumulated sum is greater than the preset threshold, take the lower limit of the energy interval at which the accumulated sum exceeded the preset threshold as the first energy limit; and determine the first gain according to a target energy of the current audio frame and the first energy limit.
In some embodiments, the first gain module may be configured to: determine an initial first gain according to the target energy of the current audio frame and the first energy limit; determine a frame count corresponding to the current audio frame according to the type of the current audio frame; and adjust the initial first gain by comparing the frame count corresponding to the current audio frame with a preset frame count, and use the adjusted initial first gain as the first gain.
In some embodiments, the audio processing device may further include a second gain module configured to: determine a second energy limit based on the first gain and the energy of the current audio frame; determine an initial second gain according to the target energy of the current audio frame and the second energy limit; obtain a second gain vector based on the audio sampling points in the current audio frame and the initial second gain; and apply the second gain to the first audio frame to obtain a second audio frame.
In some embodiments, the second gain module may be configured to: calculate a gain for each audio sampling point in the current audio frame based on the gain of the last audio sampling point in the previous audio frame of the current audio frame and the initial second gain, so as to generate the second gain vector.
In some embodiments, the second gain module may be configured to: apply each gain in the second gain vector to the corresponding audio sampling point of the first audio frame to obtain the second audio frame; and perform amplitude limiting on the second audio frame.
According to a third aspect of the embodiments of the present disclosure, an electronic device is provided. The electronic device may include: at least one processor; and at least one memory storing computer-executable instructions, wherein the computer-executable instructions, when run by the at least one processor, cause the at least one processor to perform the audio processing method described above.
According to a fourth aspect of the embodiments of the present disclosure, a computer-readable storage medium storing instructions is provided, wherein the instructions, when run by at least one processor, cause the at least one processor to perform the audio processing method described above.
According to a fifth aspect of the embodiments of the present disclosure, a computer program product is provided, wherein instructions in the computer program product are run by at least one processor in an electronic apparatus to perform the audio processing method described above.
The technical solutions provided by the embodiments of the present disclosure bring at least the following beneficial effects:
Speech volume can be controlled more effectively, with a shorter gain convergence time and a wider gain control range. A relatively stable gain is achieved while good volume control is maintained, and higher-quality audio is obtained. In addition, the dynamic preset gain is determined more accurately by using the energy distribution data of the input speech, so that speech of higher quality is obtained.
It should be understood that the foregoing general description and the following detailed description are merely exemplary and explanatory, and do not limit the present disclosure.
Brief Description of the Drawings
The accompanying drawings are incorporated into and constitute a part of this specification. They illustrate embodiments consistent with the present disclosure and, together with the specification, serve to explain the principles of the present disclosure; they do not unduly limit the present disclosure.
FIG. 1 is a flowchart of an audio processing method according to an embodiment of the present disclosure;
FIG. 2 is a schematic flow diagram of an audio processing method according to an embodiment of the present disclosure;
FIG. 3 is a block diagram of an audio processing apparatus according to an embodiment of the present disclosure;
FIG. 4 is a schematic structural diagram of an audio processing device according to an embodiment of the present disclosure;
FIG. 5 is a block diagram of an electronic device according to an embodiment of the present disclosure.
Throughout the drawings, the same reference numerals are used to denote the same or similar elements, features, and structures.
Detailed Description
To enable those of ordinary skill in the art to better understand the technical solutions of the present disclosure, the technical solutions in the embodiments of the present disclosure are described clearly and completely below with reference to the accompanying drawings.
The following description with reference to the accompanying drawings is provided to assist in a comprehensive understanding of the embodiments of the present disclosure as defined by the claims and their equivalents. It includes various specific details to assist in that understanding, but these details are to be regarded as merely exemplary. Accordingly, those of ordinary skill in the art will recognize that various changes and modifications may be made to the embodiments described herein without departing from the scope and spirit of the present disclosure. In addition, descriptions of well-known functions and structures are omitted for clarity and conciseness.
The terms and words used in the following description and claims are not limited to their literal meanings, but are merely used by the inventors to enable a clear and consistent understanding of the present disclosure. Accordingly, it should be apparent to those skilled in the art that the following description of various embodiments of the present disclosure is provided for illustration only and not for the purpose of limiting the present disclosure as defined by the claims and their equivalents.
It should be noted that the terms "first", "second", and the like in the specification, the claims, and the above drawings of the present disclosure are used to distinguish similar objects, and are not necessarily used to describe a specific order or sequence. It should be understood that data so used are interchangeable under appropriate circumstances, so that the embodiments of the present disclosure described herein can be practiced in orders other than those illustrated or described herein. The implementations described in the following exemplary embodiments do not represent all implementations consistent with the present disclosure. Rather, they are merely examples of apparatuses and methods consistent with some aspects of the present disclosure as recited in the appended claims.
An existing AGC algorithm may preset a fixed, relatively high gain for the input audio and protect the audio with amplitude limiting. However, when a high gain is applied to audio that is already loud, the amplitude limiting module severely distorts the audio waveform, and high sound quality is difficult to guarantee. Alternatively, an AGC algorithm may refer to the audio energy over a certain period and calculate a reasonable gain to be applied to the audio at present. However, because the volume of the input audio varies over both the short term and the long term, this approach usually causes either excessive gain variation or an insensitive volume response, so that a segment of audio needs a long time to reach a reasonable gain. It is therefore difficult to balance volume control capability against the sound quality of the processed audio.
Unlike these common AGC schemes, the present disclosure proposes an AGC method that combines a dynamic preset gain with short-term energy gain control. The method gives the algorithm strong audio gain control capability while preserving sound quality, and avoids problems such as slow gain convergence or a gain that fluctuates between too high and too low.
Hereinafter, the method, apparatus, and system of the present disclosure are described in detail with reference to the accompanying drawings, according to various embodiments of the present disclosure.
FIG. 1 is a flowchart of an audio processing method according to an embodiment of the present disclosure. The audio processing method according to the present disclosure can realize automatic gain control with high sound quality.
The audio processing method according to the present disclosure may be executed by any electronic device having an audio processing function. The electronic device may be at least one of a smartphone, a tablet computer, a portable computer, a desktop computer, and the like. A target application for performing automatic gain control on input audio may be installed on the electronic device.
Referring to FIG. 1, in step S101, a current audio frame to be processed is acquired. Input audio to be processed may first be divided into frames, and the operations described below are then performed on each audio frame. Here, each audio frame may include a number of audio sampling points. For example, one audio frame may contain the signal sampling points within a period of 10 to 25 ms.
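The framing step can be sketched as follows. This is a minimal illustration, not the disclosed implementation: the function name `split_into_frames`, the non-overlapping layout, and the choice to drop a trailing partial frame are all assumptions made for the example.

```python
from typing import List, Sequence


def split_into_frames(samples: Sequence[float],
                      sample_rate: int,
                      frame_ms: float = 10.0) -> List[List[float]]:
    """Split a mono signal into consecutive, non-overlapping frames.

    frame_ms would typically lie in the 10-25 ms range mentioned in the text.
    """
    frame_len = int(sample_rate * frame_ms / 1000)
    if frame_len <= 0:
        raise ValueError("frame length must be at least one sample")
    # Assumption: a trailing partial frame is dropped (padding is another option).
    n_frames = len(samples) // frame_len
    return [list(samples[i * frame_len:(i + 1) * frame_len])
            for i in range(n_frames)]
```

For example, 100 ms of audio at 16 kHz split with `frame_ms=10.0` gives 10 frames of 160 samples each.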
In step S102, the energy and the type of the current audio frame are determined, where the type is one of a speech frame and a non-speech frame. For example, a voice activity detection algorithm may be used to detect whether the current audio frame is a speech frame or a non-speech frame (non-speech frames include noise, silence, and the like).
In step S103, speech energy distribution data for the current audio frame is obtained based on the energy and the type of the current audio frame, where the speech energy distribution data records the proportion of speech frames falling in each energy interval. The speech energy distribution data may be represented as a histogram; for example, a speech energy histogram may show the proportion of speech frames in each energy interval.
In some embodiments, when the energy of the current audio frame is less than a preset noise threshold or the current audio frame is a non-speech frame, the speech energy distribution data of the audio frame preceding the current audio frame may be used as the speech energy distribution data of the current audio frame. When the energy of the current audio frame is greater than or equal to the preset noise threshold and the current audio frame is a speech frame, the speech energy distribution data of the preceding audio frame may be updated based on the energy of the current audio frame. Here, the preset noise threshold may be set differently according to the actual situation.
Here, when the current audio frame is the first frame of the input audio, initial speech energy distribution data may be updated based on the energy of the current audio frame, the speech frame proportions being uniformly distributed over the energy intervals of the initial speech energy distribution data. For example, the initial speech energy distribution data may be divided into a number of energy intervals, with the initial probabilities of the energy intervals set to a uniform distribution and summing to 1.
When the speech energy distribution data is updated for the current audio frame, the energy interval to which the energy of the current audio frame belongs in the speech energy distribution data may first be determined. The speech frame ratio of the corresponding energy interval in the speech energy distribution data of the previous audio frame is then increased, and the speech frame ratios of the energy intervals that do not correspond to the determined energy interval are decreased.
After the speech frame ratios of the energy intervals have been increased or decreased in this way, all probabilities in the speech energy distribution data must still sum to 1. The sum of the speech frame ratios of the energy intervals in the updated speech energy distribution data is therefore calculated, a residual probability is determined by comparing this sum with a preset value, and the residual probability is then distributed across the energy intervals of the updated speech energy distribution data until the speech frame ratios of all energy intervals sum to the preset value. For example, the preset value may be 1, but the present disclosure is not limited thereto.
In step S104, a first gain for the current audio frame is determined according to the speech energy distribution data for the current audio frame. In the present disclosure, the first gain may also be referred to as the first-stage gain. As an example, the speech frame ratios of the energy intervals are accumulated in order, starting from the first energy interval of the speech energy distribution data for the current audio frame, until the accumulated sum is equal to or greater than a preset threshold. When the accumulated sum is equal to the preset threshold, the upper bound of the energy interval at which the accumulated sum reaches the preset threshold is taken as the first energy limit; when the accumulated sum is greater than the preset threshold, the lower bound of the energy interval at which the accumulated sum exceeds the preset threshold is taken as the first energy limit.
Next, an initial first gain is determined according to the target energy of the current audio frame and the first energy limit. A frame count corresponding to the current audio frame is determined according to the type of the current audio frame, the initial first gain is adjusted by comparing this frame count with a preset frame count, and the adjusted initial first gain is used as the first gain. Here, the preset threshold and the target energy may be set differently according to the actual situation.
In addition, before the initial first gain is adjusted to obtain the first gain, gain range control may first be applied to the initial first gain so that it meets practical requirements.
In step S105, the first gain is applied to the current audio frame to obtain a first audio frame. Once the first gain has been obtained, it may be applied to the original current audio frame to obtain the first audio signal.
According to an embodiment of the present disclosure, after the first gain has been applied to the original current audio frame, a second gain may further be applied to the audio frame to which the first gain has been applied; that is, the final audio signal is obtained by combining two stages of gain. In the present disclosure, the second gain may also be referred to as the second-stage gain.
As an example, the second gain for the current audio frame may be determined based on the first gain and the energy of the current audio frame, and the second gain may then be applied to the first audio frame to obtain a second audio frame.
For example, a second energy limit may be determined based on the first gain and the energy of the current audio frame, and an initial second gain may be determined according to the target energy of the current audio frame and the second energy limit. In addition, the second energy limit may first be smoothed, and the initial second gain may then be determined using the smoothed second energy limit and the target energy of the current audio frame. The initial second gain is expanded into a second gain vector over the audio sampling points of the current audio frame. Here, a gain for each audio sampling point in the current audio frame may be calculated based on the gain of the last audio sampling point in the audio frame preceding the current audio frame and on the initial second gain, so as to generate the second gain vector. In addition, before the second gain vector is generated, gain range control may first be applied to the initial second gain so that it meets practical requirements.
Next, each gain in the second gain vector may be applied to the corresponding audio sampling point of the first audio frame to obtain the second audio frame, and the amplitude of the second audio frame may then be limited to output the final audio signal. The audio processing method according to an embodiment of the present disclosure is described in more detail below with reference to FIG. 2.
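The per-sample second gain and the final limiting step can be sketched as below. The text only states that each sample's gain depends on the last sample gain of the previous frame and on the initial second gain; the linear ramp in dB, the dB-to-amplitude conversion, the hard clip standing in for the limiter, and the names `second_gain_vector` and `apply_gain_and_limit` are all assumptions chosen for illustration.

```python
from typing import List, Sequence


def second_gain_vector(prev_last_gain_db: float,
                       target_gain_db: float,
                       num_samples: int) -> List[float]:
    """Ramp linearly (in dB) from the previous frame's last gain to the target.

    Assumption: a linear ramp; the disclosure does not fix the interpolation.
    """
    if num_samples <= 0:
        raise ValueError("num_samples must be positive")
    step = (target_gain_db - prev_last_gain_db) / num_samples
    return [prev_last_gain_db + step * (k + 1) for k in range(num_samples)]


def apply_gain_and_limit(frame: Sequence[float],
                         gains_db: Sequence[float],
                         limit: float = 1.0) -> List[float]:
    """Apply one gain per sample, then hard-clip to [-limit, limit].

    Assumption: a hard clip is used here only as the simplest stand-in
    for the limiter module.
    """
    if len(frame) != len(gains_db):
        raise ValueError("frame and gain vector must have the same length")
    out = []
    for sample, gain_db in zip(frame, gains_db):
        scaled = sample * 10.0 ** (gain_db / 20.0)  # dB -> linear amplitude
        out.append(max(-limit, min(limit, scaled)))
    return out
```

Ramping the gain across the frame avoids a step change in gain at frame boundaries, which is the apparent motivation for using a gain vector rather than a single scalar per frame.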
FIG. 2 is a schematic flow diagram of an audio processing method according to an embodiment of the present disclosure. According to an embodiment of the present disclosure, the input audio may be divided into frames, and the audio processing method shown in FIG. 2 may then be applied to each audio frame of the input audio.
In the audio processing flow of the present disclosure, the audio processing method may be implemented using a speech energy calculation module, a voice activity detection module, a speech energy histogram statistics module, a dynamic preset gain (first-stage gain) calculation module, a second-stage gain calculation module, and a limiter module.
For example, the speech energy calculation module calculates the energy of the current input audio. The voice activity detection module determines whether the audio at the current time is in a speech phase or a non-speech phase (such as noise or silence). The speech energy histogram statistics module collects statistics on the speech energy distribution over a past period. The dynamic preset gain (first-stage gain) calculation module calculates, from the speech energy distribution data and the voice activity detection result, the dynamic preset gain (that is, the first-stage gain) to be applied to the input audio at present, and applies this gain to the current input audio. The second-stage gain calculation module further adjusts the audio gain on the basis of the audio to which the first-stage gain has been applied. The limiter module protects the audio from clipping distortion in certain extreme cases.
Referring to FIG. 2, the input audio is divided into frames. The current audio frame (assumed to be the n-th frame) is denoted x(n), where n∈N. Each audio frame may contain the signal sampling points within a period of 10 to 25 ms; that is, x(n) is a vector made up of a number of audio sampling points.
x(n) is input to the speech energy calculation module, and the energy of the n-th audio frame may be calculated according to equation (1):
[Equation (1): published as image PCTCN2022107039-appb-000001; not reproduced in this text]
Here, M is the number of audio sampling points contained in x(n). The energy is expressed in dBFS, and the result may take values in the range (-∞, 0].
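Because the image of equation (1) is not legible in this text, the following is only a sketch of one common definition that matches the stated properties (M samples, dBFS units, range (-∞, 0]): the mean-square value of the frame expressed in decibels. It assumes samples normalized to [-1, 1]; the name `frame_energy_dbfs` is hypothetical.

```python
import math
from typing import Sequence


def frame_energy_dbfs(frame: Sequence[float]) -> float:
    """Mean-square energy of one frame in dBFS.

    Assumption: samples are normalized to [-1, 1], so full scale maps to 0 dBFS.
    """
    if not frame:
        raise ValueError("frame must contain at least one sample")
    mean_square = sum(s * s for s in frame) / len(frame)  # (1/M) * sum of x_i^2
    if mean_square == 0.0:
        return float("-inf")  # digital silence
    return 10.0 * math.log10(mean_square)
```

Under this definition a frame of constant amplitude 0.1 has an energy of about -20 dBFS, and a full-scale frame has 0 dBFS.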
x(n) is input to the voice activity detection module, which determines whether the audio of the current n-th frame is in a speech phase or a non-speech phase (noise, silence, and the like). The two states may be expressed as equation (2):
$$\mathrm{vad}(n) = \begin{cases} 1, & \text{speech active} \\ 0, & \text{speech inactive} \end{cases} \tag{2}$$
Here, vad(n) = 1 indicates that the current frame is a speech frame (speech active), and vad(n) = 0 indicates that the current frame is a non-speech frame (speech inactive). No restriction is placed on the VAD algorithm.
Next, energyraw(n) and vad(n) are input to the speech energy histogram statistics module, which collects statistics on the speech energy over a past period (whose length may be set differently according to the actual situation). The abscissa of the speech energy histogram consists of the energy intervals, each of which may be 1 dB wide, and the ordinate is the proportion of speech frames falling in each energy interval over the period. The speech energy distribution data of the current n-th audio frame is denoted HistogramEnergy(n), and the statistics are collected as follows.
First, HistogramEnergy(n) may be divided into a number of energy intervals. Here, 100 energy intervals with a width of 1 dB each are taken as an example; the present disclosure is not limited thereto, and these values may be adjusted according to actual requirements. HistogramEnergy(n) divided in this way may be expressed as equation (3):
$$\mathrm{HistogramEnergy}(n) = [e_1(n), e_2(n), \ldots, e_{100}(n)] \tag{3}$$
The energy corresponding to the energy interval indices increases in order, with the following correspondence:
[Correspondence between energy interval indices and energy ranges: published as image PCTCN2022107039-appb-000003; not reproduced in this text]
The initial probability (that is, the initial ratio) of each energy interval of the speech energy distribution data may be set to a uniform distribution. Taking 100 energy intervals as an example:
[Initial uniform distribution: published as image PCTCN2022107039-appb-000004; not reproduced in this text]
When vad(n) = 0 or energyraw(n) < noisefloor, the current frame is a non-speech frame or its audio energy is below the noise threshold noisefloor (which may be set to -50 dBFS, but is not limited thereto). In this case the energy of the current n-th audio frame need not contribute to the statistics of the speech energy distribution data. For example, the speech energy distribution data of the audio frame preceding the current audio frame may be used as the speech energy distribution data of the current audio frame, that is, HistogramEnergy(n) = HistogramEnergy(n-1).
When vad(n) = 1 and energyraw(n) ≥ noisefloor, the current frame is a speech frame and HistogramEnergy(n) may be updated. For example, the energy interval to which the energy of the current audio frame belongs in the speech energy distribution data may be determined; the speech frame ratio of the corresponding energy interval in the speech energy distribution data of the preceding audio frame is increased, and the speech frame ratios of the energy intervals that do not correspond to the determined energy interval are decreased.
For example, the index in equation (3) of the energy interval containing the energy energyraw(n) of the current audio frame is first identified and denoted ex(n). The update of the speech energy distribution data may then be expressed as equation (4):
[Equation (4): published as image PCTCN2022107039-appb-000005; not reproduced in this text]
Here, histSmooth is the smoothing factor used for the speech energy distribution statistics. It may be set to 0.95, but may be adjusted according to requirements and the actual situation; alternatively, different parameters may be selected depending on the energy interval in which energyraw(n) falls. The above example is merely illustrative, and the present disclosure is not limited thereto.
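The gating and update logic can be sketched as follows. Equation (4) and the index-to-energy mapping are not legible in this text, so two things here are assumptions rather than the disclosed formulas: the update is written as a standard exponential smoothing (every interval is scaled by histSmooth and the hit interval receives 1 - histSmooth), and the 100 intervals are assumed to cover [-100, 0] dBFS in 1 dB steps. All names are hypothetical.

```python
from typing import List, Sequence

NUM_BINS = 100          # example from the text
BIN_WIDTH_DB = 1.0      # example from the text
MIN_ENERGY_DB = -100.0  # assumption: lower edge of the first interval
NOISE_FLOOR_DB = -50.0  # example value from the text
HIST_SMOOTH = 0.95      # example value from the text


def initial_histogram() -> List[float]:
    """Uniform initial distribution whose entries sum to 1."""
    return [1.0 / NUM_BINS] * NUM_BINS


def bin_index(energy_db: float) -> int:
    """Zero-based index of the interval containing energy_db (assumed mapping)."""
    clamped = min(max(energy_db, MIN_ENERGY_DB), 0.0)
    idx = int((clamped - MIN_ENERGY_DB) // BIN_WIDTH_DB)
    return min(NUM_BINS - 1, idx)


def update_histogram(prev: Sequence[float], energy_db: float, vad: int) -> List[float]:
    """Carry the histogram over for gated frames; otherwise smooth toward the hit bin."""
    if vad == 0 or energy_db < NOISE_FLOOR_DB:
        return list(prev)  # HistogramEnergy(n) = HistogramEnergy(n-1)
    hit = bin_index(energy_db)
    # Assumed update rule: decay all bins, then raise the hit bin.
    updated = [HIST_SMOOTH * p for p in prev]
    updated[hit] += 1.0 - HIST_SMOOTH
    return updated
```

With this assumed rule the hit interval grows and every other interval shrinks, as the text describes, and the entries continue to sum to 1 up to floating-point rounding.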
In addition, because all probabilities in HistogramEnergy(n) must sum to 1, the sum of the probabilities of the speech energy distribution data obtained in the above steps is calculated, the difference between this sum and 1 (that is, the residual probability) is computed, and this difference is distributed over the entire speech energy distribution data. In some embodiments, the residual probability residualPro(n) may be calculated according to equation (5):
[Equation (5): published as image PCTCN2022107039-appb-000006; not reproduced in this text]
After the residual probability has been obtained, it may be distributed according to equation (6):
[Equation (6): published as image PCTCN2022107039-appb-000007; not reproduced in this text]
The above step of distributing the residual probability is repeated until residualPro(n) = 0, at which point the update of HistogramEnergy(n) is complete.
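The residual redistribution can be sketched as below. Equations (5) and (6) are not legible in this text, so the sign convention (residual = preset value minus current sum) and the equal-share allocation are assumptions, as are the name `redistribute_residual` and the tolerance used in place of an exact comparison with zero.

```python
from typing import List, Sequence


def redistribute_residual(hist: Sequence[float],
                          target_sum: float = 1.0,
                          tol: float = 1e-12,
                          max_iter: int = 10) -> List[float]:
    """Spread the residual probability over all bins until the sum hits target_sum.

    Assumption: the residual is shared equally among the bins; the disclosed
    allocation rule may differ.
    """
    if not hist:
        raise ValueError("histogram must not be empty")
    out = list(hist)
    for _ in range(max_iter):
        residual = target_sum - sum(out)
        if abs(residual) <= tol:  # floating-point stand-in for residual == 0
            break
        share = residual / len(out)
        out = [p + share for p in out]
    return out
```

In floating-point arithmetic an exact residual of zero is rarely reached, which is why the sketch stops at a small tolerance and caps the number of passes.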
According to an embodiment of the present disclosure, the speech energy distribution data is updated continuously from the first frame onward, and the corresponding speech energy distribution data may be updated for each audio frame of the input audio according to equation (6) above.
The updated HistogramEnergy(n), vad(n), and x(n) are input to the dynamic preset gain (first-stage gain) module, which combines the energy distribution information collected over a period with the voice activity detection information to calculate the gain gainPre(n) to be applied to the current audio frame, thereby obtaining the audio xGainPre(n) to which the dynamic preset gain has been applied. The calculation is as follows.
The state of the current audio frame is determined from vad(n), as shown in equation (7):
[Equation (7): published as image PCTCN2022107039-appb-000008; not reproduced in this text]
Here, the state of the current audio frame may refer to the frame count currently associated with the current audio frame.
Next, starting from the first energy interval of the speech energy distribution data for the current audio frame, the speech frame proportions of the energy intervals are accumulated in order until the accumulated sum is equal to or greater than a preset threshold. When the accumulated sum equals the preset threshold, the upper limit of the energy interval at which the sum reaches the threshold is taken as the first energy limit; when the accumulated sum exceeds the preset threshold, the lower limit of the energy interval at which the sum exceeds the threshold is taken as the first energy limit.
As an example, the bound energyLevel (the first energy limit) below which the energy falls with probability probThre (the preset threshold) is determined from HistogramEnergy(n); that is, a proportion probThre of the energy in the speech energy distribution data collected over a period of time lies below energyLevel. Here, probThre may be set to 95%, but is not limited thereto.
The calculation may proceed as follows: (i) set probSum=0; (ii) accumulate the energy probabilities in HistogramEnergy(n) in order, i.e., probSum=probSum+e_i(n) (i=1,2,...,100); (iii) after each accumulation, compare probSum with probThre: if probSum < probThre, continue with the next energy probability; if probSum=probThre, energyLevel is the upper limit of the energy interval of e_i(n) and the calculation stops; if probSum > probThre, energyLevel is the lower limit of the energy interval of e_i(n) and the calculation stops.
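The accumulation in steps (i) to (iii) can be sketched in Python as follows. This is only an illustration under stated assumptions, not the disclosed implementation: the function name `energy_level`, the `bin_edges` list giving each interval's limits in dB, the floating-point tolerance used for the equality test, and the fallback return value are all introduced here for the example.

```python
import math


def energy_level(histogram, bin_edges, prob_thre=0.95):
    """Return the energy bound below which about prob_thre of speech frames fall.

    histogram[i] is the speech-frame proportion e_i(n) of interval i.
    bin_edges[i] and bin_edges[i + 1] are the lower and upper limits of
    interval i (assumed layout, in dB).
    """
    prob_sum = 0.0  # step (i)
    for i, prob in enumerate(histogram):
        prob_sum += prob  # step (ii)
        # step (iii): equality is tested with a small tolerance (assumption)
        if math.isclose(prob_sum, prob_thre, abs_tol=1e-9):
            return bin_edges[i + 1]  # sum equals threshold: upper limit
        if prob_sum > prob_thre:
            return bin_edges[i]  # sum exceeds threshold: lower limit
    # Fallback when the histogram never reaches the threshold (assumption).
    return bin_edges[-1]
```

With three hypothetical intervals bounded by -60, -40, -20, and 0 dB, a histogram of [0.5, 0.45, 0.05] reaches the 95% threshold exactly at the second interval, so its upper limit (-20 dB) is returned.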
The initial dynamic preset gain (the initial first gain) gainPreRaw is calculated from the energyLevel obtained above, and can be expressed as equation (8):
gainPreRaw=EnergyTarget-energyLevel        (8)
where EnergyTarget is the energy that the current audio frame is expected to reach; here it may be set to -18dB, and it can be adjusted as needed. The resulting gainPreRaw must be constrained to a gain range. In practice the range is typically [-6dB, 12dB], which can also be adjusted as needed. gainPreRaw can be adjusted according to equation (9) below:
[Equation (9): image PCTCN2022107039-appb-000009, not reproduced in this text]
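Equation (8) and the gain range control can be sketched as follows. Since the image of equation (9) is not available in this text, the clamp below is an assumed reading based only on the stated range [-6dB, 12dB]; the function and parameter names are hypothetical.

```python
def initial_pre_gain(energy_level_db, energy_target_db=-18.0,
                     gain_min_db=-6.0, gain_max_db=12.0):
    """Compute gainPreRaw (equation (8)) and limit it to the gain range."""
    gain_pre_raw = energy_target_db - energy_level_db  # equation (8)
    # Assumed form of equation (9): clamp to [gain_min_db, gain_max_db].
    return max(gain_min_db, min(gain_max_db, gain_pre_raw))
```

For example, an energyLevel of -20 dB gives a raw gain of 2 dB, which lies inside the range and is returned unchanged, whereas an energyLevel of -30 dB would give 12 dB after limiting.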
The dynamic preset gain gainPre(n) (the first gain) to be applied to the current audio frame is calculated from the initial dynamic preset gain gainPreRaw obtained above and silenceState(n).
The initial first gain may be adjusted by comparing the frame count corresponding to the current audio frame with a preset frame count, and the adjusted initial first gain is used as the first gain. The method may be as follows.
If silenceState(n) ≥ silThre (where silThre is the number of audio frames corresponding to a period of time, which may be 1 to 2 seconds), then gainPre(n)=gainPreRaw;
If silenceState(n) < silThre and gainPreRaw ≥ gainPre(n-1), then gainPre(n)=gainPreRaw×(1-sAtt)+gainPre(n-1)×sAtt, where sAtt is the follow-up (attack) smoothing factor, typically set to 0.9999, and may be set according to the actual situation;
If silenceState(n) < silThre and gainPreRaw < gainPre(n-1), then gainPre(n)=gainPreRaw×(1-sRel)+gainPre(n-1)×sRel, where sRel is the release smoothing factor, typically set to 0.99, and may be set according to the actual situation.
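The three cases above amount to a one-pole smoother with separate coefficients for rising and falling gain. A minimal sketch follows; the function name and argument order are assumptions, while the default factors are the values quoted in the text.

```python
def update_pre_gain(gain_pre_raw, gain_pre_prev, silence_state, sil_thre,
                    s_att=0.9999, s_rel=0.99):
    """Return gainPre(n) from gainPreRaw, gainPre(n-1), and silenceState(n)."""
    if silence_state >= sil_thre:
        # Frame count has reached the threshold: take the raw gain directly.
        return gain_pre_raw
    # Rising gain uses sAtt; falling gain uses sRel.
    s = s_att if gain_pre_raw >= gain_pre_prev else s_rel
    return gain_pre_raw * (1.0 - s) + gain_pre_prev * s
```

With the default factors, an upward change is followed very slowly (only 0.01% of the gap per frame), while a downward change closes 1% of the gap per frame, so the gain falls faster than it rises.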
After the first gain gainPre(n) is obtained, it is applied to the original input audio x(n), as shown in equation (10) below:
[Equation (10): image PCTCN2022107039-appb-000010, not reproduced in this text]
This yields the audio signal with the dynamic preset gain (first-stage gain) applied.
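Applying a gain expressed in decibels to a frame of samples can be sketched as below. The image of equation (10) is not available here, so the amplitude convention (a factor of 10^(gain/20)) is an assumption, as are the function name and the representation of a frame as a list of floats.

```python
def apply_gain_db(frame, gain_db):
    """Scale every sample of a frame by a gain given in dB (amplitude convention)."""
    linear = 10.0 ** (gain_db / 20.0)  # dB to linear amplitude factor (assumed)
    return [sample * linear for sample in frame]
```

Under this convention a gain of 20 dB multiplies each sample by 10, and a gain of 0 dB leaves the frame unchanged.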
gainPre(n), xGainPre(n), energyraw(n), and vad(n) are input to the second-stage gain calculation module, which further calculates the second gain gainPost(n) to be applied to the current audio frame. Because the gain corresponding to each sampling point in the current audio frame differs, this gain is treated as a gain vector.
In some embodiments, the audio energy after the dynamic preset gain is applied (the second energy limit) can be obtained from gainPre(n) and energyraw(n), as shown in equation (11) below:
energyGainPreRaw=gainPre(n)+energyraw(n)       (11)
This audio energy is smoothed, as shown in equation (12) below:
energyGainPreSmooth(n)=energyGainPreSmooth(n-1)×smoothEnergy+energyGainPreRaw×(1-smoothEnergy)            (12)
where smoothEnergy is the smoothing factor, energyGainPreSmooth(n) is the smoothed audio energy of the current audio frame, and energyGainPreSmooth(n-1) is the smoothed audio energy of the previous audio frame. When the current audio frame is the first frame, energyGainPreSmooth(n-1) may be set to zero.
The expected second-stage gain (the initial second gain) of the current audio frame can be calculated from energyGainPreSmooth(n) and EnergyTarget, as shown in equation (13) below:
gainPostRaw=EnergyTarget-energyGainPreSmooth(n)        (13)
As with gainPreRaw, the gain range of gainPostRaw also needs to be limited:
[Gain range limit for gainPostRaw: image PCTCN2022107039-appb-000011, not reproduced in this text]
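Equations (11) to (13) and the subsequent range limit can be chained as in the sketch below. The text gives no value for smoothEnergy and no numeric range for the second stage, so the defaults `smooth_energy=0.9`, `gain_min_db=-6.0`, and `gain_max_db=12.0` are placeholders chosen for illustration, and the function name is hypothetical.

```python
def initial_post_gain(gain_pre_db, energy_raw_db, energy_smooth_prev_db,
                      energy_target_db=-18.0, smooth_energy=0.9,
                      gain_min_db=-6.0, gain_max_db=12.0):
    """Return (gainPostRaw, energyGainPreSmooth(n)) for the current frame."""
    # Equation (11): energy after the first-stage gain, in dB.
    energy_gain_pre_raw = gain_pre_db + energy_raw_db
    # Equation (12): recursive smoothing of that energy.
    energy_smooth = (energy_smooth_prev_db * smooth_energy
                     + energy_gain_pre_raw * (1.0 - smooth_energy))
    # Equation (13): distance from the target energy.
    gain_post_raw = energy_target_db - energy_smooth
    # Range limit (assumed to be a clamp; the limits are placeholders).
    gain_post_raw = max(gain_min_db, min(gain_max_db, gain_post_raw))
    return gain_post_raw, energy_smooth
```

The smoothed energy is returned alongside the gain so that the caller can pass it back in as `energy_smooth_prev_db` for the next frame.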
The gain vector gainPost(n) of the current audio frame is obtained from the gain after the above gain range control. This vector has the same dimension as xGainPre(n); that is, if both contain M elements, the gain vector and xGainPre(n) can be expressed as:
gainPost(n)=[gainPost_1(n), gainPost_2(n), ..., gainPost_M(n)]^T
xGainPre(n)=[xGainPre_1(n), xGainPre_2(n), ..., xGainPre_M(n)]^T
Each element of the current second-stage gain vector can be calculated according to equation (14) below:
[Equation (14): image PCTCN2022107039-appb-000012, not reproduced in this text]
where i denotes the i-th element of the current second-stage gain vector, and gainPost_M(n-1) denotes the gain of the M-th sampling point of the previous audio frame (i.e., the last sampling point of the previous audio frame).
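Because the image of equation (14) is not available in this text, its exact form is unknown. One plausible reading, given that each element depends on the index i, on gainPostRaw, and on the last gain of the previous frame, is a linear ramp from gainPost_M(n-1) to gainPostRaw across the frame. The sketch below implements that reading purely as an assumption; the function name is hypothetical.

```python
def post_gain_vector(gain_post_raw, gain_prev_last, num_samples):
    """Build a per-sample gain vector (dB) ramping toward gain_post_raw.

    Assumed form: element i (1-based) equals
    gain_prev_last + (gain_post_raw - gain_prev_last) * i / M.
    """
    step = (gain_post_raw - gain_prev_last) / num_samples
    return [gain_prev_last + step * (i + 1) for i in range(num_samples)]
```

A ramp of this kind avoids a step change in gain at the frame boundary, since the first sample starts close to the gain that ended the previous frame.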
The second-stage gain vector gainPost(n) obtained above is multiplied element-wise with xGainPre(n) (note that the gain must first be converted from dBFS to linear gain) to obtain the audio signal xGainPost(n) with the second-stage gain applied, as shown in equation (15) below:
[Equation (15): image PCTCN2022107039-appb-000013, not reproduced in this text]
The output audio xGainPost(n) obtained above is input to the limiter module to ensure that the audio does not suffer clipping distortion, as shown in equation (16) below:
y(n)=Limiter[xGainPost(n)]           (16)
where Limiter[*] denotes amplitude limiting protection applied to the input signal, and y(n) is the final output frame of audio after AGC processing.
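The last two steps, applying the per-sample gain vector and limiting the result, can be sketched as below. The text does not specify the limiter design, so a plain amplitude clamp stands in for Limiter[*] here; practical limiters usually reduce gain smoothly rather than clamp. The dB-to-linear conversion with a factor of 10^(g/20), the full-scale ceiling of 1.0, and both function names are assumptions.

```python
def apply_gain_vector_db(frame, gain_vector_db):
    """Multiply each sample by its own gain, converting dB to linear first."""
    return [s * 10.0 ** (g / 20.0) for s, g in zip(frame, gain_vector_db)]


def hard_limiter(frame, ceiling=1.0):
    """Stand-in for Limiter[*]: clamp amplitudes to [-ceiling, ceiling]."""
    return [max(-ceiling, min(ceiling, s)) for s in frame]
```

A call such as `hard_limiter(apply_gain_vector_db(x_gain_pre, gain_post))` then corresponds to equations (15) and (16) taken together, where `x_gain_pre` and `gain_post` are the caller's frame and gain vector.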
FIG. 3 is a block diagram of an audio processing apparatus according to an embodiment of the present disclosure.
Referring to FIG. 3, the audio processing apparatus 300 may include an acquisition module 301, a determination module 302, a first gain module 303, and a second gain module 304. Each module in the audio processing apparatus 300 may be implemented by one or more modules, and the name of a module may vary according to its type. In various embodiments, some modules in the audio processing apparatus 300 may be omitted, or additional modules may be included. In addition, modules/elements according to various embodiments of the present disclosure may be combined to form a single entity, which may equivalently perform the functions of the respective modules/elements before combination.
The acquisition module 301 may acquire the current audio frame to be processed.
The determination module 302 may determine the energy and type of the current audio frame, the type being one of a speech frame and a non-speech frame.
The determination module 302 may obtain speech energy distribution data for the current audio frame based on the energy and type of the current audio frame, where the speech energy distribution data records the proportions of speech frames falling in different energy intervals.
The first gain module 303 may determine a first gain for the current audio frame according to the speech energy distribution data for the current audio frame, and apply the first gain to the current audio frame to obtain a first audio frame.
The second gain module 304 may determine a second gain for the current audio frame based on the first gain and the energy of the current audio frame, and apply the second gain to the first audio frame to obtain a second audio frame.
When the energy of the current audio frame is less than a preset noise threshold or the current audio frame is a non-speech frame, the determination module 302 may use the speech energy distribution data of the previous audio frame as the speech energy distribution data of the current audio frame. When the energy of the current audio frame is greater than or equal to the preset noise threshold and the current audio frame is a speech frame, the determination module 302 may update the speech energy distribution data of the previous audio frame based on the energy of the current audio frame. When the current audio frame is the first frame, the determination module 302 may update initial speech energy distribution data based on the energy of the current audio frame; in the initial speech energy distribution data, the speech frame proportions are uniformly distributed over the energy intervals.
The determination module 302 may determine the energy interval of the speech energy distribution data to which the energy of the current audio frame belongs, increase the speech frame proportion of the corresponding energy interval in the speech energy distribution data of the previous audio frame, and decrease the speech frame proportions of the energy intervals that do not correspond to the determined energy interval.
The determination module 302 may calculate the sum of the speech frame proportions over all energy intervals of the updated speech energy distribution data, determine a residual probability by comparing that sum with a preset value, and allocate the residual probability to the energy intervals of the updated speech energy distribution data until the sum of the speech frame proportions over all energy intervals equals the preset value. For example, the preset value may be set to 1.
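The residual allocation can be illustrated with the sketch below. The exact allocation rule belongs to an earlier part of the description that is not reproduced here, so the even split across intervals, the clamping of proportions at zero, the tolerance, and the iteration cap are all assumptions; the function name is hypothetical and the histogram is assumed to be non-empty.

```python
def redistribute_residual(histogram, target_sum=1.0, tol=1e-12, max_iter=100):
    """Spread the residual probability so the proportions sum to target_sum."""
    hist = list(histogram)
    for _ in range(max_iter):
        residual = target_sum - sum(hist)
        if abs(residual) <= tol:
            break  # residual is (numerically) zero: update finished
        if residual > 0:
            targets = list(range(len(hist)))  # add to every interval
        else:
            # Only take probability away from intervals that still hold some.
            targets = [i for i, p in enumerate(hist) if p > 0.0]
        share = residual / len(targets)
        for i in targets:
            hist[i] = max(0.0, hist[i] + share)  # proportions stay non-negative
    return hist
```

Clamping at zero can leave part of a negative residual unallocated, which is why the loop repeats until the residual vanishes, mirroring the "repeat until the sum equals the preset value" wording.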
The first gain module 303 may accumulate, in order from the first energy interval of the speech energy distribution data for the current audio frame, the speech frame proportions of the energy intervals until the accumulated sum is equal to or greater than a preset threshold. When the accumulated sum equals the preset threshold, the first gain module 303 may take the upper limit of the energy interval at which the sum reaches the threshold as the first energy limit. When the accumulated sum exceeds the preset threshold, the first gain module 303 may take the lower limit of the energy interval at which the sum exceeds the threshold as the first energy limit. The first gain module 303 may then determine the first gain according to the target energy of the current audio frame and the first energy limit.
The first gain module 303 may determine an initial first gain according to the target energy of the current audio frame and the first energy limit, determine the frame count corresponding to the current audio frame according to the type of the current audio frame, adjust the initial first gain by comparing that frame count with a preset frame count, and use the adjusted initial first gain as the first gain.
The second gain module 304 may determine a second energy limit based on the first gain and the energy of the current audio frame, determine an initial second gain according to the target energy of the current audio frame and the second energy limit, and convert the initial second gain into a second gain vector according to the audio sampling points in the current audio frame.
The second gain module 304 may calculate a gain for each audio sampling point in the current audio frame based on the initial second gain and the gain of the last audio sampling point in the previous audio frame, so as to generate the second gain vector.
The second gain module 304 may apply each gain in the second gain vector to the corresponding audio sampling point of the first audio frame to obtain the second audio frame, and perform amplitude limiting on the second audio frame.
The automatic gain control process according to embodiments of the present disclosure has been described in detail above with reference to FIG. 1 and FIG. 2, and is not repeated here.
FIG. 4 is a schematic structural diagram of an audio processing device in the hardware operating environment of an embodiment of the present disclosure.
As shown in FIG. 4, the audio processing device 400 may include a processing component 401, a communication bus 402, a network interface 403, an input/output interface 404, a memory 405, and a power supply component 404. The communication bus 402 provides connection and communication among these components. The input/output interface 404 may include a video display (such as a liquid crystal display), a microphone and a speaker, and a user interaction interface (such as a keyboard, mouse, or touch input device); in some embodiments, the input/output interface 404 may also include a standard wired interface and a wireless interface. The network interface 403 may include a standard wired interface and a wireless interface (such as a Wi-Fi interface). The memory 405 may be a high-speed random access memory or a stable non-volatile memory. The memory 405 may also be a storage device independent of the processing component 401.
Those skilled in the art will understand that the structure shown in FIG. 4 does not limit the audio processing device 400, which may include more or fewer components than shown, combine certain components, or use a different arrangement of components.
As shown in FIG. 4, the memory 405, as a storage medium, may include an operating system (such as a MAC operating system), a data storage module, a network communication module, a user interface module, an audio processing program, and a database.
In the audio processing device 400 shown in FIG. 4, the network interface 403 is mainly used for data communication with external electronic devices/terminals, and the input/output interface 404 is mainly used for data interaction with the user. The processing component 401 and the memory 405 may be provided in the audio processing device 400, and the audio processing device 400 uses the processing component 401 to call the audio processing program and materials stored in the memory 405, together with the various APIs provided by the operating system, to execute the audio processing method provided by embodiments of the present disclosure.
The processing component 401 may include at least one processor, and the memory 405 stores a set of computer-executable instructions which, when executed by the at least one processor, performs the audio processing method according to embodiments of the present disclosure. However, the above is merely an example, and the present disclosure is not limited thereto.
The processing component 401 may acquire the current audio frame to be processed, determine the energy and type of the current audio frame, obtain speech energy distribution data for the current audio frame based on that energy and type, determine a first gain for the current audio frame according to the speech energy distribution data, and apply the first gain to the current audio frame to obtain a first audio frame. It may then determine a second gain for the current audio frame based on the first gain and the energy of the current audio frame, and apply the second gain to the first audio frame to obtain a second audio frame.
The processing component 401 may control the components included in the audio processing device 400 by executing a program.
The audio processing device 400 may receive or output video and/or audio via the input/output interface 404. For example, the audio processing device 400 may output the gain-adjusted audio signal via the input/output interface 404.
As an example, the audio processing device 400 may be a PC, a tablet device, a personal digital assistant, a smartphone, or another device capable of executing the above set of instructions. The audio processing device 400 need not be a single electronic device; it may be any collection of devices or circuits capable of executing the above instructions (or instruction sets) individually or jointly. The audio processing device 400 may also be part of an integrated control system or system manager, or may be configured as a portable electronic device that interfaces locally or remotely (e.g., via wireless transmission).
In the audio processing device 400, the processing component 401 may include a central processing unit (CPU), a graphics processing unit (GPU), a programmable logic device, a special-purpose processor system, a microcontroller, or a microprocessor. By way of example and not limitation, the processing component 401 may also include an analog processor, a digital processor, a microprocessor, a multi-core processor, a processor array, a network processor, and the like.
The processing component 401 may execute instructions or code stored in the memory 405, and the memory 405 may also store data. Instructions and data may also be sent and received over a network via the network interface 403, which may use any known transmission protocol.
The memory 405 may be integrated with the processing component 401, for example, with RAM or flash memory arranged within an integrated circuit microprocessor. In addition, the memory 405 may comprise a separate device, such as an external disk drive, a storage array, or any other storage device usable by a database system. The memory 405 and the processing component 401 may be operatively coupled, or may communicate with each other, for example through an I/O port or a network connection, so that the processing component 401 can read data stored in the memory 405.
According to embodiments of the present disclosure, an electronic device may be provided. FIG. 5 is a block diagram of an electronic device according to an embodiment of the present disclosure. The electronic device 500 may include at least one memory 502 and at least one processor 501, the at least one memory 502 storing a set of computer-executable instructions which, when executed by the at least one processor 501, performs the audio processing method according to embodiments of the present disclosure.
The processor 501 may include a central processing unit (CPU), a graphics processing unit (GPU), a programmable logic device, a special-purpose processor system, a microcontroller, or a microprocessor. By way of example and not limitation, the processor 501 may also include an analog processor, a digital processor, a microprocessor, a multi-core processor, a processor array, a network processor, and the like.
The memory 502, as a storage medium, may include an operating system (e.g., a MAC operating system), a data storage module, a network communication module, a user interface module, an audio processing program, and a database.
The memory 502 may be integrated with the processor 501, for example, with RAM or flash memory arranged within an integrated circuit microprocessor. In addition, the memory 502 may comprise a separate device, such as an external disk drive, a storage array, or any other storage device usable by a database system. The memory 502 and the processor 501 may be operatively coupled, or may communicate with each other, for example through an I/O port or a network connection, so that the processor 501 can read files stored in the memory 502.
In addition, the electronic device 500 may further include a video display (such as a liquid crystal display) and a user interaction interface (such as a keyboard, mouse, or touch input device). All components of the electronic device 500 may be connected to each other via a bus and/or a network.
Those skilled in the art will understand that the structure shown in FIG. 5 does not limit the electronic device 500, which may include more or fewer components than shown, combine certain components, or use a different arrangement of components.
According to embodiments of the present disclosure, a computer-readable storage medium storing instructions may also be provided, where the instructions, when executed by at least one processor, cause the at least one processor to perform the audio processing method according to the present disclosure. Examples of the computer-readable storage medium include: read-only memory (ROM), random-access programmable read-only memory (PROM), electrically erasable programmable read-only memory (EEPROM), random access memory (RAM), dynamic random access memory (DRAM), static random access memory (SRAM), flash memory, non-volatile memory, CD-ROM, CD-R, CD+R, CD-RW, CD+RW, DVD-ROM, DVD-R, DVD+R, DVD-RW, DVD+RW, DVD-RAM, BD-ROM, BD-R, BD-R LTH, BD-RE, Blu-ray or optical disc storage, hard disk drive (HDD), solid-state drive (SSD), card-type memory (such as a multimedia card, a Secure Digital (SD) card, or an Extreme Digital (XD) card), magnetic tape, floppy disk, magneto-optical data storage device, optical data storage device, hard disk, solid-state disk, and any other device configured to store a computer program and any associated data, data files, and data structures in a non-transitory manner and to provide them to a processor or computer so that the processor or computer can execute the computer program. The computer program in the above computer-readable storage medium can run in an environment deployed on computer equipment such as a client, a host, a proxy device, or a server. In addition, in one example, the computer program and any associated data, data files, and data structures are distributed over networked computer systems so that they are stored, accessed, and executed in a distributed manner by one or more processors or computers.
According to embodiments of the present disclosure, a computer program product may also be provided, the instructions of which are executable by a processor of a computer device to carry out the above audio processing method.
Other embodiments of the present disclosure will readily occur to those skilled in the art upon consideration of the specification and practice of what is disclosed herein. This application is intended to cover any variations, uses, or adaptations of the present disclosure that follow its general principles and include common knowledge or customary technical means in the art not disclosed herein. The specification and embodiments are to be regarded as exemplary only, with the true scope and spirit of the present disclosure being indicated by the following claims.
It should be understood that the present disclosure is not limited to the precise structures described above and shown in the drawings, and that various modifications and changes may be made without departing from its scope. The scope of the present disclosure is limited only by the appended claims.

Claims (21)

  1. An audio processing method, comprising:
    acquiring a current audio frame to be processed;
    determining an energy and a type of the current audio frame, the type comprising one of a speech frame and a non-speech frame;
    obtaining speech energy distribution data for the current audio frame based on the energy and the type of the current audio frame, wherein the speech energy distribution data is used to record the proportions of speech frames in different energy intervals;
    determining a first gain for the current audio frame according to the speech energy distribution data for the current audio frame; and
    applying the first gain to the current audio frame to obtain a first audio frame.
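As a rough illustration of the "determining an energy" step in claim 1, the sketch below computes the mean-square energy of one frame in dB. The claim does not specify how energy is measured; the function name `frame_energy_db`, the dB scale, and the `floor` value are assumptions made for this example only. The speech/non-speech decision (for example, from a voice activity detector) is not shown.

```python
import math


def frame_energy_db(samples, floor=1e-10):
    """Mean-square energy of one frame in dB (hypothetical helper).

    `samples` is a sequence of floats, assumed to be normalized to [-1, 1].
    `floor` is an assumed lower bound that keeps log10 defined for silence.
    """
    if not samples:
        raise ValueError("frame must contain at least one sample")
    mean_square = sum(s * s for s in samples) / len(samples)
    return 10.0 * math.log10(max(mean_square, floor))
```

With these assumptions, a full-scale square wave has an energy of 0 dB, and an all-zero frame is reported at the floor rather than at negative infinity.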
  2. The audio processing method according to claim 1, wherein obtaining the speech energy distribution data for the current audio frame based on the energy and the type of the current audio frame comprises:
    in response to the energy of the current audio frame being less than a preset noise threshold or the current audio frame being a non-speech frame, using the speech energy distribution data of the previous audio frame of the current audio frame as the speech energy distribution data of the current audio frame;
    in response to the energy of the current audio frame being greater than or equal to the preset noise threshold and the current audio frame being a speech frame, updating the speech energy distribution data of the previous audio frame based on the energy of the current audio frame,
    wherein, in response to the current audio frame being a first frame, initial speech energy distribution data is updated based on the energy of the current audio frame, the proportions of speech frames being uniformly distributed across the energy intervals of the initial speech energy distribution data.
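The branching in claim 2 can be sketched as follows. This is a minimal illustration, not the applicant's implementation: the names `uniform_distribution` and `distribution_for_frame`, the list-of-floats representation, and the `update_fn` callback (which stands in for the update of claim 3) are all hypothetical.

```python
def uniform_distribution(num_bins):
    """Initial distribution: every energy interval holds the same proportion."""
    if num_bins <= 0:
        raise ValueError("num_bins must be positive")
    return [1.0 / num_bins] * num_bins


def distribution_for_frame(prev_dist, energy, is_speech, noise_gate, update_fn):
    """Return the speech energy distribution to use for the current frame.

    prev_dist  -- distribution of the previous frame (or the uniform initial
                  distribution when the current frame is the first frame)
    energy     -- energy of the current frame, same unit as noise_gate
    is_speech  -- True if the frame was classified as a speech frame
    noise_gate -- preset noise threshold (assumed value chosen by the caller)
    update_fn  -- hypothetical callback that performs the actual update
    """
    # Quiet frames and non-speech frames leave the statistics unchanged.
    if energy < noise_gate or not is_speech:
        return list(prev_dist)
    # Speech frames at or above the noise threshold update the statistics.
    return update_fn(prev_dist, energy)
```

For the first frame, a caller would pass `uniform_distribution(n)` as `prev_dist`, which matches the "initial speech energy distribution data" of the claim.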
  3. The audio processing method according to claim 2, wherein updating the speech energy distribution data of the previous audio frame based on the energy of the current audio frame comprises:
    determining the energy interval to which the energy of the current audio frame belongs in the speech energy distribution data;
    increasing the speech frame proportion of the energy interval, in the speech energy distribution data of the previous audio frame, that corresponds to the determined energy interval; and
    decreasing the speech frame proportions of the energy intervals, in the speech energy distribution data of the previous audio frame, that do not correspond to the determined energy interval.
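One way to realize the increase/decrease rule of claim 3 is an exponential update, sketched below. The claim does not state the update rule or the interval layout; the equal-width dB intervals, the smoothing factor `alpha`, and the function names are assumptions introduced for illustration.

```python
def bin_index(energy_db, min_db, bin_width_db, num_bins):
    """Index of the energy interval containing energy_db (clamped to range).

    Assumes equal-width intervals starting at min_db; this layout is an
    illustrative choice, not taken from the claims.
    """
    idx = int((energy_db - min_db) // bin_width_db)
    return max(0, min(num_bins - 1, idx))


def update_distribution(prev_dist, hit, alpha=0.05):
    """Raise the proportion of interval `hit` and lower all the others.

    Every proportion is scaled by (1 - alpha) and alpha is added to the
    interval that received the current frame, so a distribution that summed
    to 1 still sums to 1 (up to floating-point rounding).
    """
    return [
        (1.0 - alpha) * p + (alpha if i == hit else 0.0)
        for i, p in enumerate(prev_dist)
    ]
```

With a small `alpha`, the distribution tracks long-term speech level statistics and reacts slowly to a single loud or quiet frame.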
  4. The audio processing method according to claim 2 or 3, wherein updating the speech energy distribution data of the previous audio frame based on the energy of the current audio frame comprises:
    calculating the sum of the speech frame proportions of the energy intervals in the updated speech energy distribution data;
    determining a residual probability by comparing the sum of the speech frame proportions with a preset value; and
    allocating the residual probability to the energy intervals of the updated speech energy distribution data until the sum of the proportions of speech frames across the energy intervals of the updated speech energy distribution data equals the preset value.
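The renormalization in claim 4 matters most when proportions are stored as fixed-point integers, where rounding can make the total drift from the preset value. The sketch below assumes such an integer representation and a round-robin allocation of one unit at a time; both are illustrative assumptions, as is the name `distribute_residual`.

```python
def distribute_residual(counts, preset_total):
    """Spread the rounding residual over the intervals, one unit at a time.

    counts       -- non-negative integer proportions (assumed fixed-point)
    preset_total -- value the proportions must sum to, e.g. 1024
    """
    if preset_total < 0 or any(c < 0 for c in counts):
        raise ValueError("counts and preset_total must be non-negative")
    if not counts:
        raise ValueError("counts must not be empty")
    out = list(counts)
    residual = preset_total - sum(out)  # positive: add units; negative: remove
    step = 1 if residual > 0 else -1
    i = 0
    while residual != 0:
        # Never push an interval below zero when removing units.
        if step > 0 or out[i] > 0:
            out[i] += step
            residual -= step
        i = (i + 1) % len(out)
    return out
```

The loop always terminates: while units must be removed, the current total exceeds a non-negative preset, so at least one interval is still positive.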
  5. The audio processing method according to claim 1, wherein determining the first gain for the current audio frame according to the speech energy distribution data for the current audio frame comprises:
    accumulating the speech frame proportions of the energy intervals in sequence, starting from the first energy interval of the speech energy distribution data for the current audio frame, until the accumulated sum is equal to or greater than a preset threshold;
    in response to the accumulated sum being equal to the preset threshold, taking the upper limit of the energy interval at which the accumulated sum becomes equal to the preset threshold as a first energy bound;
    in response to the accumulated sum being greater than the preset threshold, taking the lower limit of the energy interval at which the accumulated sum becomes greater than the preset threshold as the first energy bound; and
    determining the first gain according to a target energy of the current audio frame and the first energy bound.
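The accumulation in claim 5 amounts to a percentile lookup on the distribution. The sketch below follows the upper-limit/lower-limit rule of the claim; the `bin_edges` layout, the floating-point tolerance used for the "equal" case, and the dB-difference gain formula in `first_gain_db` are assumptions, since the claim does not state how the gain is derived from the target energy and the bound.

```python
import math


def first_energy_bound(dist, bin_edges, threshold):
    """Energy bound at which the cumulative proportion reaches `threshold`.

    dist      -- proportion of speech frames per energy interval
    bin_edges -- len(dist) + 1 ascending edges; interval i spans
                 bin_edges[i] .. bin_edges[i + 1] (assumed layout)
    """
    if len(bin_edges) != len(dist) + 1:
        raise ValueError("bin_edges must have len(dist) + 1 entries")
    total = 0.0
    for i, p in enumerate(dist):
        total += p
        # Sum equals the threshold: use the upper limit of this interval.
        if math.isclose(total, threshold, rel_tol=0.0, abs_tol=1e-9):
            return bin_edges[i + 1]
        # Sum exceeds the threshold: use the lower limit of this interval.
        if total > threshold:
            return bin_edges[i]
    # Threshold never reached (e.g. threshold > 1): fall back to the top edge.
    return bin_edges[-1]


def first_gain_db(target_db, bound_db):
    """Assumed gain rule: the dB offset that moves the bound to the target."""
    return target_db - bound_db
```

For example, with proportions `[0.1, 0.2, 0.3, 0.4]` over edges from -60 dB to -20 dB in 10 dB steps, a threshold of 0.5 is first exceeded in the third interval, so the bound is that interval's lower limit.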
  6. The audio processing method according to claim 5, wherein determining the first gain according to the target energy of the current audio frame and the first energy bound comprises:
    determining an initial first gain according to the target energy of the current audio frame and the first energy bound;
    determining a frame count corresponding to the current audio frame according to the type of the current audio frame; and
    adjusting the initial first gain by comparing the frame count corresponding to the current audio frame with a preset frame count, and using the adjusted initial first gain as the first gain.
  7. The audio processing method according to claim 1, further comprising:
    determining a second energy bound based on the first gain and the energy of the current audio frame;
    determining an initial second gain according to a target energy of the current audio frame and the second energy bound;
    obtaining a second gain vector based on the audio sampling points in the current audio frame and the initial second gain; and
    applying the second gain to the first audio frame to obtain a second audio frame.
  8. The audio processing method according to claim 7, wherein obtaining the second gain vector based on the audio sampling points in the current audio frame and the initial second gain comprises:
    calculating a gain for each audio sampling point in the current audio frame based on the gain of the last audio sampling point in the previous audio frame of the current audio frame and on the initial second gain, to generate the second gain vector.
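Claim 8 derives one gain per sampling point from the last gain of the previous frame and the initial second gain, but does not state the interpolation rule. The sketch below assumes a linear ramp, which is one common way to avoid audible steps at frame boundaries; the function name and the linear-domain gains are hypothetical.

```python
def second_gain_vector(prev_last_gain, initial_second_gain, frame_len):
    """Per-sample gains ramping from the previous frame's last gain.

    Assumed rule: linear interpolation, so that the final sample of the
    frame reaches initial_second_gain exactly and the first sample is one
    step away from prev_last_gain.
    """
    if frame_len <= 0:
        raise ValueError("frame_len must be positive")
    step = (initial_second_gain - prev_last_gain) / frame_len
    return [prev_last_gain + step * (n + 1) for n in range(frame_len)]
```

The last element of the returned vector is the value a caller would carry over as `prev_last_gain` for the next frame.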
  9. The audio processing method according to claim 7, wherein applying the second gain to the first audio frame to obtain the second audio frame comprises:
    applying each gain in the second gain vector to the corresponding audio sampling point of the first audio frame to obtain the second audio frame; and
    performing amplitude limiting on the second audio frame.
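The two steps of claim 9 can be sketched as a per-sample multiplication followed by a limiter. The claim does not say which kind of limiting is used; the hard clip to a symmetric `limit`, the normalized sample range, and the function name are assumptions for this example.

```python
def apply_gain_and_limit(samples, gains, limit=1.0):
    """Multiply each sample by its own gain, then hard-clip the amplitude.

    samples -- samples of the first audio frame, assumed normalized floats
    gains   -- second gain vector, one linear gain per sample
    limit   -- assumed maximum absolute amplitude after limiting
    """
    if len(samples) != len(gains):
        raise ValueError("samples and gains must have the same length")
    return [max(-limit, min(limit, s * g)) for s, g in zip(samples, gains)]
```

A production limiter would more likely use a soft knee or look-ahead to reduce distortion; the hard clip is used here only because it is the simplest form of amplitude limiting.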
  10. An audio processing apparatus, comprising:
    an acquisition module configured to acquire a current audio frame to be processed;
    a determination module configured to determine an energy and a type of the current audio frame, the type comprising one of a speech frame and a non-speech frame, and to obtain speech energy distribution data for the current audio frame based on the energy and the type of the current audio frame, wherein the speech energy distribution data is used to record the proportions of speech frames in different energy intervals; and
    a first gain module configured to determine a first gain for the current audio frame according to the speech energy distribution data for the current audio frame, and to apply the first gain to the current audio frame to obtain a first audio frame.
  11. The audio processing apparatus according to claim 10, wherein the determination module is configured to:
    in response to the energy of the current audio frame being less than a preset noise threshold or the current audio frame being a non-speech frame, use the speech energy distribution data of the previous audio frame of the current audio frame as the speech energy distribution data of the current audio frame;
    in response to the energy of the current audio frame being greater than or equal to the preset noise threshold and the current audio frame being a speech frame, update the speech energy distribution data of the previous audio frame based on the energy of the current audio frame,
    wherein, in response to the current audio frame being a first frame, initial speech energy distribution data is updated based on the energy of the current audio frame, the proportions of speech frames being uniformly distributed across the energy intervals of the initial speech energy distribution data.
  12. The audio processing apparatus according to claim 11, wherein the determination module is configured to:
    determine the energy interval to which the energy of the current audio frame belongs in the speech energy distribution data;
    increase the speech frame proportion of the energy interval, in the speech energy distribution data of the previous audio frame, that corresponds to the determined energy interval; and
    decrease the speech frame proportions of the energy intervals, in the speech energy distribution data of the previous audio frame, that do not correspond to the determined energy interval.
  13. The audio processing apparatus according to claim 11 or 12, wherein the determination module is configured to:
    calculate the sum of the speech frame proportions of the energy intervals in the updated speech energy distribution data;
    determine a residual probability by comparing the sum of the speech frame proportions with a preset value; and
    allocate the residual probability to the energy intervals of the updated speech energy distribution data until the sum of the proportions of speech frames across the energy intervals of the updated speech energy distribution data equals the preset value.
  14. The audio processing apparatus according to claim 10, wherein the first gain module is configured to:
    accumulate the speech frame proportions of the energy intervals in sequence, starting from the first energy interval of the speech energy distribution data for the current audio frame, until the accumulated sum is equal to or greater than a preset threshold;
    in response to the accumulated sum being equal to the preset threshold, take the upper limit of the energy interval at which the accumulated sum becomes equal to the preset threshold as a first energy bound;
    in response to the accumulated sum being greater than the preset threshold, take the lower limit of the energy interval at which the accumulated sum becomes greater than the preset threshold as the first energy bound; and
    determine the first gain according to a target energy of the current audio frame and the first energy bound.
  15. The audio processing apparatus according to claim 14, wherein the first gain module is configured to:
    determine an initial first gain according to the target energy of the current audio frame and the first energy bound;
    determine a frame count corresponding to the current audio frame according to the type of the current audio frame; and
    adjust the initial first gain by comparing the frame count corresponding to the current audio frame with a preset frame count, and use the adjusted initial first gain as the first gain.
  16. The audio processing apparatus according to claim 10, further comprising a second gain module configured to:
    determine a second energy bound based on the first gain and the energy of the current audio frame;
    determine an initial second gain according to a target energy of the current audio frame and the second energy bound;
    obtain a second gain vector based on the audio sampling points in the current audio frame and the initial second gain; and
    apply the second gain to the first audio frame to obtain a second audio frame.
  17. The audio processing apparatus according to claim 16, wherein the second gain module is configured to:
    calculate a gain for each audio sampling point in the current audio frame based on the gain of the last audio sampling point in the previous audio frame of the current audio frame and on the initial second gain, to generate the second gain vector.
  18. The audio processing apparatus according to claim 16, wherein the second gain module is configured to:
    apply each gain in the second gain vector to the corresponding audio sampling point of the first audio frame to obtain the second audio frame; and
    perform amplitude limiting on the second audio frame.
  19. An electronic device, comprising:
    at least one processor; and
    at least one memory storing computer-executable instructions,
    wherein the computer-executable instructions, when executed by the at least one processor, cause the at least one processor to perform the following steps:
    acquiring a current audio frame to be processed;
    determining an energy and a type of the current audio frame, the type comprising one of a speech frame and a non-speech frame;
    obtaining speech energy distribution data for the current audio frame based on the energy and the type of the current audio frame, wherein the speech energy distribution data is used to record the proportions of speech frames in different energy intervals;
    determining a first gain for the current audio frame according to the speech energy distribution data for the current audio frame; and
    applying the first gain to the current audio frame to obtain a first audio frame.
  20. A computer-readable storage medium storing instructions, wherein the instructions, when executed by at least one processor, cause the at least one processor to perform the following steps:
    acquiring a current audio frame to be processed;
    determining an energy and a type of the current audio frame, the type comprising one of a speech frame and a non-speech frame;
    obtaining speech energy distribution data for the current audio frame based on the energy and the type of the current audio frame, wherein the speech energy distribution data is used to record the proportions of speech frames in different energy intervals;
    determining a first gain for the current audio frame according to the speech energy distribution data for the current audio frame; and
    applying the first gain to the current audio frame to obtain a first audio frame.
  21. A computer program product, wherein instructions in the computer program product are executed by at least one processor in an electronic device to perform the following steps:
    acquiring a current audio frame to be processed;
    determining an energy and a type of the current audio frame, the type comprising one of a speech frame and a non-speech frame;
    obtaining speech energy distribution data for the current audio frame based on the energy and the type of the current audio frame, wherein the speech energy distribution data is used to record the proportions of speech frames in different energy intervals;
    determining a first gain for the current audio frame according to the speech energy distribution data for the current audio frame; and
    applying the first gain to the current audio frame to obtain a first audio frame.
PCT/CN2022/107039 2021-12-03 2022-07-21 Audio processing method and audio processing apparatus WO2023098103A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202111465600.1A CN114157254A (en) 2021-12-03 2021-12-03 Audio processing method and audio processing device
CN202111465600.1 2021-12-03

Publications (2)

Publication Number Publication Date
WO2023098103A1 WO2023098103A1 (en) 2023-06-08
WO2023098103A9 WO2023098103A9 (en) 2023-08-03

Family

ID=80455988

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2022/107039 WO2023098103A1 (en) 2021-12-03 2022-07-21 Audio processing method and audio processing apparatus

Country Status (2)

Country Link
CN (1) CN114157254A (en)
WO (1) WO2023098103A1 (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114157254A (en) * 2021-12-03 2022-03-08 北京达佳互联信息技术有限公司 Audio processing method and audio processing device

Family Cites Families (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103544961B (en) * 2012-07-10 2017-12-19 中兴通讯股份有限公司 Audio signal processing method and device
CN104200810B (en) * 2014-08-29 2017-07-18 无锡中感微电子股份有限公司 Automatic gain control equipment and method
CN112564655A (en) * 2019-09-26 2021-03-26 大众问问(北京)信息科技有限公司 Audio signal gain control method, device, equipment and storage medium
CN113496706B (en) * 2020-03-19 2023-05-23 抖音视界有限公司 Audio processing method, device, electronic equipment and storage medium
CN113539285B (en) * 2021-06-04 2023-10-31 浙江华创视讯科技有限公司 Audio signal noise reduction method, electronic device and storage medium
CN114157254A (en) * 2021-12-03 2022-03-08 北京达佳互联信息技术有限公司 Audio processing method and audio processing device

Also Published As

Publication number Publication date
CN114157254A (en) 2022-03-08
WO2023098103A1 (en) 2023-06-08

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 22899923

Country of ref document: EP

Kind code of ref document: A1