CN113473316B - Audio signal processing method, device and storage medium

Info

Publication number: CN113473316B
Application number: CN202110737797.3A
Authority: CN
Inventors: 严涛; 修平平; 朱赛男; 浦宏杰; 鄢仁祥
Original assignee: Suzhou Keda Technology Co Ltd
Current assignee: Suzhou Keda Technology Co Ltd
Priority date: 2021-06-30
Filing date: 2021-06-30
Publication date: 2023-01-31
Anticipated expiration: 2041-06-30
Also published as: CN113473316A

Abstract

The invention provides an audio signal processing method, an audio signal processing device and a storage medium. The audio signal processing method comprises the following steps: acquiring a current frame audio signal as a first audio signal; and carrying out first automatic gain processing on the first audio signal to obtain a second audio signal. The first automatic gain processing specifically includes: dividing the first audio signal into audio signals of a plurality of first subframes; calculating a first parameter value according to the audio parameter of each first subframe; comparing the first parameter value with a first threshold interval, and accumulating the number M of subframes whose first parameter value is not in the first threshold interval; and, when M is larger than a first statistic value, simultaneously performing signal gain on the sample points of the first audio signal. Because the gain is applied to the sample points of the first audio signal simultaneously, the dynamic range of the audio is maintained, the natural fluctuation of its volume is preserved, and the voice quality is improved.

Description

Audio signal processing method, device and storage medium

Technical Field

The present invention relates to the field of audio technologies, and in particular, to a method and an apparatus for processing an audio signal, and a storage medium.

Background

With the wide use of modern communication technologies, competition among communication enterprises continues to intensify. To strengthen their competitive position, these enterprises need to improve the quality of their communication signals and the stability, safety and efficiency of their communication systems.

Audio signal processing commonly adopts an automatic gain control (AGC) algorithm, which can improve the stability of the audio signal system and of the audio signal output and address signal distortion after AGC tuning. At present, however, most AGC algorithms process each sample point in a frame of the voice signal point by point, which causes a loss of voice dynamics and reduces the natural fluctuation of the voice.

Disclosure of Invention

In view of this, embodiments of the present invention provide an audio signal processing method, an audio signal processing apparatus, and a storage medium, so as to solve the technical problem in the prior art that a loss of speech dynamics is caused after audio signal processing.

In a first aspect, the present invention provides an audio signal processing method, including the steps of: acquiring a current frame audio signal as a first audio signal; carrying out first automatic gain processing on the first audio signal to obtain a second audio signal; the step of performing a first automatic gain processing on the first audio signal to obtain a second audio signal specifically includes: dividing the first audio signal into audio signals of a plurality of first sub-frames; calculating to obtain a first parameter value according to the audio parameter of each first subframe; comparing the size of the first parameter value with a first threshold interval, and accumulating to obtain the number M of the subframes of which the first parameter value is not in the first threshold interval; and when M is larger than a first statistic value, simultaneously performing signal gain on the sample points of the first audio signal.

Further, after the step of performing the first automatic gain processing on the first audio signal to obtain the second audio signal, the method further includes: carrying out echo cancellation processing on the second audio signal to obtain a third audio signal; performing second automatic gain processing on the third audio signal to obtain a fourth audio signal; and outputting the fourth audio signal.

Further, when voice activity detection (VAD) is performed on the third audio signal, if the VAD does not detect voice, the second automatic gain processing is: simultaneously performing signal gain on the sample points of the third audio signal, wherein the gain value of the second automatic gain is the gain value of the previous frame of the third audio signal.

Further, when the VAD is performed on the third audio signal, if the VAD detects speech, the second automatic gain processing specifically includes the following steps: slicing the third audio signal into audio signals of a plurality of second sub-frames; calculating to obtain a second parameter value according to the audio parameter of each second subframe; comparing the size of the second parameter value with a second threshold interval, and accumulating to obtain the number N of the subframes of which the second parameter value is not in the second threshold interval; judging whether N is larger than a second statistic numerical value; and if so, simultaneously performing signal gain on the sample points of the third audio signal.

Further, the step of performing signal gain on the sample points of the first audio signal simultaneously specifically includes the following steps: dividing the number M of first subframes which are not in the first threshold interval into two types, wherein the first type is the number Ma on the left side of the first threshold interval; wherein the second class is the number Mb to the right of the first threshold interval, where Ma + Mb = M; comparing the Ma and Mb sizes; when Ma is larger than Mb, judging that the gain value of the first gain is a positive number, and increasing the gain at the moment; and when the Ma is smaller than the Mb, judging that the gain value of the first gain is a negative number, and reducing the gain.

Further, the step of simultaneously performing signal gain on the sample points of the third audio signal specifically includes the following steps: dividing the number N of second subframes not in the second threshold interval into two types, wherein the first type is the number Na on the left side of the second threshold interval, and the second type is the number Nb on the right side of the second threshold interval, where Na + Nb = N; comparing the sizes of Na and Nb; when Na is larger than Nb, judging that the gain value of the second gain is a positive number, and increasing the gain; and when Na is smaller than Nb, judging that the gain value of the second gain is a negative number, and reducing the gain.

Further, the first threshold interval is located in a central interval of the volume interval; the difference between the values of the left end point of the first threshold interval and the left end point of the volume interval is equal to the difference between the values of the right end point of the first threshold interval and the right end point of the volume interval.

Further, the volume interval includes n consecutive subintervals, the first threshold interval is the kth subinterval, and the kth subinterval is the central interval of the volume interval. In this case, the step of accumulating the number M of subframes whose first parameter value is not within the first threshold interval further includes: counting the number Q of first parameter values falling in each of the n subintervals, wherein Q is less than or equal to M. The step of simultaneously performing signal gain on the sample points of the first audio signal when M is greater than the first statistic value then includes: when the number Q of at least one subinterval is larger than the corresponding third statistic value, simultaneously performing signal gain on the sample points of the first audio signal, wherein the third statistic value corresponding to each subinterval is negatively correlated with the distance from that subinterval to the first threshold interval.

In a second aspect, the present invention provides an audio signal processing apparatus comprising: the acquisition module is used for acquiring a current frame audio signal as a first audio signal; the first automatic gain processing module is used for carrying out first automatic gain processing on the first audio signal to obtain a second audio signal; wherein, the first automatic gain processing module specifically comprises: a dividing unit that divides the first audio signal into audio signals of a plurality of first sub-frames; the calculating unit is used for calculating to obtain a first parameter value according to the audio parameter of each first subframe; the statistical unit is used for comparing the size of the first parameter value with a first threshold interval and accumulating to obtain the number M of the subframes of which the first parameter value is not in the first threshold interval; and the judging unit is used for simultaneously carrying out signal gain on the sample points of the first audio signal when the M is larger than a first statistic value.

In a third aspect, the present invention provides a computer-readable storage medium storing computer instructions for causing a computer to execute the audio signal processing method.

The technical scheme of the invention has the following advantages:

The invention provides an audio signal processing method, an audio signal processing device and a storage medium, wherein a frame of audio signal is divided into a plurality of subframes, and the audio parameter of each subframe is judged against the threshold interval. The sample points of the frame of audio signal are then gained simultaneously, so that the dynamic range of the audio is kept, the natural fluctuation of its volume is preserved, and the voice quality can be improved.

Drawings

In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below, and it is obvious that the drawings in the following description are some embodiments of the present invention, and other drawings can be obtained by those skilled in the art without creative efforts.

Fig. 1 is a flowchart of an audio signal processing method provided according to an embodiment of the present invention;

Fig. 2 is a flowchart of an audio signal processing method provided according to an embodiment of the present invention;

Fig. 3 is a diagram illustrating a first threshold interval provided in accordance with an embodiment of the present invention;

Fig. 4 is a flowchart of an audio signal processing method provided according to an embodiment of the present invention;

Fig. 5 is a diagram illustrating a second threshold interval provided in accordance with an embodiment of the present invention;

Fig. 6 is a schematic diagram of an audio signal processing apparatus provided according to an embodiment of the present invention;

Fig. 7 is a schematic diagram of an audio signal processing apparatus provided according to an embodiment of the present invention;

Fig. 8 is a schematic diagram of an audio signal processing system provided in accordance with an embodiment of the invention;

Fig. 9 is a schematic diagram of an electronic device provided in accordance with an embodiment of the invention.

Detailed Description

In order to make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some, but not all, embodiments of the present invention. All other embodiments, which can be obtained by a person skilled in the art without making any creative effort based on the embodiments in the present invention, belong to the protection scope of the present invention.

In one embodiment, as shown in Fig. 1 and Fig. 2, the present invention provides an audio signal processing method that divides the first audio signal of the current frame into a plurality of subframes, compares the audio parameter of each subframe against a threshold interval, and, when signal gain is performed, applies the changed gain to the sample points of the current frame simultaneously, so that the dynamic, continuous character of the audio signal is taken into account during processing. The audio signal processing method includes the following steps S1 to S5.

S1, acquiring a current frame audio signal as a first audio signal.

The first audio signal contains a plurality of sample points, the number of which is set according to actual needs. In one embodiment, a frame of audio at a 48 kHz sample rate contains 512 sample points; in other embodiments it may contain 480 sample points.

S2, carrying out first automatic gain processing on the first audio signal to obtain a second audio signal.

This process is used to eliminate the problem of popping before the audio signal is input. The step of performing the first automatic gain processing on the first audio signal to obtain the second audio signal specifically includes S201 to S204.

S201, the first audio signal is divided into audio signals of a plurality of first subframes.

In the present embodiment, the slicing criterion is equal slicing, for example, equal slicing is performed according to the total duration of the current frame audio signal, i.e. the first audio signal, into a plurality of first subframes with the same duration. For example, in one embodiment, the first audio signal is equally divided into audio signals of 16 first sub-frames, and 32 sample points are taken per sub-frame. Other ways of segmenting are included in other embodiments.
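The equal slicing described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `split_into_subframes` is hypothetical, and the frame is modelled as a plain list of amplitude values using the 512-sample, 16-subframe figures from the embodiment.

```python
def split_into_subframes(frame, num_subframes=16):
    """Split one frame into equal-length subframes (equal slicing)."""
    if len(frame) % num_subframes != 0:
        raise ValueError("frame length must be a multiple of num_subframes")
    size = len(frame) // num_subframes
    return [frame[i * size:(i + 1) * size] for i in range(num_subframes)]


# One 512-sample frame (all zeros here, purely as placeholder data).
frame = [0.0] * 512
subframes = split_into_subframes(frame)
```

With these figures each of the 16 first subframes holds 32 sample points, matching the embodiment.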

S202, a first parameter value is obtained through calculation according to the audio parameter of each first subframe.

The audio parameters include the mean energy and/or the envelope peak. If the audio parameter is the mean energy (RMS), it characterizes the amount of energy in the signal, and the first parameter value is calculated by the formula

$$\mathrm{RMS} = \sqrt{\frac{1}{k}\sum_{i=1}^{k} y_i^{2}}$$

where $y_i$ represents the audio amplitude value of the i-th sample point in the subframe, and k is the number of sample points in the subframe. If the audio parameter is the envelope peak, the first parameter value is the maximum amplitude value among the sample points of the subframe.
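As a rough sketch, the two audio parameters named above (mean energy as RMS, and envelope peak) could be computed per subframe as follows. The function names are hypothetical. The conversion to dBFS is an assumption added here because the threshold intervals in this description are expressed in dBfs; the source does not state how the parameter is mapped onto that scale, and full scale is assumed to be an amplitude of 1.0.

```python
import math


def subframe_rms(samples):
    """Root-mean-square amplitude of one subframe."""
    return math.sqrt(sum(y * y for y in samples) / len(samples))


def subframe_peak(samples):
    """Envelope peak: the largest absolute amplitude in the subframe."""
    return max(abs(y) for y in samples)


def to_dbfs(value, floor_db=-120.0):
    """Convert a linear amplitude to dBFS (assumption: full scale = 1.0).

    Values at or below zero are mapped to floor_db to avoid log10(0).
    """
    if value <= 0.0:
        return floor_db
    return max(floor_db, 20.0 * math.log10(value))
```

For instance, a subframe alternating between +0.5 and -0.5 has an RMS of 0.5, which is roughly -6 dBFS under the full-scale assumption above.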

S203, comparing the first parameter value with a first threshold interval, and accumulating the number M of first subframes whose first parameter value is not in the first threshold interval.

In an embodiment, the first threshold interval is located in a central interval of a volume interval, and the first threshold interval is a range of volume; the difference between the values of the left end point of the first threshold interval and the left end point of the volume interval is equal to the difference between the values of the right end point of the first threshold interval and the right end point of the volume interval. If the target speech level at the center point is set to -22 dBfs, the left end point of the first threshold interval may be obtained by subtracting one unit from -22 dBfs, and the right end point of the first threshold interval may be obtained by adding one unit to -22 dBfs.

Specifically, after the value range of the first threshold interval is determined, the first parameter value calculated for each first subframe is compared with the first threshold interval in turn. If the first parameter value of the current first subframe does not fall within the first threshold interval, the count is incremented by 1; if it does, the count is left unchanged. The first parameter value of the next first subframe is then obtained and the judgment is repeated. Counting stops either when the count M becomes greater than the first statistic value, or when the first parameter values of all first subframes of the first audio signal have been compared with the first threshold interval, which yields the count M.
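The counting procedure can be sketched as below. This is a minimal illustration, not the patented implementation: the function and parameter names are hypothetical, the parameter values are assumed to be already expressed in dBFS, and the threshold interval is treated as closed at both ends, which the description leaves open.

```python
def count_outside(values_db, low_db, high_db, first_statistic_value=None):
    """Count subframes whose parameter value lies outside [low_db, high_db].

    If first_statistic_value is given, counting stops early as soon as the
    count exceeds it, mirroring the early-stop condition in the description.
    """
    m = 0
    for value in values_db:
        if not (low_db <= value <= high_db):
            m += 1
            if first_statistic_value is not None and m > first_statistic_value:
                break
    return m


# Hypothetical per-subframe levels in dBFS.
levels = [-22.0, -30.0, -10.0, -21.0, -25.0]
m_full = count_outside(levels, -24.0, -20.0)
m_early = count_outside(levels, -24.0, -20.0, first_statistic_value=1)
```

Here three of the five levels fall outside the interval, while the early-stop variant returns 2 because it stops as soon as the count exceeds the limit of 1.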

In another embodiment, the audio parameters of the first subframes are calculated, and the volume interval corresponding to the audio parameters is divided into a plurality of consecutive subintervals. For example, the volume interval includes n consecutive subintervals, the first threshold interval is the kth subinterval, and the kth subinterval is the central interval of the volume interval. The size of each subinterval in the volume interval is two unit lengths, and the unit length can be set according to actual needs, for example 2 dBfs. The optimal voice level is used as the center point of the volume interval. For example, if the target voice level at the center point is set to -22 dBfs, the left end point of the first threshold interval is obtained by subtracting one unit from -22 dBfs and the right end point by adding one unit to -22 dBfs, that is, the first threshold interval is (-24 dBfs, -20 dBfs). The subinterval adjacent to the right end point of the first threshold interval is (-20 dBfs, -16 dBfs), the subinterval adjacent to the left end point of the first threshold interval is (-28 dBfs, -24 dBfs), and so on, which gives the range of each subinterval in the volume interval.

The values of the subintervals above are only an example. In other embodiments, the size of the first threshold interval may be set to two unit lengths while the size of every other subinterval is one unit length, that is, the first threshold interval is (-24 dBfs, -20 dBfs), the subinterval adjacent to the right end point of the first threshold interval is (-20 dBfs, -18 dBfs), the subinterval adjacent to the left end point of the first threshold interval is (-26 dBfs, -24 dBfs), and so on. The length of each subinterval can be set according to actual needs. In one embodiment, as shown in Fig. 3, the volume interval is divided into 7 subintervals, and the 4th subinterval is used as the first threshold interval.
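The equal-width layout from the first example (center -22 dBfs, unit length 2 dBfs, each subinterval two units wide) could be generated as in the sketch below. The function name `build_subintervals` and its defaults are hypothetical, and only the equal-width variant is covered.

```python
def build_subintervals(center_db=-22.0, width_db=4.0, n=7):
    """Return n equal-width (low, high) subintervals centred on center_db."""
    if n % 2 == 0:
        raise ValueError("n must be odd so that a central subinterval exists")
    left = center_db - width_db * n / 2.0
    return [(left + i * width_db, left + (i + 1) * width_db) for i in range(n)]


intervals = build_subintervals()
k = len(intervals) // 2          # zero-based index of the central subinterval
first_threshold_interval = intervals[k]
```

With n = 7 the central subinterval is the 4th one, (-24, -20), and its neighbours are (-28, -24) and (-20, -16), as in the example values given in the text.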

When the volume interval includes a plurality of sub-intervals, the step of accumulating the number M of the sub-frames of which the first parameter value is not in the first threshold interval includes: and counting the number Q of the first parameter values in the n subintervals respectively, wherein Q is less than or equal to M.

Specifically, the number of first parameter values corresponding to each subframe in the first audio signal falling in each subinterval is counted, and each subinterval corresponds to a count value.

In the embodiment of the invention, the audio parameter of each first subframe is used as the algorithm parameter and is compared against the first threshold interval, which covers a range of volume rather than a single value, to judge whether gain adjustment is needed.

S204, when M is larger than a first statistic value, simultaneously performing signal gain on the sample points of the first audio signal.

Specifically, the first statistic value is a preset value set according to the tolerable degree of deviation of the audio signal. The audio signal of a first subframe whose first parameter value falls within the first threshold interval is regarded as a standard audio signal, and the audio signal of a first subframe whose first parameter value falls outside the first threshold interval is regarded as a non-standard audio signal. When the number of first subframes in the current frame (the first audio signal) whose first parameter value falls outside the first threshold interval exceeds the first statistic value, the first audio signal contains a large proportion of non-standard audio signals, and gain adjustment needs to be performed on it so that it meets the audio quality requirement. Because the decision combines the statistics of a plurality of first subframes instead of adjusting the gain whenever a single first subframe fails to meet the optimal standard, frequent gain adjustment of the audio signal is avoided; the statistical value is also more representative, which ensures the accuracy of the gain adjustment.

In one embodiment, simultaneously signal-gaining sample points of the first audio signal when M is greater than the first statistical value comprises: dividing the number M of first subframes which are not in the first threshold interval into two types, wherein the first type is the number Ma of the subframes on the left side of the first threshold interval; wherein the second type is the number Mb on the right side of the first threshold interval, where Ma + Mb = M; comparing the magnitude of Ma with Mb; when Ma is larger than Mb, judging that the first gain value is a positive number, and increasing the gain at the moment; and when the Ma is smaller than the Mb, judging that the first gain value is a negative number, and reducing the gain.
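The Ma/Mb comparison can be sketched as a small helper that returns the direction of the gain change. The names are hypothetical and the values are assumed to be in dBFS. The description does not say what happens when Ma equals Mb, so returning 0 (no change) in that case is an assumption of this sketch.

```python
def gain_direction(values_db, low_db, high_db, first_statistic_value):
    """Return +1 to raise the gain, -1 to lower it, 0 to leave it unchanged."""
    ma = sum(1 for v in values_db if v < low_db)    # left of the interval: too quiet
    mb = sum(1 for v in values_db if v > high_db)   # right of the interval: too loud
    m = ma + mb
    if m <= first_statistic_value or ma == mb:      # tie handling is an assumption
        return 0
    return 1 if ma > mb else -1
```

For example, three quiet subframes and one loud subframe with a first statistic value of 2 give a direction of +1, so the gain is increased.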

In another embodiment, the volume interval includes n consecutive sub-intervals, and a third statistical value corresponding to each sub-interval is respectively set, so that when M is greater than the first statistical value, the step of performing signal gain on the sample points of the first audio signal simultaneously includes: when the number Q of at least one subinterval is larger than the corresponding third statistic numerical value, simultaneously performing signal gain on the sample points of the first audio signal; wherein the third count value is inversely related to the distance from the subinterval to the first threshold interval.

Specifically, different subintervals correspond to different volume ranges and differ from the standard volume by different amounts. The larger the difference from the standard volume, the greater the influence on audio quality, and therefore the smaller the tolerable count. If the first threshold interval is the interval of the standard volume, an audio signal falling into it needs no gain processing, so the third statistic value of the subinterval corresponding to the first threshold interval can be set to infinity. A subinterval close to the first threshold interval deviates less from the standard volume, so the third statistic value of a subinterval adjacent to the first threshold interval is set to a first preset value, for example 20. The third statistic value of a subinterval separated from the first threshold interval by one subinterval is set to a second preset value smaller than the first preset value, for example 10. The third statistic values of the remaining subintervals are set in the same way according to their distance from the first threshold interval.

In one embodiment, as shown in Fig. 3, the first threshold interval is the 4th subinterval [-23, -21]; the other subintervals include (-21, -19), (-19, -17), (-25, -23) and (-27, -25). The third statistic value corresponding to subintervals (-21, -19) and (-25, -23) is set to 20, and the third statistic value corresponding to subintervals (-19, -17) and (-27, -25) is set to 10. The first parameter value of each first subframe in the first audio signal is calculated, the subinterval into which each first parameter value falls is determined, and the number of times each subinterval is hit is accumulated. For example, if the first parameter values fall into subinterval (-21, -19) 12 times and into subinterval (-27, -25) 10 times, the count for (-27, -25) has reached that subinterval's third statistic value, so signal gain is performed on the sample points of the first audio signal simultaneously. That is, as soon as the count for any one subinterval reaches the third statistic value corresponding to that subinterval, signal gain is performed on the sample points of the first audio signal simultaneously.
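The per-subinterval check in the Fig. 3 example can be sketched as a lookup of limits keyed by subinterval. The dictionary `THIRD_STATISTIC_VALUES` and the function `needs_gain` are hypothetical names, and only the five subintervals listed in the text are included. The comparison uses "reaches" (greater than or equal), following the worked example; other passages say "larger than", so this choice is an assumption.

```python
import math

# Limits taken from the Fig. 3 example; the central interval never triggers.
THIRD_STATISTIC_VALUES = {
    (-27.0, -25.0): 10,
    (-25.0, -23.0): 20,
    (-23.0, -21.0): math.inf,   # first threshold interval
    (-21.0, -19.0): 20,
    (-19.0, -17.0): 10,
}


def needs_gain(counts):
    """True if any subinterval's count reaches its third statistic value."""
    return any(
        counts.get(interval, 0) >= limit
        for interval, limit in THIRD_STATISTIC_VALUES.items()
    )


# Counts from the worked example: 12 hits in (-21, -19), 10 hits in (-27, -25).
example_counts = {(-21.0, -19.0): 12, (-27.0, -25.0): 10}
trigger = needs_gain(example_counts)
```

In the worked example the 10 hits in (-27, -25) reach that subinterval's limit of 10, so the check triggers even though the 12 hits in (-21, -19) stay below 20.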

The first gain value can be obtained by looking up a preset gain table. For example, the gain value at the middle position of the gain table is 1; moving one position forward decreases the gain value by 0.2, and moving one position backward increases the gain value by 0.2. Suppose the gain value at the previous moment is at the middle position. If, at the current moment, Ma is greater than the first statistic value and Ma is greater than Mb, the gain is positive, and the new gain value, obtained by moving one position backward from the middle position of the gain table, is 1.2. If, at the current moment, Mb is greater than the first statistic value and Ma is far less than Mb, the gain is negative, and the new gain value, obtained by moving one position forward from the middle position of the gain table, is 0.8. The gain value of the first automatic gain is less than or equal to 1.

S3, carrying out echo cancellation processing on the second audio signal to obtain a third audio signal.
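The gain-table lookup could look like the sketch below. The table contents are hypothetical apart from the middle value of 1 and the step of 0.2 given in the text, and clamping the index at the ends of the table is an assumption, since the description does not say what happens there.

```python
# Hypothetical five-entry table: middle value 1, neighbouring entries differ by 0.2.
GAIN_TABLE = [0.6, 0.8, 1.0, 1.2, 1.4]
MIDDLE = len(GAIN_TABLE) // 2


def step_gain_index(index, direction):
    """Move one position in the table; direction is +1 (raise) or -1 (lower).

    The index is clamped to the table bounds (an assumption of this sketch).
    """
    return min(max(index + direction, 0), len(GAIN_TABLE) - 1)


raised = GAIN_TABLE[step_gain_index(MIDDLE, +1)]
lowered = GAIN_TABLE[step_gain_index(MIDDLE, -1)]
```

Starting from the middle position, one step up gives 1.2 and one step down gives 0.8, as in the example above.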

Performing echo cancellation processing on the second audio signal prevents a far-end participant in a teleconference from hearing an echo of his or her own voice. A telephone call or teleconference has a near end and a far end: the near end is the local location of the call, and the far end is the location of the other participants in the call. Each location has at least one microphone and one speaker. In other embodiments, other processing such as denoising may also be performed on the second audio signal in step S3.

S4, performing second automatic gain processing on the third audio signal to obtain a fourth audio signal.

And performing automatic gain processing on the third audio signal again to obtain a fourth audio signal, and outputting the fourth audio signal, that is, performing automatic gain processing again before the audio signal is output, so that the tone quality of the output audio signal can be improved.

The judging steps of the second automatic gain are the same as those of the first automatic gain, but the processing procedure is slightly different. As shown in Fig. 4, the second automatic gain processing includes the following steps S401 to S405.

S401, VAD detection is performed on the third audio signal, and if VAD detects speech, the process proceeds to step S402.

When VAD is performed on the third audio signal and no voice is detected, the second automatic gain processing is: performing signal gain on the third audio signal, where the gain value of the second automatic gain is the gain value of the previous frame of the third audio signal.
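The VAD branch can be sketched as a small gate around the gain update. This is illustrative only: the VAD itself is assumed to be supplied elsewhere as a boolean, and `second_agc_gain` and `compute_gain` are hypothetical names standing in for steps S402 to S405.

```python
def second_agc_gain(speech_detected, previous_gain, compute_gain):
    """Choose the gain for the current frame of the third audio signal.

    Without detected speech the previous frame's gain is reused, so noise
    alone never changes the gain. compute_gain is a callable standing in
    for the subframe statistics of steps S402 to S405.
    """
    if not speech_detected:
        return previous_gain
    return compute_gain()


held = second_agc_gain(False, 0.8, lambda: 1.2)     # no speech: keep 0.8
updated = second_agc_gain(True, 0.8, lambda: 1.2)   # speech: recompute
```

Passing the update step as a callable means the subframe statistics are only evaluated when speech is actually present.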

S402, the third audio signal is divided into audio signals of a plurality of second subframes.

S403, calculating a second parameter value according to the audio parameter of each second subframe. The second parameter value is calculated in the same way as the first parameter value in step S202.

S404, comparing the second parameter value with a second threshold interval, and accumulating the number N of second subframes whose second parameter value is not in the second threshold interval.

As shown in Fig. 5, in an embodiment, the second threshold interval is located in a central interval of a volume interval, and the second threshold interval is a range of volume; the difference between the values of the left end point of the second threshold interval and the left end point of the volume interval is equal to the difference between the values of the right end point of the second threshold interval and the right end point of the volume interval. If the target speech level at the center point is set to -22 dBfs, the left end point of the second threshold interval may be obtained by subtracting two units from -22 dBfs, and the right end point of the second threshold interval may be obtained by adding two units to -22 dBfs.

In another embodiment, the volume interval includes n consecutive subintervals, the second threshold interval is the kth subinterval, and the kth subinterval is the central interval of the volume interval. In this case, the step of accumulating the number N of subframes whose second parameter value is not within the second threshold interval further includes: counting the number q of second parameter values falling in each of the n subintervals, wherein q is less than or equal to N.

S405, when N is larger than a second statistic value, simultaneously performing signal gain on the sample points of the third audio signal.

Specifically, the step of simultaneously performing signal gain on the sample points of the third audio signal includes: dividing the number N of second subframes not in the second threshold interval into two types, wherein the first type is the number Na on the left side of the second threshold interval; the second type is the number Nb on the right side of the second threshold interval, where Na + Nb = N; comparing the sizes of Na and Nb; when Na is larger than Nb, judging that the gain value of the second gain is a positive number, and increasing the gain at the moment; and when the Na is smaller than the Nb, judging that the gain value of the second gain is negative, and reducing the gain.

In an embodiment, the volume interval includes n consecutive subintervals, and the number q of second parameter values falling in each of the n subintervals is counted. The step of simultaneously performing signal gain on the sample points of the third audio signal when N is greater than the second statistic value then includes: when the number q of at least one subinterval is larger than the corresponding fourth statistic value, simultaneously performing signal gain on the sample points of the third audio signal, wherein the fourth statistic value is negatively correlated with the distance from the subinterval to the second threshold interval.

In this embodiment, the manner of acquiring the range of each sub-interval and the manner of acquiring the fourth statistical number corresponding to each sub-interval are the same as the manner of acquiring the range of each sub-interval related to the first threshold interval and the manner of acquiring the third statistical number corresponding to each sub-interval in the above embodiment, and are not described herein again.

Specifically, as shown in fig. 5, that is, the second threshold interval is the (4) th sub-interval of [ -24, -20]; other subintervals are (-20, -18), (-18, -16), (-26, -24), (-28, -26), respectively; setting the fourth statistic value corresponding to the subintervals (-20, -18), (-26, -24) to be 15; the subintervals (-18, -16), (-28, -26) are set to correspond to a third statistical count of 5. Respectively calculating first parameter values corresponding to each first subframe in the first audio signal, counting the range of subintervals in which each first parameter value falls and accumulating the times of the subintervals, wherein if the times of the first subframes of the first audio signal which fall into the subintervals (-26, -24) are 15 times, the times of the first subframes which fall into the subintervals (-28, -26) are 4 times, and because the times of the first subframes which fall into the subintervals (-26, -24) reach the fourth times of the first statistics corresponding to the subintervals, simultaneously performing signal gain on sample points of the third audio signal. That is, as long as the number of times of statistics that fall into one of the subintervals reaches the fourth number of times of statistics corresponding to the subintervals, the signal gain is performed on the sample points of the third audio signal at the same time.

The second gain calculation value may be obtained by looking up a preset gain table. For example, suppose the gain value at the middle position of the gain table is 1, moving one position forward decreases the gain value by 0.2, and moving one position backward increases the gain value by 0.2. If the position at the previous moment was the middle position, and at the current moment Na is greater than the first statistic value and Na is greater than Nb, the gain is positive, and the new gain value, obtained by moving one position backward from the middle position of the gain table, is 1.2; if at the current moment Nb is greater than the first statistic value and Na is less than Nb, the gain is negative, and the new gain value, obtained by moving one position forward from the middle position of the gain table, is 0.8.
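The table lookup can be illustrated with a short sketch. The table length, the clamping at the table ends, and the names `GAIN_TABLE`, `MIDDLE_INDEX` and `next_gain_index` are assumptions made for the example; only the middle value 1, the step of 0.2, and the resulting values 1.2 and 0.8 come from the text. Here "backward" in the text corresponds to a higher index and "forward" to a lower index.

```python
# Hypothetical gain table: the middle entry is 1.0 and neighbours differ by 0.2.
GAIN_TABLE = [0.2, 0.4, 0.6, 0.8, 1.0, 1.2, 1.4, 1.6, 1.8]
MIDDLE_INDEX = len(GAIN_TABLE) // 2


def next_gain_index(index, na, nb, stat_value):
    """Move one position in the gain table according to the counts na and nb.

    na: number of subframes to the left of the threshold interval (quieter).
    nb: number of subframes to the right of the threshold interval (louder).
    """
    if na > stat_value and na > nb:
        index += 1  # positive gain: one position backward, gain rises by 0.2
    elif nb > stat_value and na < nb:
        index -= 1  # negative gain: one position forward, gain falls by 0.2
    # Assumption: the position is clamped to the ends of the table.
    return max(0, min(index, len(GAIN_TABLE) - 1))


quiet_frame_gain = GAIN_TABLE[next_gain_index(MIDDLE_INDEX, na=12, nb=3, stat_value=10)]
loud_frame_gain = GAIN_TABLE[next_gain_index(MIDDLE_INDEX, na=3, nb=12, stat_value=10)]
print(quiet_frame_gain, loud_frame_gain)
```

Stepping through a table one position per frame bounds how quickly the gain can change, which is one plausible reason for using a lookup rather than computing the gain directly.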

And S5, outputting the fourth audio signal.

In one embodiment, if the first automatic gain processing step does not include VAD detection, the second automatic gain processing step includes VAD detection. Because noisy conditions must be taken into account, the gain should remain constant in the presence of stationary or non-stationary noise. Voice Activity Detection (VAD) is also called voice endpoint detection or voice boundary detection. Its aim is to identify and eliminate long silent periods from the voice signal stream, so as to save speech path resources without reducing the quality of service.
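The role of VAD in the second automatic gain processing can be sketched as a simple gate: when no voice is detected, the gain of the previous frame is reused, so noise does not move the gain. The function name `second_agc_frame`, the `compute_gain` callback, and the treatment of the gain as a linear multiplier are assumptions for illustration; the VAD decision itself is taken as an input and is not implemented here.

```python
def second_agc_frame(frame, voice_detected, previous_gain, compute_gain):
    """Apply one gain value to every sample of a frame.

    frame: list of float samples of the third audio signal.
    voice_detected: VAD decision for this frame (supplied by the caller).
    previous_gain: gain value used for the previous frame.
    compute_gain: hypothetical callback that derives a new gain from the frame.
    Returns the processed frame and the gain that was applied.
    """
    if voice_detected:
        gain = compute_gain(frame)
    else:
        # No voice: hold the gain of the previous frame unchanged.
        gain = previous_gain
    return [sample * gain for sample in frame], gain


# Usage: a noise-only frame keeps the previous gain of 0.5.
out, gain = second_agc_frame([1.0, -2.0], False, 0.5, lambda f: 2.0)
print(out, gain)
```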

The invention provides an audio signal processing method in which a frame of audio signal is divided into a plurality of subframes and the audio parameter of each subframe is compared with the threshold interval.

In this embodiment, the first automatic gain processing is used only to attenuate a large-volume voice signal and should leave a small-volume or normal-volume voice signal unchanged, while the second automatic gain processing should handle both large and small volumes. The gain value of the second automatic gain processing therefore ranges over both positive and negative values: if the volume is large, the gain is negative; if the volume is small, the gain is positive.

In one embodiment, as shown in fig. 6, the present invention provides an audio signal processing apparatus including: an acquisition module 11, a first automatic gain module 12, a signal processing module 13, and a second automatic gain module 14.

The acquisition module 11 is configured to obtain a current frame audio signal as the first audio signal to be input.

The first automatic gain module 12 is configured to perform a first automatic gain processing on the first audio signal to obtain a second audio signal.

The signal processing module 13 is configured to perform echo cancellation processing on the second audio signal to obtain a third audio signal.

The second automatic gain module 14 is configured to perform a second automatic gain processing on the third audio signal to obtain a fourth audio signal.

In one embodiment, as shown in fig. 7, the first automatic gain module 12 specifically includes: a slicing unit 101, a calculating unit 102, a statistical unit 103, and a judging unit 104.

The slicing unit 101 is configured to slice the first audio signal into audio signals of a plurality of first subframes.

The calculating unit 102 is configured to calculate a first parameter value according to the audio parameter of each first subframe.

The statistical unit 103 is configured to compare the first parameter value with a first threshold interval, and to accumulate the number M of subframes whose first parameter value is not within the first threshold interval.

The judging unit 104 is configured to simultaneously perform signal gain on the sample points of the first audio signal when M is greater than a first statistic value.
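The four units can be sketched as four small functions. This is an illustrative sketch under several assumptions: samples are floats in the range [-1, 1], the audio parameter is the mean energy expressed in dB, the gain is a linear multiplier supplied by the caller, and the frame contains at least `num_subframes` samples. All function names are hypothetical.

```python
import math


def slice_frame(frame, num_subframes):
    """Slicing unit: split a frame into equally sized subframes.

    Assumption: trailing samples that do not fill a subframe are left out
    of the statistics (they still receive the gain).
    """
    size = len(frame) // num_subframes
    return [frame[i * size:(i + 1) * size] for i in range(num_subframes)]


def mean_energy_db(subframe):
    """Calculating unit: mean energy of a subframe in dB."""
    energy = sum(s * s for s in subframe) / len(subframe)
    return 10.0 * math.log10(energy + 1e-12)  # small offset avoids log10(0)


def count_outside(values, interval):
    """Statistical unit: number M of values outside the threshold interval."""
    low, high = interval
    return sum(1 for v in values if v < low or v > high)


def first_agc(frame, num_subframes, interval, first_stat_value, gain):
    """Judging unit: apply one gain to all samples when M exceeds the statistic value."""
    values = [mean_energy_db(sf) for sf in slice_frame(frame, num_subframes)]
    m = count_outside(values, interval)
    if m > first_stat_value:
        return [s * gain for s in frame]
    return list(frame)


# Usage: a loud frame (about -6 dB) is attenuated as a whole.
print(first_agc([0.5] * 8, 4, (-24.0, -20.0), 2, 0.5))
```

Because a single gain multiplies every sample of the frame, the relative levels between the subframes are unchanged, which is how the dynamic range within the frame is kept.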

The invention provides an audio signal processing device which, through an automatic gain module, divides a frame of audio signal into a plurality of subframes and compares the audio parameters of the subframes with a threshold interval. The gain is then applied simultaneously to the sample points of the whole frame of audio signal, so that the dynamic range of the audio and the sense of fluctuation in its volume are maintained, and the voice quality can be improved.

In one embodiment, as shown in fig. 8, the present invention provides an audio signal processing system 200 comprising: an input device 201, a processor 202, and an output device 203.

The input device 201 is connected with the processor 202, and the output device 203 is connected with the processor 202. The input device 201 comprises a microphone and the output device 203 comprises a speaker.

The processor 202 may call program instructions to implement the audio signal processing method as shown in the embodiments of fig. 1 and fig. 2 of the present application.

The specific method of using the audio signal processing system 200 comprises the following steps:

the input device 201 obtains the current frame audio signal as a first audio signal and inputs the first audio signal to the processor 202;

the processor 202 performs a first automatic gain processing on the first audio signal to obtain a second audio signal;

the processor 202 performs echo cancellation processing on the second audio signal to obtain a third audio signal;

the processor 202 performs a second automatic gain processing on the third audio signal to obtain a fourth audio signal, and sends the fourth audio signal to the output device 203;

the output device 203 outputs the fourth audio signal.

The steps of the processor 202 performing the first automatic gain processing on the first audio signal to obtain the second audio signal are steps S201 to S204 in the above embodiment.
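The order of operations performed by the processor 202 can be summarised in a short sketch. The three stages are passed in as callables because their internals are described elsewhere; the name `process_frame` and the stand-in lambdas in the usage example are hypothetical and do not represent real gain or echo cancellation algorithms.

```python
def process_frame(first_audio, first_agc, cancel_echo, second_agc):
    """Run one frame through the chain described above.

    first_agc, cancel_echo and second_agc are caller-supplied callables that
    each take a list of samples and return a list of samples.
    """
    second_audio = first_agc(first_audio)     # first automatic gain processing
    third_audio = cancel_echo(second_audio)   # echo cancellation
    fourth_audio = second_agc(third_audio)    # second automatic gain processing
    return fourth_audio


# Usage with trivial stand-in stages (illustrative only).
result = process_frame(
    [1.0, 2.0],
    first_agc=lambda f: [s * 0.5 for s in f],
    cancel_echo=lambda f: [s - 0.25 for s in f],
    second_agc=lambda f: [s * 2 for s in f],
)
print(result)
```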

The invention provides an audio signal processing system 200 which performs the first automatic gain processing on the first audio signal acquired by the input device in order to prevent popping: a low gain is applied to loud sound, and quiet sound is left unchanged. The second automatic gain processing, performed on the third audio signal obtained after echo cancellation, boosts low volume and attenuates high volume, so that the voice quality can be improved.

In both the first and the second automatic gain processing, one frame of audio signal is divided into a plurality of subframes, and the audio parameters of the subframes are compared with the threshold interval. The gain is then applied to all sample points of the frame simultaneously, so that the dynamic range of the audio and the sense of fluctuation in its volume are maintained.

In an embodiment, please refer to fig. 9, which is a schematic structural diagram of an electronic device according to an alternative embodiment of the present invention. The electronic device may include: at least one processor 51, such as a CPU (Central Processing Unit), at least one communication interface 53, a memory 54, and at least one communication bus 52. The communication bus 52 is used to enable connection and communication between these components. The communication interface 53 may include a display and a keyboard; optionally, the communication interface 53 may also include a standard wired interface and a standard wireless interface. The memory 54 may be a high-speed RAM (a volatile random access memory) or a non-volatile memory, such as at least one disk memory. Alternatively, the memory 54 may be at least one storage device located remotely from the processor 51. The processor 51 may be combined with the apparatus described in fig. 6 and fig. 7; the memory 54 stores an application program, and the processor 51 calls the program code stored in the memory 54 to execute the steps of the audio signal processing method according to any of the above embodiments.

The communication bus 52 may be a Peripheral Component Interconnect (PCI) bus, an Extended Industry Standard Architecture (EISA) bus, or the like. The communication bus 52 may be divided into an address bus, a data bus, a control bus, and the like. For ease of illustration, only one thick line is shown, but this does not mean that there is only one bus or one type of bus.

The memory 54 may include a volatile memory, such as a random-access memory (RAM); the memory may also include a non-volatile memory, such as a flash memory, a hard disk drive (HDD) or a solid-state drive (SSD); the memory 54 may also comprise a combination of the above types of memory.

The processor 51 may be a Central Processing Unit (CPU), a Network Processor (NP), or a combination of a CPU and an NP.

The processor 51 may further include a hardware chip. The hardware chip may be an application-specific integrated circuit (ASIC), a programmable logic device (PLD), or a combination thereof. The PLD may be a complex programmable logic device (CPLD), a field-programmable gate array (FPGA), a generic array logic (GAL), or any combination thereof.

Optionally, the memory 54 is also used to store program instructions. The processor 51 may call program instructions to implement an audio signal processing method as shown in any of the embodiments of the present application.

Embodiments of the present invention further provide a non-transitory computer storage medium storing computer-executable instructions which, when executed, perform the audio signal processing method in any of the above method embodiments. The storage medium may be a magnetic disk, an optical disk, a read-only memory (ROM), a random access memory (RAM), a flash memory, a hard disk drive (HDD), a solid-state drive (SSD), or the like; the storage medium may also comprise a combination of memories of the kinds described above.

Although the embodiments of the present invention have been described in conjunction with the accompanying drawings, those skilled in the art may make various modifications and variations without departing from the spirit and scope of the invention, and such modifications and variations fall within the scope defined by the appended claims.

Claims

1. An audio signal processing method, comprising the steps of:

acquiring a current frame audio signal as a first audio signal;

carrying out first automatic gain processing on the first audio signal to obtain a second audio signal;

the step of performing a first automatic gain process on the first audio signal to obtain a second audio signal specifically includes:

dividing the first audio signal into audio signals of a plurality of first sub-frames;

calculating to obtain a first parameter value according to the audio parameter of each first subframe; wherein the audio parameters include: mean energy, and/or envelope peak;

comparing the size of the first parameter value with a first threshold interval, and accumulating to obtain the number M of the subframes of which the first parameter value is not in the first threshold interval;

and when M is larger than a first statistic numerical value, simultaneously performing signal gain on the sample points of the first audio signal.

2. The audio signal processing method of claim 1, further comprising, after the step of performing the first automatic gain processing on the first audio signal to obtain the second audio signal:

carrying out echo cancellation processing on the second audio signal to obtain a third audio signal;

performing second automatic gain processing on the third audio signal to obtain a fourth audio signal; and

outputting the fourth audio signal.

3. The audio signal processing method of claim 2, wherein when the VAD detects the third audio signal, if the VAD does not detect the voice, the second automatic gain processing is: simultaneously performing signal gain on the sample points of the third audio signal, wherein the gain value of the second automatic gain is the gain value of a previous frame of the third audio signal.

4. The audio signal processing method of claim 2, wherein when the VAD detects the third audio signal, if the VAD detects voice, the second automatic gain processing specifically includes the following steps:

slicing the third audio signal into audio signals of a plurality of second sub-frames;

calculating to obtain a second parameter value according to the audio parameter of each second subframe;

comparing the size of the second parameter value with a second threshold interval, and accumulating to obtain the number N of the subframes of which the second parameter value is not in the second threshold interval;

judging whether N is larger than a second statistic numerical value; and if so, simultaneously performing signal gain on the sample points of the third audio signal.

5. The audio signal processing method according to claim 2,

in the step of performing signal gain on the sample points of the first audio signal simultaneously, the method specifically includes the following steps:

dividing the number M of first subframes which are not in the first threshold interval into two types, wherein the first type is the number Ma on the left side of the first threshold interval; wherein the second class is the number Mb to the right of the first threshold interval, where Ma + Mb = M;

comparing the Ma and Mb sizes; when Ma is larger than Mb, judging that the gain value of the first automatic gain is a positive number, and increasing the gain at the moment; and when the Ma is smaller than the Mb, judging that the gain value of the first automatic gain is negative, and reducing the gain.

6. The audio signal processing method according to claim 4,

in the step of performing signal gain on the sample points of the third audio signal simultaneously, the method specifically includes the following steps:

dividing the number N of second subframes not in the second threshold interval into two types, wherein the first type is the number Na on the left side of the second threshold interval; the second type is the number Nb on the right side of the second threshold interval, where Na + Nb = N;

comparing the sizes of Na and Nb; when Na is larger than Nb, judging that the gain value of the second automatic gain is a positive number, and increasing the gain at the moment; and when Na is smaller than Nb, judging that the gain value of the second automatic gain is negative, and reducing the gain at the moment.

7. The audio signal processing method according to claim 1, wherein the first threshold interval is located in a center interval of a volume interval; the difference between the values of the left end point of the first threshold interval and the left end point of the volume interval is equal to the difference between the values of the right end point of the first threshold interval and the right end point of the volume interval.

8. The audio signal processing method of claim 7, wherein the volume interval comprises n consecutive sub-intervals, the first threshold interval is a k-th sub-interval, and the k-th sub-interval is a central interval of the volume interval; then, in the step of cumulatively obtaining the number M of the subframes of which the first parameter value is not within the first threshold interval, the method further includes:

counting the number Q of the first parameter values in the n subintervals respectively, wherein Q is less than or equal to M;

when M is greater than a first statistical count value, the step of performing signal gain on the sample points of the first audio signal simultaneously includes: when the number Q of at least one subinterval is larger than the corresponding third statistic numerical value, simultaneously performing signal gain on the sample points of the first audio signal; and the third statistical number value corresponding to each subinterval is in negative correlation with the distance from the subinterval to the first threshold interval.

9. An audio signal processing apparatus, comprising:

the acquisition module is used for acquiring a current frame audio signal as a first audio signal;

the first automatic gain processing module is used for carrying out first automatic gain processing on the first audio signal to obtain a second audio signal;

wherein, the first automatic gain processing module specifically comprises:

a slicing unit for slicing the first audio signal into audio signals of a plurality of first subframes;

the calculating unit is used for calculating to obtain a first parameter value according to the audio parameter of each first subframe; wherein the audio parameters include: mean energy, and/or envelope peak;

the statistical unit is used for comparing the size of the first parameter value with a first threshold interval and accumulating to obtain the number M of the subframes of which the first parameter value is not in the first threshold interval;

and the judging unit is used for simultaneously carrying out signal gain on the sample points of the first audio signal when the M is larger than a first statistic numerical value.

10. A computer-readable storage medium storing computer instructions for causing a computer to execute the audio signal processing method according to any one of claims 1 to 8.