CN113488076A

CN113488076A - Audio signal processing method and device

Info

Publication number: CN113488076A
Application number: CN202110736375.4A
Authority: CN
Inventors: 操陈斌
Original assignee: Beijing Xiaomi Mobile Software Co Ltd; Beijing Xiaomi Pinecone Electronic Co Ltd
Current assignee: Beijing Xiaomi Mobile Software Co Ltd; Beijing Xiaomi Pinecone Electronic Co Ltd
Priority date: 2021-06-30
Filing date: 2021-06-30
Publication date: 2021-10-08

Abstract

The present disclosure relates to the field of voice communication technologies, and in particular to an audio signal processing method and apparatus. The audio signal processing method comprises: acquiring an audio signal picked up by a microphone; and detecting the audio signal and, in response to detecting a target test signal in the audio signal, switching the processing operation on the audio signal from a first processing operation to a second processing operation. The method improves the reliability and the test results of acoustic test scenarios.

Description

Audio signal processing method and device

Technical Field

The present disclosure relates to the field of voice communication technologies, and in particular, to an audio signal processing method and apparatus.

Background

The 3rd Generation Partnership Project (3GPP) is the most important global communication standardization organization and is responsible for formulating end-to-end system specifications for wireless communication as a whole. Its acoustic characteristics standards for telephone terminals specify the performance requirements and test methods for the acoustic characteristics of narrowband, wideband, ultra-wideband, and full-band phones.

In the related art, the acoustic test effect of the voice communication system of the terminal is poor due to the difference between the test scene and the actual application scene.

Disclosure of Invention

In order to improve the acoustic test effect and the voice communication quality of a voice communication system, the disclosed embodiments provide an audio signal processing method, an apparatus, an electronic device, and a storage medium.

In a first aspect, the disclosed embodiments provide an audio signal processing method, including:

acquiring an audio signal picked up by a microphone;

detecting the audio signal, and switching the processing operation on the audio signal from a first processing operation to a second processing operation in response to detecting a target test signal in the audio signal.

In some embodiments, the switching the processing operation on the audio signal from a first processing operation to a second processing operation comprises:

switching the noise reduction processing operation on the audio signal from an on state to an off state;

and/or

switching a suppression parameter of an echo suppression operation on the audio signal from a first parameter to a second parameter.

In some embodiments, the audio signal comprises multiple frames of sub-signals that are consecutive in the time domain; detecting a target test signal from the audio signal comprises:

acquiring a peak-to-valley ratio characteristic value of a current frame sub-signal;

and determining that the target test signal is detected in response to the peak-to-valley ratio characteristic value not being less than a first preset threshold value.

In some embodiments, the obtaining the peak-to-valley ratio feature value of the current frame sub-signal includes:

acquiring a power spectrum of an analysis frame signal; the analysis frame signal comprises the current frame sub-signal and continuous sub-signals of a preset number of frames before the current frame sub-signal;

determining the peak-to-trough ratio of each peak and trough in the analysis frame signal according to the power spectrum;

and determining the peak-to-valley ratio characteristic value of the current frame sub-signal based on each peak-to-valley ratio value.

In some embodiments, said determining a peak-to-valley ratio of each peak and valley in said analysis frame signal from said power spectrum comprises:

for any peak, determining a first energy sum of the peak, a second energy sum of a former trough adjacent to the peak and a third energy sum of a latter trough adjacent to the peak according to the power spectrum;

determining the peak-to-valley ratio based on the first energy sum, the second energy sum, and the third energy sum.

In some embodiments, said determining said peak-to-valley ratio characteristic value of said current frame subsignal based on respective peak-to-valley ratio values comprises:

in response to a peak-to-trough ratio being not less than a second preset threshold, determining that the peak detection result corresponding to that peak-to-trough ratio is a first numerical value; in response to a peak-to-trough ratio being less than the second preset threshold, determining that the peak detection result corresponding to that peak-to-trough ratio is a second numerical value;

and determining the numerical sum of all peak detection results in the analysis frame signal, and determining the numerical sum as the peak-to-valley ratio characteristic value of the current frame sub-signal.

In some embodiments, after the switching the processing operation on the audio signal from the first processing operation to the second processing operation, further comprising:

and switching the processing operation of the audio signal from the second processing operation to the first processing operation in response to the duration of the second processing operation being greater than the preset duration threshold.

In a second aspect, the present disclosure provides an audio signal processing apparatus, including:

an acquisition module configured to acquire an audio signal picked up by a microphone;

a detection module configured to detect the audio signal and switch a processing operation of the audio signal from a first processing operation to a second processing operation in response to detection of a target test signal from the audio signal.

In some embodiments, the detection module comprises:

a switching sub-module configured to switch a noise reduction processing operation on the audio signal from an on state to an off state; and/or switching a suppression parameter of an echo suppression operation on the audio signal from a first parameter to a second parameter.

In some embodiments, the audio signal comprises multiple frames of sub-signals that are consecutive in the time domain; the detection module comprises:

an obtaining sub-module configured to obtain a peak-to-valley ratio feature value of a current frame sub-signal;

a determination sub-module configured to determine that the target test signal is detected in response to the peak-to-valley ratio characteristic value not being less than a first preset threshold.

In some embodiments, the acquisition submodule is specifically configured to:

In a third aspect, the disclosed embodiments provide an electronic device, including:

a processor; and

a memory storing computer instructions for causing a processor to perform the method according to any of the embodiments of the first aspect.

In a fourth aspect, the embodiments of the present disclosure provide a storage medium storing computer instructions for causing a computer to execute the method according to any one of the embodiments of the first aspect.

The audio signal processing method of the disclosed embodiments includes acquiring an audio signal picked up by a microphone, detecting the audio signal, and switching the processing operation on the audio signal from a first processing operation to a second processing operation in response to detecting a target test signal in the audio signal. By detecting the target test signal, the method determines whether the system is currently in an actual use scenario or a test scenario and switches between different processing operations accordingly, meeting the differing requirements of the two scenarios. Because a dedicated processing operation is set for the acoustic test scenario, the reliability and the test results in that scenario can be improved.

Drawings

In order to more clearly illustrate the embodiments of the present disclosure or the technical solutions in the prior art, the drawings needed to be used in the description of the embodiments or the prior art will be briefly described below, and it is obvious that the drawings in the following description are some embodiments of the present disclosure, and other drawings can be obtained by those skilled in the art without creative efforts.

Fig. 1 is a flow chart of an audio signal processing method in some embodiments according to the present disclosure.

Fig. 2 is a schematic diagram of an acoustic test scenario in accordance with some embodiments of the present disclosure.

Fig. 3 is a graph of power spectra of multitone signals in some embodiments according to the present disclosure.

Fig. 4 is a flow chart of an audio signal processing method in some embodiments according to the present disclosure.

Fig. 5 is a flow chart of an audio signal processing method in some embodiments according to the present disclosure.

Fig. 6 is a flow chart of an audio signal processing method in some embodiments according to the present disclosure.

Fig. 7 is a block diagram of an audio signal processing apparatus according to some embodiments of the present disclosure.

Fig. 8 is a block diagram of an audio signal processing apparatus according to some embodiments of the present disclosure.

Fig. 9 is a block diagram of an electronic device suitable for implementing the audio signal processing method of the present disclosure.

Detailed Description

The technical solutions of the present disclosure will be described clearly and completely with reference to the accompanying drawings, and it is to be understood that the described embodiments are only some embodiments of the present disclosure, but not all embodiments. All other embodiments, which can be derived by one of ordinary skill in the art from the embodiments disclosed herein without making any creative effort, shall fall within the protection scope of the present disclosure. In addition, technical features involved in different embodiments of the present disclosure described below may be combined with each other as long as they do not conflict with each other.

The acoustic characteristic test of a telephone terminal specified by 3GPP (3rd Generation Partnership Project) includes: loudness evaluation values, sensitivity/frequency characteristics, distortion, TMOS, acoustic echo control, two-way call performance, 3Quest testing, etc. Meanwhile, the acoustic tests define test environments, test equipment, test configurations and the like of different use scenes.

The audio signal for acoustic testing consists essentially of two parts: a speech/noise signal and an artificially synthesized test signal, where the test signal is typically placed before the speech/noise signal as a preamble. In the related art, some test methods, such as certain distortion tests, use a single-frequency tone generated from a sine wave as the test signal; other test methods, such as delay tests and acoustic echo control tests, combine single tones at multiple frequencies into a multi-tone signal as the test signal.

For a voice communication system, the voice enhancement algorithm often includes the following parts: an echo cancellation algorithm, a noise estimation algorithm, a noise cancellation algorithm, a residual echo/noise suppression algorithm, and so on. In a test scenario, because the test signal in the audio signal is a noise-like signal, it may be cancelled or suppressed by the system's voice enhancement algorithm, so that the acoustic test system cannot accurately detect the test signal, causing the test to fail or the test result to degrade. In addition, taking a mobile phone voice communication system as an example, in a normal scenario the strength and parameter settings of residual echo suppression are conservative in order to ensure that near-end voice is not distorted, so the test result of the mobile phone voice communication system is poor in an acoustic test scenario.

Based on the above, in the voice communication system in the related art, there is a difference in the requirements between the acoustic test scenario and the actual application scenario, which results in poor reliability and test result of the system in the acoustic test scenario.

Based on the above-mentioned drawbacks in the related art, the embodiments of the present disclosure provide an audio signal processing method, an audio signal processing apparatus, an electronic device, and a storage medium, and aim to improve the reliability and the test result of a sound test for an acoustic test scenario.

In a first aspect, the embodiments of the present disclosure provide an audio signal processing method, which can be applied to any electronic device having a voice communication system and executed by a processor of the electronic device, such as a smart phone, a tablet computer, a notebook computer, and the like, and the present disclosure is not limited thereto.

As shown in fig. 1, in some embodiments, an audio signal processing method of an example of the present disclosure includes:

S110, acquiring an audio signal picked up by a microphone.

S120, detecting the audio signal, and switching the processing operation on the audio signal from the first processing operation to the second processing operation in response to detecting the target test signal in the audio signal.

Fig. 2 shows a schematic diagram of the acoustic testing of an electronic device. As shown in fig. 2, the electronic device 20 is placed in an enclosure of an acoustic testing system, such as an acoustic testing cabinet. A plurality of loudspeakers 11 at different spatial positions are arranged in the test enclosure, each loudspeaker 11 is connected to a sound simulation device 10, and the sound simulation device 10 can generate various simulated audio signals and play them through each loudspeaker 11. The electronic device 20 is connected to the acoustic measurement apparatus 30. The microphone 21 of the electronic device 20 picks up the mixed audio signal, and after the mixed audio signal is processed by the voice communication system of the electronic device 20, the acoustic measurement apparatus 30 obtains an acoustic test result from the signal processed by the electronic device 20.

As can be seen from the test scenario shown in fig. 2, the microphone 21 of the electronic device 20 can pick up the audio signal simulated by the test system. In a test scenario, the audio signal picked up by the microphone 21 mainly comprises two parts: one is the speech/noise signal comprising near-end speech and background noise, and the other is the target test signal preceding the speech/noise signal.

The electronic device 20 may perform signal detection on the audio signal after picking up the audio signal, and in case a target test signal is detected from the audio signal, it indicates that the electronic device is currently in an acoustic test scenario, so that the electronic device switches the processing operation on the audio signal from the first processing operation to the second processing operation.

Based on the foregoing, there is a differentiated need for noise reduction and echo cancellation in a voice communication system of an electronic device in an actual use scenario and an acoustic test scenario. For example, in a test scenario, in order to prevent a target test signal from being eliminated by a noise reduction algorithm, noise reduction processing is not required to be performed on an audio signal; in an actual use scenario, in order to ensure communication quality, noise reduction processing needs to be performed on an audio signal. For another example, in a test scenario, in order to improve the test result of echo cancellation, deep cancellation processing needs to be performed on echoes; in an actual use scenario, in order to avoid distortion of a near-end speech signal, the parameter setting for echo cancellation is relatively conservative.

Thus, in the embodiments of the present disclosure, the first processing operation refers to the noise reduction and/or echo cancellation operation performed by the voice communication system on the audio signal in a normal usage scenario, for example, setting the noise reduction processing operation on the audio signal to the on state, and/or setting the suppression parameter of the echo suppression operation on the audio signal to the first parameter. The second processing operation refers to the noise reduction and/or echo cancellation operation performed by the voice communication system on the audio signal in the test scenario, for example, switching the noise reduction processing operation on the audio signal to the off state, and/or switching the suppression parameter of the echo suppression operation on the audio signal to the second parameter. It is to be understood that the first parameter and the second parameter are parameters of an echo suppression algorithm that suppress the residual echo to different degrees. The current scenario of the electronic device is determined by detecting the target test signal in the audio signal, and the processing operation on the audio signal is then switched accordingly.

The process of detecting a target test signal in an audio signal is specifically described in the following embodiments of the present disclosure, and will not be described in detail here.

Therefore, the audio signal processing method in the embodiments of the disclosure determines, by detecting the target test signal, whether the system is currently in an actual use scenario or a test scenario, and switches between different processing operations accordingly, meeting the differing requirements of the two scenarios. Because a dedicated processing operation is set for the acoustic test scenario, the reliability and the test results in that scenario can be improved.
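The selection between the two processing operations can be pictured with a minimal sketch. The class name `ProcessingConfig`, the function `select_operation`, and the numeric suppression values are hypothetical placeholders; the disclosure does not give concrete parameter values, and a real echo suppressor would likely use a set of parameters rather than a single number.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ProcessingConfig:
    """Hypothetical container for the settings the text says are switched."""
    noise_reduction_on: bool
    echo_suppression_param: float  # illustrative scalar, not a value from the source


# Placeholder values only: the source does not specify the two parameters.
FIRST_OPERATION = ProcessingConfig(noise_reduction_on=True, echo_suppression_param=0.5)
SECOND_OPERATION = ProcessingConfig(noise_reduction_on=False, echo_suppression_param=1.0)


def select_operation(test_signal_detected: bool) -> ProcessingConfig:
    """Return the second operation in a test scenario, otherwise the first."""
    return SECOND_OPERATION if test_signal_detected else FIRST_OPERATION
```

Keeping both operations as immutable configurations makes the switch a single assignment, which is one simple way to ensure the noise reduction state and the suppression parameter always change together.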

It should be noted that in the acoustic test scenario, the target test signal generated by the sound simulation device 10 is typically a single-frequency tone signal or a multi-frequency tone signal synthesized from a plurality of single-frequency tones. The inventor found that both single-frequency and multi-frequency tone signals exhibit a pronounced peak-to-valley ratio on the power spectrum, that is, the ratio of each peak to its adjacent troughs is large.

For example, fig. 3 shows a power spectrum diagram of a multi-tone signal composed of a plurality of single tones, and it can be known from fig. 3 that each peak and valley has a distinct peak-to-valley ratio characteristic value. Thus, in some embodiments of the present disclosure, a target test signal may be determined based on a peak-to-valley ratio, as described in more detail below in connection with the embodiments.

It is understood that the audio signal picked up by the microphone is a continuous signal in the time domain, and for the convenience of audio signal processing, the continuous audio signal in the time domain can be divided into continuous multiframe sub-signals. In one example, taking the example of a vocoder in a smartphone, the 3GPP specified continuous speech signal may be divided into 20ms frames, each 20ms frame consisting of two 10ms subframes. Of course, those skilled in the art will appreciate that the audio signal may be divided into other millisecond frames according to specific needs, and the present disclosure is not limited thereto.
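As a minimal sketch of this framing step, the function below splits a sampled signal into consecutive, non-overlapping frames. The function name `split_into_frames` and the choice to drop an incomplete trailing frame are assumptions made for illustration.

```python
def split_into_frames(samples, sample_rate_hz, frame_ms=10):
    """Split a sequence of samples into consecutive, non-overlapping frames.

    Trailing samples that do not fill a whole frame are dropped.
    """
    frame_len = sample_rate_hz * frame_ms // 1000
    if frame_len <= 0:
        raise ValueError("frame length must be at least one sample")
    n_frames = len(samples) // frame_len
    return [list(samples[i * frame_len:(i + 1) * frame_len]) for i in range(n_frames)]
```

For example, at a sampling rate of 8000 Hz a 10 ms frame holds 80 samples, and at 16000 Hz it holds 160 samples.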

As shown in fig. 4, in some embodiments, in the audio signal processing method of the examples of the present disclosure, the process of detecting the target test signal may include:

S410, acquiring a peak-to-valley ratio characteristic value of the current frame sub-signal.

S420, determining that the target test signal is detected in response to the peak-to-valley ratio characteristic value being not less than a first preset threshold value.

Specifically, the peak-to-valley ratio feature value represents feature information of a peak-to-valley ratio in the current frame sub-signal. In one example, the peak-to-valley ratio feature value of the current frame sub-signal may be detected based on the current frame sub-signal.

In another example, the peak-to-valley ratio characteristic value of the current frame sub-signal may be detected based on the current frame sub-signal and a preset number of frame sub-signals before the current frame, which is described in the following embodiments of the present disclosure and is not detailed here.

The first preset threshold is a preset threshold indicating that the current frame sub-signal is a single-frequency tone or multi-frequency tone signal, and the first preset threshold may be obtained in advance according to priori knowledge or limited experiments, which is not limited by the present disclosure.

Therefore, if the peak-to-valley ratio characteristic value is not less than the first preset threshold value, the current frame sub-signal is a single-frequency tone or a multi-frequency tone signal, that is, it is determined that the target test signal is detected, and the electronic device is in an acoustic test scene. If the peak-to-valley ratio characteristic value is smaller than the first preset threshold value, it indicates that the peak-to-valley ratio of the sub-signal of the current frame is not obvious and is not a target test signal, and the electronic device is in a normal use scene.

In some embodiments, to meet the frequency resolution requirement of target test signal detection, the analysis frame signal may be composed by frame splicing based on the current frame sub-signal and a preset number of frames of sub-signals before the current frame sub-signal, so as to improve the detection accuracy of the target test signal. The following describes an embodiment of the present disclosure with reference to fig. 5.

As shown in fig. 5, in some embodiments, in the audio signal processing method according to the examples of the present disclosure, the process of obtaining the peak-to-valley ratio feature value of the sub-signal of the current frame includes:

S510, acquiring a power spectrum of the analysis frame signal.

Specifically, the analysis frame signal includes a current frame sub-signal and consecutive sub-signals a preset number of frames before the current frame sub-signal.

In one example, the audio signal is divided into 10ms frames, the analysis frame signal includes the current 10ms frame, and the signals of three consecutive 10ms frames before the current 10ms frame, i.e. the analysis frame signal is composed of 4 frame sub-signals.
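The frame splicing in this example can be sketched with a small rolling buffer. The class name `AnalysisFrameBuffer` is hypothetical, and returning `None` until three history frames are available is an assumption; the disclosure does not say how the first few frames are handled.

```python
from collections import deque


class AnalysisFrameBuffer:
    """Keeps the current sub-frame plus a fixed number of preceding sub-frames."""

    def __init__(self, history_frames=3):
        # maxlen drops the oldest sub-frame automatically once the buffer is full.
        self._frames = deque(maxlen=history_frames + 1)

    def push(self, sub_frame):
        """Add the newest sub-frame; return the spliced analysis frame or None.

        None is returned until enough history has accumulated.
        """
        self._frames.append(list(sub_frame))
        if len(self._frames) < self._frames.maxlen:
            return None
        return [s for frame in self._frames for s in frame]
```

With 80-sample sub-frames and three history frames, each call after the third returns a 320-sample analysis frame whose last 80 samples are the current sub-frame.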

In some embodiments, after the analysis frame signal is acquired, the power spectrum corresponding to the analysis frame signal can be calculated from the data of the analysis frame signal. This is explained in the following embodiments of the present disclosure and is not detailed here.

S520, determining the peak-trough ratio of each peak and trough in the analysis frame signal according to the power spectrum.

In particular, referring to the power spectrum shown in fig. 3, it can be seen that the power spectrum includes at least one peak, so that the ratio of each peak to trough, i.e. the peak to trough ratio, can be determined.

In some embodiments, the positions and widths of the peaks and troughs can be marked by an offline labeling method, and each peak-to-trough ratio is determined from the energy sums of the peak and its troughs. The specific calculation procedure is described later in the present disclosure and is not detailed here.

S530, determining the peak-to-valley ratio characteristic value of the current frame sub-signal based on each peak-to-trough ratio.

Specifically, after each peak-to-trough ratio in the analysis frame signal is determined, each ratio may be compared against a second preset threshold set in advance. As noted above, each peak in the target test signal has a pronounced peak-to-trough ratio, so the second preset threshold can be determined in advance based on prior knowledge or a limited number of experiments. The second preset threshold is the threshold at which the peak corresponding to a peak-to-trough ratio is regarded as a single-frequency sound peak.

When the ratio of a certain peak to a trough is not less than a second preset threshold, the peak corresponding to the ratio of the peak to the trough is a single-frequency sound peak, so that the peak detection result of the peak can be determined as a first numerical value.

When a peak-to-trough ratio is smaller than the second preset threshold, the corresponding peak is relatively flat and is not a single-frequency sound peak, so the peak detection result of that peak can be determined as a second numerical value.

In one example, the first value may be set to 1 and the second value may be set to 0. Of course, one skilled in the art will appreciate that the first and second values may be other values, and the disclosure is not limited thereto.

After each peak detection result in the analysis frame signal is represented by the first value or the second value, all peak detection results can be summed, and the resulting sum is determined as the peak-to-valley ratio characteristic value of the current frame sub-signal.

After determining the peak-to-valley ratio characteristic value of the current frame sub-signal, whether the current frame sub-signal is a target test signal can be further judged according to the peak-to-valley ratio characteristic value. Those skilled in the art can refer to the foregoing step S420, which is not described in detail herein.

Therefore, in the embodiment of the present disclosure, the analysis frame signal adopts a continuous multi-frame sub-signal including the current frame sub-signal, and the peak-to-valley ratio of the analysis frame signal is used to determine whether the current frame sub-signal is the target test signal, so as to improve the frequency resolution of the target test signal detection, and further improve the detection accuracy.
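The thresholding and summation in S530, followed by the decision in S420, can be sketched as follows. The function names `peak_valley_feature` and `is_target_test_signal` are hypothetical, and any threshold values passed in are placeholders, since the disclosure leaves both thresholds to prior knowledge or experiment.

```python
def peak_valley_feature(ptr_values, second_threshold, first_value=1, second_value=0):
    """Sum the per-peak detection results for one analysis frame."""
    return sum(first_value if ptr >= second_threshold else second_value
               for ptr in ptr_values)


def is_target_test_signal(ptr_values, first_threshold, second_threshold):
    """Apply S420: detected when the feature value is not less than the first threshold."""
    return peak_valley_feature(ptr_values, second_threshold) >= first_threshold
```

With the first value set to 1 and the second value set to 0, the characteristic value is simply the number of peaks that qualify as single-frequency sound peaks.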

In some embodiments, when the peak-to-valley ratio characteristic value of the current frame sub-signal is not less than the first preset threshold, the current frame sub-signal is the target test signal and the electronic device is currently in the acoustic test scenario, so the second processing operation is performed on the audio signal following the current frame sub-signal. When the peak-to-valley ratio characteristic value of the current frame sub-signal is smaller than the first preset threshold, the current frame sub-signal is not a target test signal and the electronic device is currently in a normal use scenario, so the first processing operation is performed on the audio signal following the current frame sub-signal.

It can be seen that the audio signal processing method according to the embodiments of the present disclosure determines, by detecting the target test signal, whether the system is currently in an actual use scenario or a test scenario, and switches between different processing operations accordingly, meeting the differing requirements of the two scenarios. Because a dedicated processing operation is set for the acoustic test scenario, the reliability and the test results in that scenario can be improved.

In some embodiments, after detecting the target test signal, the system switches the processing operation on the audio signal from the first processing operation to the second processing operation. In the embodiment of the present disclosure, considering that an acoustic test scenario is generally a short-time test, in order to ensure normal use of the electronic device, it is necessary to switch the processing operation of the voice communication system on the audio signal from the second processing operation to the first processing operation.

In one example, a preset duration threshold, such as 30 seconds or the like, may be set in advance based on a priori knowledge or actual scene requirements. The preset duration threshold value represents a threshold value for switching the processing operation of the voice communication system on the audio signal from the second processing operation to the first processing operation.

When the duration of the second processing operation on the audio signal is greater than the preset duration threshold, this indicates that the current acoustic test scenario of the electronic device has ended, so the processing operation on the audio signal can be switched from the second processing operation back to the first processing operation. When the duration of the second processing operation on the audio signal is not greater than the preset duration threshold, the electronic device is still in the acoustic test scenario, so the second processing operation on the audio signal is maintained.

Therefore, the audio signal processing method disclosed by the embodiment of the disclosure can automatically switch the processing operation on the audio signal based on the duration of the second processing operation of the system, ensure the effect of the voice communication system in normal use, and improve the reliability of the system.
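A minimal sketch of this timed fallback is given below. The class name `ModeSwitcher` and the caller-supplied clock are hypothetical. The sketch also assumes that a further detection while the second operation is active does not restart the timer, which the disclosure does not specify.

```python
class ModeSwitcher:
    """Tracks which processing operation is active, with a timed fallback."""

    def __init__(self, hold_seconds=30.0):
        self.hold_seconds = hold_seconds
        self.mode = "first"
        self._second_started_at = None

    def update(self, now_seconds, test_signal_detected):
        """Advance the state using a caller-supplied clock; return the active mode."""
        if self.mode == "first":
            if test_signal_detected:
                self.mode = "second"
                self._second_started_at = now_seconds
        elif now_seconds - self._second_started_at > self.hold_seconds:
            # Duration exceeded the threshold: fall back to normal processing.
            self.mode = "first"
            self._second_started_at = None
        return self.mode
```

Passing the time in as an argument, rather than reading a system clock inside the class, keeps the logic easy to exercise frame by frame.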

Fig. 6 shows a specific embodiment of the audio signal processing method of the present disclosure, and in this embodiment, a mobile phone acoustic test scenario is taken as an example, and the following description is made specifically.

As shown in fig. 6, in an embodiment of the present disclosure, an audio signal processing method includes:

S601, acquiring an audio signal picked up by a microphone.

Specifically, in an acoustic test scenario such as that shown in fig. 2, the handset microphone may pick up an audio signal played by the speaker 11.

S602, acquiring a power spectrum of the analysis frame signal.

Specifically, the vocoder in the mobile phone divides the continuous audio signal into 20ms frames, each 20ms frame is composed of two 10ms subframes, and the frame of the sub-signal in this embodiment is a 10ms subframe.

For the narrow-band and wide-band signals in the acoustic test of the mobile phone, the sampling length of each frame of the sub-signal is respectively 80 and 160 sampling points. In order to improve the frequency resolution of the target test signal, the analysis frame signal of the embodiment uses 4 frame sub-signals, that is, the current frame sub-signal and 3 consecutive frame sub-signals before the current frame sub-signal, and the lengths of the narrowband signal and the wideband signal corresponding to the analysis frame signal are 320 sampling points and 640 sampling points, respectively. The power spectrum Xa2 of the analysis frame signal can be expressed as:

X = fft(x.*win)

where x is the analysis frame signal, which can be formed by overlapping adjacent frames, that is, the 3 historical frame sub-signals and the current frame sub-signal together form x; N is the analysis frame length; Xa2 is the power spectrum obtained from X; and win is the short-time analysis window, expressed as:

win = 0.5*[1 - cos(2π*n/N)], n = 0, 1, …, N-1
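As an illustrative sketch only, and not the implementation of the present disclosure, the framing, windowing, and power spectrum computation described above can be written as follows. The sketch assumes that Xa2 is the squared magnitude of X, uses a direct DFT in place of fft so that it needs no external library, and uses an 8 kHz narrowband example (80 sampling points per 10 ms subframe). The function names and the 1000 Hz test tone are hypothetical.

```python
import cmath
import math


def hann_window(length):
    """win[n] = 0.5 * (1 - cos(2*pi*n/N)), n = 0..N-1, as given in the text."""
    return [0.5 * (1.0 - math.cos(2.0 * math.pi * n / length)) for n in range(length)]


def build_analysis_frame(history, current):
    """Concatenate the 3 previous 10 ms sub-signals with the current one."""
    if len(history) != 3:
        raise ValueError("expected exactly 3 historical frame sub-signals")
    frame = []
    for sub in history:
        frame.extend(sub)
    frame.extend(current)
    return frame


def power_spectrum(frame):
    """Window the frame, take a direct DFT, and return |X(k)|^2 for k = 0..N/2.

    ASSUMPTION: the power spectrum Xa2 is the squared magnitude of X.
    """
    n_samples = len(frame)
    win = hann_window(n_samples)
    windowed = [s * w for s, w in zip(frame, win)]
    spectrum = []
    for k in range(n_samples // 2 + 1):
        acc = 0j
        for n, value in enumerate(windowed):
            acc += value * cmath.exp(-2j * math.pi * k * n / n_samples)
        spectrum.append(abs(acc) ** 2)
    return spectrum


# Narrowband example: 8 kHz sampling, 80 sampling points per 10 ms sub-signal.
FS = 8000
SUB = 80
tone = [math.sin(2.0 * math.pi * 1000.0 * n / FS) for n in range(4 * SUB)]
subframes = [tone[i * SUB:(i + 1) * SUB] for i in range(4)]
frame = build_analysis_frame(subframes[:3], subframes[3])
xa2 = power_spectrum(frame)
peak_bin = max(range(len(xa2)), key=lambda k: xa2[k])
```

With a 320-point analysis frame at 8 kHz, each frequency bin spans 25 Hz, which is four times finer than a single 80-point sub-signal would give.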

S603, determining the peak-to-valley ratio of each peak and trough in the analysis frame signal according to the power spectrum.

As can be seen from the power spectrum shown in fig. 3, the power spectrum includes a plurality of peaks and troughs. In one example, the positions and widths of the peaks and troughs may therefore be marked on the power spectrum in advance by an offline marking method, which reduces the amount of calculation needed to search for the peaks.

For any one peak, calculating the peak-to-valley ratio may include:

for any peak, determining a first energy sum of the peak, a second energy sum of a previous trough adjacent to the peak and a third energy sum of a next trough adjacent to the peak according to the power spectrum;

and determining the peak-to-valley ratio according to the first energy sum, the second energy sum and the third energy sum.

Specifically, the energy sum of a peak can be expressed as:

P_peak(i) = Σ Xa2(k), k = kpl(i), …, kph(i)

and the energy sum of a trough can be expressed as:

P_trough(i) = Σ Xa2(k), k = ktl(i), …, kth(i)

where i is the index of each peak, i = 0, 1, …, n. P_peak is the energy sum of a peak, kpl is the peak start bin position, and kph is the peak cutoff bin position. Similarly, P_trough is the energy sum of a trough, ktl is the trough start bin position, and kth is the trough cutoff bin position.

Thus, the peak to valley ratio is expressed as:

where ptr is the peak-to-valley ratio and δ is a small positive number used to prevent division by zero. P_peak(i) represents the energy sum of the i-th peak, P_trough(i) represents the energy sum of the trough preceding the i-th peak, and P_trough(i+1) represents the energy sum of the trough following the i-th peak. The peak-to-valley ratio of each peak in the analysis frame signal can therefore be calculated by the above formula.
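The following is a minimal sketch of the energy sums and the peak-to-valley ratio, given for illustration only. It assumes that each energy sum is taken over an inclusive bin range of the power spectrum, and it assumes one plausible form of the ratio, ptr(i) = P_peak(i) / (P_trough(i) + P_trough(i+1) + δ); the exact expression of the present disclosure may differ. The function names, the toy spectrum, and the bin ranges are hypothetical.

```python
def band_energy(power_spectrum, start_bin, stop_bin):
    """Sum of power-spectrum values over the inclusive range [start_bin, stop_bin]."""
    return sum(power_spectrum[start_bin:stop_bin + 1])


def peak_to_valley_ratios(power_spectrum, peak_bands, trough_bands, delta=1e-12):
    """Return ptr(i) for every peak i.

    peak_bands[i] = (kpl, kph) and trough_bands[i] = (ktl, kth) are inclusive
    bin ranges marked offline. trough_bands holds one more entry than
    peak_bands, so trough i precedes peak i and trough i + 1 follows it.
    ASSUMPTION: ptr(i) = P_peak(i) / (P_trough(i) + P_trough(i + 1) + delta).
    """
    if len(trough_bands) != len(peak_bands) + 1:
        raise ValueError("need exactly one more trough than peaks")
    p_trough = [band_energy(power_spectrum, lo, hi) for lo, hi in trough_bands]
    ratios = []
    for i, (kpl, kph) in enumerate(peak_bands):
        p_peak = band_energy(power_spectrum, kpl, kph)
        ratios.append(p_peak / (p_trough[i] + p_trough[i + 1] + delta))
    return ratios


# Toy power spectrum with two sharp peaks, at bins 3 and 8.
spectrum = [0.1, 0.1, 0.1, 10.0, 0.1, 0.1, 0.1, 0.1, 20.0, 0.1, 0.1, 0.1]
peaks = [(3, 3), (8, 8)]
troughs = [(0, 2), (4, 7), (9, 11)]
ptr = peak_to_valley_ratios(spectrum, peaks, troughs)
```

A sharp single-frequency peak yields a ratio far above 1, whereas a flat region of the spectrum yields a ratio below 1, which is what makes the ratio usable as a tone indicator.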

S604, judging whether each peak-to-valley ratio is not less than a second preset threshold. If yes, go to S605. If not, go to S606.

S605, determining that the peak detection result corresponding to the peak-to-valley ratio is 1.

S606, determining that the peak detection result corresponding to the peak-to-valley ratio is 0.

Specifically, after each peak-to-valley ratio in the analysis frame signal is determined, it may be compared with the second preset threshold set in advance.

When the peak-to-valley ratio of a peak is not less than the second preset threshold, the peak is a single-frequency tone peak, so its peak detection result can be determined to be 1. When the peak-to-valley ratio of a peak is less than the second preset threshold, the peak is relatively flat and is not a single-frequency tone peak, so its peak detection result can be determined to be 0. This can be expressed as:

plocal(i) = 1, if ptr(i) ≥ tlocal(i); plocal(i) = 0, otherwise

where plocal(i) is the detection result of the i-th peak and tlocal(i) is the second preset threshold.

S607, determining the numerical sum of all peak detection results in the analysis frame signal, and determining the numerical sum as the peak-to-valley ratio characteristic value of the current frame sub-signal.

Specifically, in S605 and S606, each peak detection result of the analysis frame signal is represented by 1 or 0, so all peak detection results may be summed, and the resulting sum is determined as the peak-to-valley ratio characteristic value of the current frame sub-signal. This can be expressed as:

pframe = Σ plocal(i), i = 0, 1, …, n

where pframe is the sum of the values of all peak detection results.
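Steps S604 to S607 can be sketched as follows, for illustration only. The sketch assumes that the peak-to-valley ratios have already been computed and that one second preset threshold is supplied per peak; the function names and the numeric values are hypothetical.

```python
def peak_detection_results(ratios, thresholds):
    """plocal(i) is 1 when ptr(i) is not less than tlocal(i), and 0 otherwise."""
    if len(ratios) != len(thresholds):
        raise ValueError("one threshold is required per peak")
    return [1 if ratio >= threshold else 0 for ratio, threshold in zip(ratios, thresholds)]


def peak_to_valley_feature(ratios, thresholds):
    """pframe: the sum of all peak detection results of the analysis frame signal."""
    return sum(peak_detection_results(ratios, thresholds))


# Hypothetical values: three marked peaks, the same threshold for each.
example_ratios = [14.3, 2.0, 28.6]
example_thresholds = [10.0, 10.0, 10.0]
plocal = peak_detection_results(example_ratios, example_thresholds)
pframe = peak_to_valley_feature(example_ratios, example_thresholds)
```

Because each detection result is 0 or 1, pframe is simply the number of marked peaks that qualify as single-frequency tone peaks in the current analysis frame signal.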

S608, judging whether the peak-to-valley ratio characteristic value of the current frame sub-signal is not less than a first preset threshold value. If yes, go to S609. If not, go to S610.

S609, performing a second processing operation on the audio signal.

S610, performing a first processing operation on the audio signal.

Specifically, after the peak-to-valley ratio characteristic value of the current frame sub-signal is obtained, it may be compared with the first preset threshold to determine whether the current frame sub-signal is the target test signal. This can be expressed as:

prob = 1, if pframe ≥ T; prob = 0, otherwise

where prob represents the detection result of the multi-frequency signal, and T is the first preset threshold.

In this embodiment, if the peak-to-valley ratio characteristic value is not less than the first preset threshold, the current frame sub-signal is a single-tone or multi-tone signal, that is, the target test signal is detected and the electronic device is in an acoustic test scene. In this case, the second processing operation may be performed on the audio signal.

In one example, the second processing operation includes switching the noise reduction processing operation on the audio signal to an off state, and switching the suppression parameter of the echo suppression operation on the audio signal to the second parameter.

If the peak-to-valley ratio characteristic value is less than the first preset threshold, the peaks of the current frame sub-signal are not pronounced, so it is not a target test signal and the electronic device is in a normal use scene. In this case, the first processing operation may be performed on the audio signal.

In one example, the first processing operation includes switching the noise reduction processing operation on the audio signal to an on state and switching the suppression parameter of the echo suppression operation on the audio signal to the first parameter. The second parameter suppresses the residual echo more strongly than the first parameter.
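Steps S608 to S610 can be sketched as a simple selection between two processing configurations. This is an illustration under stated assumptions, not the implementation of the present disclosure: the class name, field names, and function names are hypothetical, and the numeric suppression parameters are placeholders that serve only to distinguish the first parameter from the second.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ProcessingOperation:
    """Hypothetical container for the settings that differ between the two operations."""
    noise_reduction_on: bool
    echo_suppression_parameter: float


# Placeholder parameter values; only their being distinct matters here.
FIRST_OPERATION = ProcessingOperation(noise_reduction_on=True, echo_suppression_parameter=1.0)
SECOND_OPERATION = ProcessingOperation(noise_reduction_on=False, echo_suppression_parameter=2.0)


def detect_target_test_signal(pframe, first_threshold):
    """prob is 1 when pframe is not less than the first preset threshold T, else 0."""
    return 1 if pframe >= first_threshold else 0


def select_operation(pframe, first_threshold):
    """Use the second operation in a test scene and the first in normal use."""
    prob = detect_target_test_signal(pframe, first_threshold)
    return SECOND_OPERATION if prob == 1 else FIRST_OPERATION


operation_in_test = select_operation(pframe=3, first_threshold=2)
operation_in_use = select_operation(pframe=1, first_threshold=2)
```

Keeping the two operations as immutable configuration objects means that the detection result only selects which settings the downstream noise reduction and echo suppression stages read, rather than changing those stages themselves.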

It should be noted that, in general, the enhancement algorithms applied to the audio signal by a mobile phone voice communication system include echo cancellation algorithms, noise estimation algorithms, noise cancellation algorithms, residual echo/noise suppression algorithms, and the like.

The echo cancellation algorithm can be implemented by using an adaptive filter, such as the NLMS (normalized least mean squares) method, which can be expressed as:

w(n) = w(n-1) + μx(n)e(n)/(x^T(n)x(n) + δ)

where n is the sample time index, x(n) is the reference signal, y(n) is the audio signal picked up by the microphone, e(n) is the residual signal after the linear echo estimate is cancelled from y(n), w(n) is the adaptive filter, μ is the filter adaptation step size, and δ is a positive constant that prevents division by zero.
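A minimal time-domain NLMS sketch is given below for illustration. It assumes that the echo estimate is the inner product of the previous filter w(n-1) with the most recent reference samples and that e(n) is the microphone sample minus that estimate, which is the usual NLMS arrangement. The filter length, step size, simulated echo path, and function name are hypothetical.

```python
import random


def nlms_echo_canceller(reference, microphone, taps=8, mu=0.5, delta=1e-6):
    """Return the final filter w and the residual e(n) for every sample."""
    w = [0.0] * taps
    x_buf = [0.0] * taps  # x_buf[0] holds the newest reference sample
    residual = []
    for x_n, y_n in zip(reference, microphone):
        x_buf = [x_n] + x_buf[:-1]
        # Echo estimate from the previous filter w(n-1); assumed standard NLMS form.
        echo_estimate = sum(wi * xi for wi, xi in zip(w, x_buf))
        e_n = y_n - echo_estimate
        # w(n) = w(n-1) + mu * x(n) * e(n) / (x^T(n) x(n) + delta)
        norm = sum(xi * xi for xi in x_buf) + delta
        w = [wi + mu * xi * e_n / norm for wi, xi in zip(w, x_buf)]
        residual.append(e_n)
    return w, residual


# Simulated, noise-free echo through a hypothetical 3-tap echo path.
rng = random.Random(0)
echo_path = [0.5, -0.3, 0.1]
reference = [rng.uniform(-1.0, 1.0) for _ in range(2000)]
microphone = [
    sum(h * reference[n - j] for j, h in enumerate(echo_path) if n - j >= 0)
    for n in range(len(reference))
]
weights, residual = nlms_echo_canceller(reference, microphone)
```

In this noise-free simulation the first three filter coefficients approach the simulated echo path and the residual decays toward zero; with a real microphone signal the residual also contains near-end speech and noise.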

The noise estimation can adopt a continuous spectral minimum tracking method, expressed as:

where l denotes the frame index, k denotes the frequency bin, the estimated quantity is the noise power spectrum, and λ_y is the power spectrum of the audio signal picked up by the microphone. γ and β are used to control the noise tracking speed.
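One widely used formulation of continuous spectral minimum tracking is sketched below for illustration. It is an assumption that this formulation corresponds to the expression of the present disclosure: when the previous noise estimate is below the current power, the estimate rises slowly under the control of γ and β, and otherwise it follows the current power directly. The default values of γ and β, the function name, and the input data are hypothetical.

```python
def track_noise_minimum(power_frames, gamma=0.998, beta=0.96):
    """Track the noise power spectrum frame by frame.

    power_frames[l][k] is the power spectrum of the microphone signal at
    frame l and frequency bin k. Returns the noise estimate for every frame.
    ASSUMED update rule (continuous spectral minimum tracking):
      if noise(l-1, k) < p(l, k):
          noise(l, k) = gamma * noise(l-1, k)
                        + (1 - gamma) / (1 - beta) * (p(l, k) - beta * p(l-1, k))
      else:
          noise(l, k) = p(l, k)
    """
    noise = list(power_frames[0])
    previous = list(power_frames[0])
    history = [list(noise)]
    for frame in power_frames[1:]:
        updated = []
        for k, p in enumerate(frame):
            if noise[k] < p:
                estimate = gamma * noise[k] + (1.0 - gamma) / (1.0 - beta) * (p - beta * previous[k])
            else:
                estimate = p
            updated.append(estimate)
        noise = updated
        previous = list(frame)
        history.append(list(noise))
    return history


# One frequency bin: steady noise, a short loud burst, then a quieter floor.
frames = [[1.0]] * 5 + [[100.0]] * 3 + [[0.5]]
noise_history = track_noise_minimum(frames)
```

In this example the estimate stays at the steady level, rises only slightly during the burst, and drops immediately when the power falls below it, which is the behaviour expected of a minimum tracker.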

The residual echo estimation may be performed using first-order recursive smoothing,

where λ_e is the residual echo estimate and E(l, k) is the short-time Fourier transform of the residual echo. α_e is a forgetting factor, 0 < α_e < 1; a smaller value is used when a test signal is detected to better track the residual echo, and a larger value is used otherwise to reduce near-end speech impairment.
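For illustration, first-order recursive smoothing of the residual echo power is commonly written in the following form. It is an assumption that the forgetting factor weights the previous estimate, which is consistent with a smaller value tracking the residual echo more quickly.

```latex
\lambda_e(l,k) = \alpha_e\,\lambda_e(l-1,k) + (1-\alpha_e)\,\lvert E(l,k)\rvert^{2},
\qquad 0 < \alpha_e < 1
```

Under this form, a step change in the residual echo power is followed to within about 10% after roughly 4 frames when the forgetting factor is 0.5, and after roughly 22 frames when it is 0.9.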

Noise and residual echo suppression may employ a Wiener filtering method, which is expressed as:

where the interference power spectrum is the sum of the noise estimate and residual echo estimate power spectra; when a target test signal is detected, it is set so that noise suppression does not affect the acoustic test. γ(l, k) is the posterior signal-to-interference ratio, and the prior signal-to-interference ratio is calculated by the decision-directed method from the target speech signal estimated in the previous frame. G(l, k) is the gain function, and Δ controls the suppression strength: a larger value is used when the target test signal is detected to suppress more residual echo, and a smaller value is used otherwise to reduce near-end speech loss.
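The gain computation can be sketched per frequency bin as follows, for illustration only. The sketch assumes a parametric Wiener gain of the form G = ξ / (ξ + Δ), where ξ is the prior signal-to-interference ratio, and the standard decision-directed update for ξ; the exact expressions of the present disclosure may differ. The smoothing constant, the function name, and the numeric inputs are hypothetical.

```python
def wiener_gain(noisy_power, interference_power, prev_clean_power,
                strength=1.0, smoothing=0.98, floor=1e-12):
    """Return (gain, posterior SIR, prior SIR) for one frequency bin.

    noisy_power        power of the current input bin
    interference_power noise estimate plus residual echo estimate
    prev_clean_power   power of the target speech estimated in the previous frame
    strength           suppression-strength parameter (Delta in the text)
    ASSUMPTIONS: decision-directed prior SIR and gain = xi / (xi + strength).
    """
    interference = max(interference_power, floor)
    posterior = noisy_power / interference
    prior = (smoothing * prev_clean_power / interference
             + (1.0 - smoothing) * max(posterior - 1.0, 0.0))
    gain = prior / (prior + strength)
    return gain, posterior, prior


# Same bin evaluated with a smaller and a larger suppression strength.
gain_normal, gamma_post, xi_prior = wiener_gain(4.0, 1.0, 3.0, strength=1.0)
gain_test, _, _ = wiener_gain(4.0, 1.0, 3.0, strength=3.0)
```

With these inputs the prior signal-to-interference ratio is 3, so the gain drops from 0.75 to 0.5 as the strength rises from 1 to 3, which illustrates how a larger value suppresses more of the interference.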

In a second aspect, the embodiments of the present disclosure provide an audio signal processing apparatus, which may be applied to any electronic device with a voice communication system, such as a smart phone, a tablet computer, a notebook computer, and the like, and the disclosure is not limited thereto.

As shown in fig. 7, in some embodiments, an audio signal processing apparatus of an example of the present disclosure includes:

an acquisition module 701 configured to acquire an audio signal picked up by a microphone;

a detection module 702 configured to detect the audio signal, and in response to detecting a target test signal from the audio signal, switch a processing operation on the audio signal from a first processing operation to a second processing operation.

With the audio signal processing apparatus of the embodiments of the present disclosure, whether the system is currently in an actual use scene or a test scene is determined by detecting the target test signal, and different processing operations are applied in the two scenes to meet their differing requirements. Because a dedicated processing operation is set for the acoustic test scene, the reliability and the test results of the acoustic test scene can be improved.

As shown in fig. 8, in some embodiments, the detection module 702 includes:

a switching sub-module 703 configured to switch a noise reduction processing operation on the audio signal from an on state to an off state; and/or switching a suppression parameter of an echo suppression operation on the audio signal from a first parameter to a second parameter.

In some embodiments, the audio signal comprises multiple frame sub-signals that are consecutive in the time domain; the detection module 702 includes:

an acquisition submodule 704 configured to obtain a peak-to-valley ratio characteristic value of the current frame sub-signal;

a determining sub-module 705 configured to determine that the target test signal is detected in response to the peak-to-valley ratio characteristic value not being less than a first preset threshold.

In some embodiments, the acquisition submodule 704 is specifically configured to:

determine, according to the power spectrum, the peak-to-valley ratio of each peak and trough in the analysis frame signal.

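In some embodiments, an electronic device of the present disclosure includes a processor and a memory storing computer instructions for causing the processor to perform the method of any of the above embodiments.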

Fig. 9 is a block diagram of an electronic device according to some embodiments of the present disclosure, and the following describes principles related to the electronic device and a storage medium according to some embodiments of the present disclosure with reference to fig. 9.

Referring to fig. 9, the electronic device 1800 may include one or more of the following components: processing component 1802, memory 1804, power component 1806, multimedia component 1808, audio component 1810, input/output (I/O) interface 1812, sensor component 1816, and communications component 1818.

The processing component 1802 generally controls the overall operation of the electronic device 1800, such as operations associated with display, telephone calls, data communications, camera operations, and recording operations. The processing component 1802 may include one or more processors 1820 to execute instructions. Further, the processing component 1802 may include one or more modules that facilitate interaction between the processing component 1802 and other components. For example, the processing component 1802 can include a multimedia module to facilitate interaction between the multimedia component 1808 and the processing component 1802. As another example, the processing component 1802 can read executable instructions from a memory to implement electronic device related functions.

The memory 1804 is configured to store various types of data to support operation at the electronic device 1800. Examples of such data include instructions for any application or method operating on the electronic device 1800, contact data, phonebook data, messages, pictures, videos, and so forth. The memory 1804 may be implemented by any type or combination of volatile or non-volatile memory devices, such as Static Random Access Memory (SRAM), electrically erasable programmable read-only memory (EEPROM), erasable programmable read-only memory (EPROM), programmable read-only memory (PROM), read-only memory (ROM), magnetic memory, flash memory, magnetic or optical disks.

The power component 1806 provides power to various components of the electronic device 1800. The power component 1806 may include a power management system, one or more power supplies, and other components associated with generating, managing, and distributing power for the electronic device 1800.

The multimedia component 1808 includes a display screen that provides an output interface between the electronic device 1800 and a user. In some embodiments, the multimedia component 1808 includes a front facing camera and/or a rear facing camera. The front-facing camera and/or the rear-facing camera can receive external multimedia data when the electronic device 1800 is in an operating mode, such as a shooting mode or a video mode. Each front camera and rear camera may be a fixed optical lens system or have a focal length and optical zoom capability.

Audio component 1810 is configured to output and/or input audio signals. For example, the audio component 1810 can include a Microphone (MIC) that can be configured to receive external audio signals when the electronic device 1800 is in an operational mode, such as a call mode, a recording mode, and a voice recognition mode. The received audio signals may further be stored in the memory 1804 or transmitted via the communication component 1818. In some embodiments, audio component 1810 also includes a speaker for outputting audio signals.

I/O interface 1812 provides an interface between processing component 1802 and peripheral interface modules, which may be keyboards, click wheels, buttons, etc. These buttons may include, but are not limited to: a home button, a volume button, a start button, and a lock button.

The sensor component 1816 includes one or more sensors that provide status assessments of various aspects of the electronic device 1800. For example, the sensor component 1816 can detect an open/closed state of the electronic device 1800 and the relative positioning of components, such as a display and keypad of the electronic device 1800. The sensor component 1816 can also detect a change in position of the electronic device 1800 or a component of the electronic device 1800, the presence or absence of user contact with the electronic device 1800, the orientation or acceleration/deceleration of the electronic device 1800, and a change in temperature of the electronic device 1800. The sensor component 1816 may include a proximity sensor configured to detect the presence of a nearby object without any physical contact. The sensor component 1816 may also include a light sensor, such as a CMOS or CCD image sensor, for use in imaging applications. In some embodiments, the sensor component 1816 may also include an acceleration sensor, a gyroscope sensor, a magnetic sensor, a pressure sensor, or a temperature sensor.

The communication component 1818 is configured to facilitate communications between the electronic device 1800 and other devices in a wired or wireless manner. The electronic device 1800 may access a wireless network based on a communication standard, such as Wi-Fi, 2G, 3G, 4G, 5G, or 6G, or a combination thereof. In an exemplary embodiment, the communication component 1818 receives a broadcast signal or broadcast associated information from an external broadcast management system via a broadcast channel. In an exemplary embodiment, the communication component 1818 further includes a Near Field Communication (NFC) module to facilitate short-range communications. For example, the NFC module may be implemented based on Radio Frequency Identification (RFID) technology, infrared data association (IrDA) technology, Ultra Wideband (UWB) technology, Bluetooth (BT) technology, and other technologies.

In an exemplary embodiment, the electronic device 1800 may be implemented by one or more Application Specific Integrated Circuits (ASICs), Digital Signal Processors (DSPs), Digital Signal Processing Devices (DSPDs), Programmable Logic Devices (PLDs), Field Programmable Gate Arrays (FPGAs), controllers, micro-controllers, microprocessors, or other electronic components.

It should be understood that the above embodiments are merely examples given to clearly illustrate the present disclosure and are not intended to limit its implementations. Other variations and modifications in different forms will be apparent to persons skilled in the art in light of the above description. It is neither necessary nor possible to enumerate all embodiments here. Obvious variations or modifications derived from the above description remain within the scope of the present disclosure.

Claims

1. An audio signal processing method, comprising:

acquiring an audio signal picked up by a microphone;

2. The method of claim 1, wherein switching the processing operation on the audio signal from a first processing operation to a second processing operation comprises:

and/or,

3. The method according to claim 1 or 2, characterized in that the audio signal comprises a multi-frame sub-signal consecutive in the time domain; detecting a target test signal from the audio signal, comprising:

4. The method of claim 3, wherein obtaining the peak-to-valley ratio characteristic value of the current frame sub-signal comprises:

5. The method of claim 4, wherein determining a peak-to-valley ratio of each peak and valley in the analysis frame signal from the power spectrum comprises:

6. The method of claim 4, wherein said determining said peak-to-valley ratio characteristic of said current frame subsignal based on respective peak-to-valley ratios comprises:

7. The method of claim 1, further comprising, after the switching the processing operation on the audio signal from the first processing operation to the second processing operation:

8. An audio signal processing apparatus, comprising:

9. The apparatus of claim 8, wherein the detection module comprises:

10. The apparatus according to claim 8 or 9, wherein the audio signal comprises a multi-frame sub-signal that is consecutive in the time domain; the detection module comprises:

11. The apparatus of claim 10, wherein the acquisition submodule is specifically configured to:

12. The apparatus of claim 11, wherein the acquisition submodule is specifically configured to:

13. The apparatus of claim 11, wherein the acquisition submodule is specifically configured to:

14. An electronic device, comprising:

a processor; and

memory storing computer instructions for causing a processor to perform the method according to any one of claims 1 to 7.

15. A storage medium having stored thereon computer instructions for causing a computer to perform the method of any one of claims 1 to 7.