CN112037825A

CN112037825A - Audio signal processing method and device and storage medium

Info

Publication number: CN112037825A
Application number: CN202010797931.4A
Authority: CN
Inventors: 何梦楠
Original assignee: Beijing Xiaomi Pinecone Electronic Co Ltd
Current assignee: Beijing Xiaomi Pinecone Electronic Co Ltd
Priority date: 2020-08-10
Filing date: 2020-08-10
Publication date: 2020-12-04
Anticipated expiration: 2040-08-10
Also published as: CN112037825B

Abstract

The disclosure relates to a method and a device for processing an audio signal, and a storage medium. The method includes: collecting audio signals by using at least two audio acquisition channels, wherein the audio acquisition channels comprise a target channel and an alternative channel, and the audio signal collected by the alternative channel is used for noise filtering of the audio signal collected by the target channel; determining, according to the signal energy of the collected audio signals, that the signal energy of the target channel meets a preset energy condition; and when the signal energy of the target channel meets the preset energy condition, switching the audio acquisition channel corresponding to the target channel. With the technical scheme of the embodiments of the disclosure, the plurality of audio acquisition channels of an electronic device can be used flexibly, and the audio quality is improved.

Description

Audio signal processing method and device and storage medium

Technical Field

The present disclosure relates to signal processing technologies, and in particular, to a method and an apparatus for processing an audio signal, and a storage medium.

Background

With the development of electronic devices, mobile terminals and intelligent electronic devices have gradually gained more functions and applications, and their human-computer interaction capabilities have become more and more powerful. For audio acquisition, an electronic device can use a multi-channel acquisition mode, which benefits functions such as noise filtering, echo cancellation, and sound source localization, and offers higher practicability and stability. However, in the multi-channel audio acquisition mode, sound breaking may occur on some channels when the volume of the sound source is excessive, which degrades the sound quality.

Disclosure of Invention

The disclosure provides a method and a device for processing an audio signal and a storage medium.

According to a first aspect of the embodiments of the present disclosure, there is provided a method for processing an audio signal, including:

collecting audio signals by using at least two audio acquisition channels; wherein the audio acquisition channels comprise: a target channel and an alternative channel; the audio signal collected by the alternative channel is used for carrying out noise filtering on the audio signal collected by the target channel;

determining that the signal energy of the target channel meets a preset energy condition according to the signal energy of the acquired audio signal;

and when the signal energy of the target channel meets a preset energy condition, switching the audio acquisition channel corresponding to the target channel.

In some embodiments, the switching the audio acquisition channel corresponding to the target channel when the signal energy of the target channel satisfies a preset energy condition includes:

and if the signal energy of the target channel exceeds the product of the signal energy of the alternative channel and a preset reference threshold, switching the audio acquisition channel corresponding to the target channel.

In some embodiments, the switching the audio acquisition channel corresponding to the target channel if the signal energy of the target channel exceeds a product of the signal energy of the alternative channel and a preset reference threshold includes:

and if the signal energy of the target channel exceeds the product of the signal energy of the alternative channel and a preset reference threshold value in at least N frames of the time domain, switching the audio acquisition channel corresponding to the target channel, wherein N is an integer greater than or equal to 1, and the signal energy is the energy of the audio signal in a preset frequency band.

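In some embodiments, the switching the audio acquisition channel corresponding to the target channel when the signal energy of the target channel satisfies a preset energy condition includes:

if the signal energy of the target channel meets a preset energy condition, determining the cross-correlation coefficient between the audio signals acquired by the at least two audio acquisition channels and a preset reference signal;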

and if the cross-correlation coefficients of the audio signals acquired by the at least two audio acquisition channels and the preset reference signal meet a preset correlation condition, switching the audio acquisition channel corresponding to the target channel.

In some embodiments, the switching the audio acquisition channel corresponding to the target channel if the cross-correlation coefficients of the audio signals acquired by the at least two audio acquisition channels and the reference signal satisfy a preset correlation condition includes:

if the cross-correlation coefficient between the audio signal acquired by the target channel and the preset reference signal is smaller than a first correlation threshold value and the cross-correlation coefficient between the audio signal acquired by the alternative channel and the preset reference signal is larger than a second correlation threshold value in at least M frames of a time domain, switching the audio acquisition channel corresponding to the target channel; wherein M is a positive integer greater than or equal to 1.

In some embodiments, the determining the cross-correlation coefficient between the audio signals acquired by the at least two audio acquisition channels and a preset reference signal comprises:

determining first autocorrelation coefficients of audio signals acquired by the at least two audio acquisition channels;

determining a second autocorrelation coefficient of the reference signal; the reference signal is an audio signal sent by the electronic equipment where the at least two audio acquisition channels are located;

and determining the cross-correlation coefficient of the audio signal acquired by each audio acquisition channel and the reference signal according to the first autocorrelation coefficient and the second autocorrelation coefficient of each audio acquisition channel.

In some embodiments, the method further comprises:

acquiring frequency domain noisy signals corresponding to the audio signals acquired by the at least two audio acquisition channels;

determining the power spectral densities of the audio signals acquired by the at least two audio acquisition channels at each frequency point according to the frequency domain noisy signals;

and determining the signal energy according to the weighted sum of the power spectral density of the audio signal at each frequency point according to a preset weight.

According to a second aspect of the embodiments of the present disclosure, there is provided an apparatus for processing an audio signal, including:

the acquisition module is used for acquiring audio signals by utilizing at least two audio acquisition channels; wherein the audio acquisition channels comprise: a target channel and an alternative channel; the audio signal collected by the alternative channel is used for carrying out noise filtering on the audio signal collected by the target channel;

the first determining module is used for determining that the signal energy of the target channel meets a preset energy condition according to the signal energy of the collected audio signal;

and the switching module is used for switching the audio acquisition channel corresponding to the target channel when the signal energy of the target channel meets a preset energy condition.

In some embodiments, the switching module comprises:

and the first switching submodule is used for switching the audio acquisition channel corresponding to the target channel if the signal energy of the target channel exceeds the product of the signal energy of the alternative channel and a preset reference threshold.

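In some embodiments, the first switching submodule is specifically configured to:

switch the audio acquisition channel corresponding to the target channel if the signal energy of the target channel exceeds the product of the signal energy of the alternative channel and a preset reference threshold in at least N frames of the time domain, wherein N is an integer greater than or equal to 1, and the signal energy is the energy of the audio signal in a preset frequency band.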

In some embodiments, the switching module comprises:

the determining submodule is used for determining the cross-correlation coefficient between the audio signals acquired by the at least two audio acquisition channels and a preset reference signal if the signal energy of the target channel meets a preset energy condition;

and the second switching submodule is used for switching the audio acquisition channel corresponding to the target channel if the cross-correlation coefficient between the audio signals acquired by the at least two audio acquisition channels and the preset reference signal meets a preset correlation condition.

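In some embodiments, the second switching submodule is specifically configured to:

switch the audio acquisition channel corresponding to the target channel if, in at least M frames of the time domain, the cross-correlation coefficient between the audio signal acquired by the target channel and the preset reference signal is smaller than a first correlation threshold and the cross-correlation coefficient between the audio signal acquired by the alternative channel and the preset reference signal is larger than a second correlation threshold; wherein M is a positive integer greater than or equal to 1.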

In some embodiments, the determining sub-module comprises:

a first determining unit, configured to determine first autocorrelation coefficients of the audio signals acquired by the at least two audio acquisition channels;

a second determining unit for determining a second autocorrelation coefficient of the reference signal; the reference signal is an audio signal sent by the electronic equipment where the at least two audio acquisition channels are located;

and the third determining unit is used for determining the cross-correlation coefficient between the audio signal acquired by each audio acquisition channel and the reference signal according to the first autocorrelation coefficient and the second autocorrelation coefficient of each audio acquisition channel.

In some embodiments, the apparatus further comprises:

the acquisition module is used for acquiring frequency domain noisy signals corresponding to the audio signals acquired by the at least two audio acquisition channels;

the second determining module is used for determining the power spectral densities of the audio signals acquired by the at least two audio acquisition channels at each frequency point according to the frequency domain noisy signals;

and the third determining module is used for determining the signal energy according to the weighted sum of the power spectral densities of the audio signals at all the frequency points according to the preset weight.

According to a third aspect of the present disclosure, there is provided an apparatus for processing an audio signal, the apparatus comprising at least: a processor and a memory for storing executable instructions operable on the processor, wherein:

the processor is configured to execute the executable instructions, and the executable instructions perform the steps of any of the audio signal processing methods.

According to a fourth aspect of the present disclosure, there is provided a non-transitory computer-readable storage medium having stored therein computer-executable instructions that, when executed by a processor, implement the steps in any of the above-described methods of processing an audio signal.

The technical scheme provided by the embodiment of the disclosure can have the following beneficial effects: through the technical scheme of the embodiment of the disclosure, whether the target channel meets the requirement or not is judged according to the signal energy of the audio signal collected by each audio collecting channel, and then the switching of the target channel is realized. Therefore, the target channel can be switched when the signal energy of the audio signal collected by the target channel cannot meet the requirement, and the signal quality of the target channel is improved. For example, when the signal energy of the target channel is too large to cause the sound breaking phenomenon, the target channel can be switched to reduce the sound breaking phenomenon; when the target channel is far away from the sound source and cannot collect clear audio signals, the target channel can be switched so as to collect clear audio signals conveniently; for another example, when the signal energy is not distributed in the predetermined frequency band range according to the requirement, the target channel is switched, so that the audio signal acquired by the target channel meets the requirement of the electronic device on the audio signal as a whole as much as possible.

It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the disclosure.

Drawings

The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the invention and together with the description, serve to explain the principles of the invention.

FIG. 1 is a first flowchart illustrating a method of processing an audio signal according to an exemplary embodiment;

FIG. 2 is a second flowchart illustrating a method of processing an audio signal according to an exemplary embodiment;

FIG. 3 is a schematic diagram illustrating a terminal with two audio acquisition channels in accordance with an exemplary embodiment;

FIG. 4 is a third flowchart illustrating a method of processing an audio signal according to an exemplary embodiment;

FIG. 5 is a block diagram illustrating a configuration of an audio signal processing apparatus according to an exemplary embodiment;

FIG. 6 is a block diagram illustrating a physical structure of an audio signal processing apparatus according to an exemplary embodiment.

Detailed Description

Reference will now be made in detail to the exemplary embodiments, examples of which are illustrated in the accompanying drawings. When the following description refers to the accompanying drawings, like numbers in different drawings represent the same or similar elements unless otherwise indicated. The embodiments described in the following exemplary embodiments do not represent all embodiments consistent with the present invention. Rather, they are merely examples of apparatus and methods consistent with certain aspects of the invention, as detailed in the appended claims.

FIG. 1 is a flowchart illustrating a method of processing an audio signal according to an exemplary embodiment. As shown in FIG. 1, the method includes the following steps:

Step S101, collecting audio signals by using at least two audio acquisition channels; wherein the audio acquisition channels comprise: a target channel and an alternative channel; the audio signal collected by the alternative channel is used for carrying out noise filtering on the audio signal collected by the target channel;

Step S102, determining that the signal energy of the target channel meets a preset energy condition according to the signal energy of the collected audio signal;

Step S103, when the signal energy of the target channel meets the preset energy condition, switching the audio acquisition channel corresponding to the target channel.

The method of the embodiments of the present disclosure may be applied to a terminal. The terminal may be an electronic device having an audio acquisition component (e.g., a microphone), including a mobile phone, a notebook computer, a video camera, a wearable electronic device, and various other electronic devices with human-computer interaction capability. The terminal may also be an electronic device that has an audio file processing function, such as a computer or a sound device that has no audio acquisition function but can process audio files.

The audio signal is processed in order to obtain higher signal quality and to filter out most of the noise in the audio signal. In the embodiments of the present disclosure, the terminal may have at least two audio acquisition channels located at different positions, at least one of which may serve as the target channel. The audio signal collected by the target channel is used as the audio signal to be processed: noise reduction is performed on it, that is, the noise is filtered out, and the noise-reduced audio signal is then played, transmitted, or stored.

The alternative channel may be a channel that assists the target channel in noise filtering. In addition, the alternative channel can assist the target channel in functions such as sound source signal separation and sound source position determination.

In the embodiments of the present disclosure, since the audio signal acquired by the target channel is the target signal, it needs to have high signal quality. If the quality of the audio signal collected by the current target channel is poor, for example, sound breaking is likely to occur, the volume is too low, or the noise is large, the target channel needs to be switched to improve the signal quality of the target signal.

Therefore, in the embodiment of the present disclosure, whether the signal energy satisfies the preset energy condition is determined by detecting the signal energy of the target channel, and if the preset energy condition is satisfied, the audio acquisition channel corresponding to the target channel may be switched, so as to find a more suitable audio acquisition channel as the target channel.

In an embodiment, when the terminal is powered on or the acquisition function of the audio acquisition channels has just been started, the target channel and the alternative channel may be determined according to default settings of the system. For example, a mobile phone has one audio acquisition channel at the top and one at the bottom. Since the bottom audio acquisition channel is closer to the sound source during a typical call, it may be set as the default target channel, and the top audio acquisition channel as the default alternative channel.

In another embodiment, the target channel may also be determined according to the terminal posture detection, for example, the audio acquisition channel with the lowest physical height may be determined as the target channel according to the terminal posture detected by the posture sensor.

Of course, the initial target channel and alternative channel may also be determined randomly. Whether the target channel and the alternative channel need to be switched is then judged according to the signal energy of each audio acquisition channel during audio acquisition.

Here, the preset energy condition may be that the signal energy is outside a predetermined energy range, that is, the signal energy is lower than the minimum threshold of the predetermined energy range or higher than its maximum threshold. If the signal energy is within the predetermined energy range, the audio signal collected by the target channel is unlikely to exhibit sound breaking, and the volume is clear. If the signal energy is lower than the minimum threshold, the volume may be too low for the user to hear; if the signal energy is higher than the maximum threshold, the volume may be so high that sound breaking occurs easily, or noise that affects the user's hearing is easily produced. It should be noted that the predetermined energy range may be set to correspond to the common range of the human voice, the comfortable range of human hearing, or the like.

Of course, the signal energy may also be too low simply because the environment in which the audio acquisition channels are located is quiet and no sound source is producing sound. Therefore, the preset energy condition may also include: the signal energy of the audio signal acquired by the target channel is below the predetermined minimum threshold and the signal energy of the audio signal acquired by the alternative channel is within the predetermined energy range. In this way, the target channel can also be switched when it fails or is too far from the sound source. If the signal energies of the audio signals collected by the target channel and the alternative channel are both within the predetermined energy range, or are both lower than the predetermined minimum threshold, the target channel may be left unswitched and its current setting maintained.
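The range-based condition described above can be sketched as follows. This is only one illustrative reading, not the disclosed implementation: the function name `energy_condition_met`, its parameter names, and any numeric bounds passed to it are hypothetical, and the sketch applies the refined low-energy rule (target below the minimum while the alternative is within the range) for the low-energy case.

```python
def energy_condition_met(target_energy, alt_energy, e_min, e_max):
    """Return True if the target channel is a candidate for switching.

    Hypothetical helper: e_min and e_max are the bounds of the predetermined
    energy range; the source does not give concrete values for them.
    """
    if target_energy > e_max:
        # Too loud: sound breaking is likely on the target channel.
        return True
    alt_in_range = e_min <= alt_energy <= e_max
    if target_energy < e_min and alt_in_range:
        # Target is too quiet while the alternative picks up a usable signal:
        # the target may be faulty or too far from the sound source.
        return True
    # Target in range, or both channels quiet (silent environment):
    # keep the current target channel.
    return False
```

With a hypothetical range of 1.0 to 10.0, a target energy of 0.5 triggers a switch only when the alternative channel's energy lies inside the range; if both channels are quiet, the current setting is kept.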

Therefore, by the method, the target channel can be switched in real time through the energy detection of the audio signals collected by the audio collecting channels, so that the clear audio signals with proper volume can be conveniently obtained, and the quality of the whole audio signals is improved.

One case in which the preset energy condition is satisfied is that the signal energy of the target channel exceeds the product of the signal energy of the alternative channel and a preset reference threshold; that is, whether the condition is satisfied is determined from the relationship between the signal energy of the target channel and that of the alternative channel. The preset reference threshold may be determined according to the positional relationship between the target channel and the alternative channel. For example, if the target channel and the alternative channel are close to each other, the energies of the two collected audio signals should be close under normal conditions, so the preset reference threshold may be set to about 1. If the target channel is far from the alternative channel and closer to the sound source, the signal energy of the audio signal collected by the target channel may be higher than that of the alternative channel under normal conditions, so the preset reference threshold may be set larger, for example, greater than 2.

In addition, the preset reference threshold may also be determined according to how the actual signal energy of the target channel affects the sound quality. For example, in a typical call scenario, if the signal energy of the target channel is more than 3 times the signal energy of the alternative channel, sound breaking may occur and noticeably degrade the sound quality. Therefore, the preset reference threshold may be set to a value less than or equal to 3.

Therefore, the relation of signal energy among the signal channels is judged according to the preset reference threshold value, whether the target channel needs to be switched or not can be judged quickly, and the audio signal is kept to have a good tone quality effect as far as possible by switching the target channel.

During audio acquisition, glitches may occur in the audio signal collected by the target channel because of environmental factors or signal interference from the device itself. A glitch is very short and not easily noticed, so the target channel does not need to be switched in that case. However, if glitches recur many times in the audio signal collected by the target channel, or the signal energy stays too large, the sound quality of the audio signal may be greatly affected.

Therefore, in the embodiments of the present disclosure, the target channel may be switched only when the number of signal frames in which the signal energy of the target channel exceeds the product of the signal energy of the alternative channel and the preset reference threshold is greater than or equal to N.

It should be noted that if the value of N is large, the switching of the target channel may not be sensitive enough, but unnecessary repeated switching may be reduced, and if the value of N is small, the switching of the target channel is more sensitive, but repeated switching may be caused by glitches occurring in the audio signals acquired by each audio acquisition channel. Therefore, in practical applications, the value of N may be set according to the requirement for audio signal acquisition or the actual audio acquisition environment, which is not limited herein.
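The N-frame rule can be illustrated with a small per-frame detector. This is a sketch under stated assumptions rather than the disclosed implementation: the class name `EnergySwitchDetector`, the default values, and the choice to count consecutive frames (resetting when a frame fails the test) are all hypothetical, since the text only requires that the condition hold in at least N frames.

```python
class EnergySwitchDetector:
    """Tracks how many frames in a row satisfy
    target_energy > reference_threshold * alt_energy.

    Counting consecutive frames is an assumption of this sketch.
    """

    def __init__(self, reference_threshold=2.0, n_frames=3):
        self.reference_threshold = reference_threshold
        self.n_frames = n_frames
        self._count = 0

    def update(self, target_energy, alt_energy):
        """Feed the energies of one frame; return True once N frames are reached."""
        if target_energy > self.reference_threshold * alt_energy:
            self._count += 1
        else:
            # The run is broken, so an isolated glitch never reaches N.
            self._count = 0
        return self._count >= self.n_frames


# Example: one glitch frame, a normal frame, then three loud frames in a row.
detector = EnergySwitchDetector(reference_threshold=2.0, n_frames=3)
frames = [(5.0, 1.0), (1.0, 1.0), (5.0, 1.0), (5.0, 1.0), (5.0, 1.0)]
decisions = [detector.update(t, a) for t, a in frames]
# decisions == [False, False, False, False, True]
```

A larger `n_frames` makes the detector slower to react but more robust to glitches, which matches the trade-off described above. After a switch, a caller would normally create a fresh detector for the new target channel.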

In some embodiments, as shown in fig. 2, in the step S103, when the signal energy of the target channel satisfies a preset energy condition, switching the audio capture channel corresponding to the target channel includes:

step S201, if the signal energy of the target channel meets a preset energy condition, determining the cross-correlation coefficient between the audio signals acquired by the at least two audio acquisition channels and a preset reference signal;

step S202, if the cross correlation coefficients of the audio signals acquired by the at least two audio acquisition channels and the preset reference signal meet a preset correlation condition, switching the audio acquisition channel corresponding to the target channel.

Since the target channel and the alternative channel may be far apart, the signal energies they acquire may differ greatly. If the target channel is switched according to the signal energy alone, the new target channel (the former alternative channel) may be far from the sound source, making it difficult to acquire an audio signal of sufficient volume after the switch.

Therefore, in the embodiment of the present disclosure, when the signal energy of the target channel satisfies the preset energy condition, whether to switch the target channel is further determined according to correlations between the audio signals acquired by the target channel and the alternative channel respectively and the preset reference signal.

The preset reference signal may be an audio signal emitted by the terminal itself. For example, when the terminal is in a call with a far end, the reference signal may be the audio played by the terminal from the other party's voice signal, including the other party's audio played in hands-free mode. As another example, it may be background music played by an application running on the terminal, or an audio signal emitted by the terminal system.

Since the preset reference signal is a signal that the audio capturing channel should be able to capture, if the correlation between the audio signal captured by a part of the audio capturing channels and the preset reference signal is low, and the correlation between the audio signal captured by the other audio capturing channels and the preset reference signal is high, it is indicated that the audio capturing channel with low correlation may capture a large noise or the audio capturing channel may malfunction.

Therefore, in the embodiments of the present disclosure, when the signal energy of the target channel satisfies the preset energy condition, it is determined whether the correlations between the audio signals acquired by the target channel and the alternative channel, respectively, and the preset reference signal satisfy the correlation condition, and the correlation condition is used as the trigger condition for switching the target channel.

In this way, sound breaking or heavy noise on the target channel can be detected accurately and the target channel switched, so that an audio acquisition channel that acquires audio signals with good sound quality serves as the target channel.

Similarly to N in the above embodiment, a number of frames M in which the correlation condition must be satisfied may be set here. If the correlation condition is satisfied in at least M frames, it is determined that the target channel needs to be switched. The value of M may also be adjusted according to actual requirements and is not limited herein.

The first correlation threshold and the second correlation threshold may be the same value or different values. If the cross-correlation coefficient between the audio signal acquired by the target channel and the preset reference signal is smaller than the first correlation threshold, the preset reference signal is weaker in the audio signal acquired by the target channel; and if the cross-correlation coefficient of the audio signal acquired by the alternative channel and the preset reference signal is greater than the second correlation threshold value, indicating that the preset reference signal in the audio signal acquired by the alternative channel is stronger. At this time, the target channel may receive a large amount of noise interference, or the target channel may have a fault, and therefore, it may be determined that the target channel needs to be switched.
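The correlation condition with its two thresholds and the M-frame requirement can be sketched as below. The function name `correlation_condition_met` and its parameters are hypothetical. The sketch assumes that each frame has already been reduced to one scalar cross-correlation coefficient per channel (for example, by averaging over frequency points), and that the M frames need not be consecutive within the inspected window; neither detail is specified in the text.

```python
def correlation_condition_met(target_coh, alt_coh,
                              first_threshold, second_threshold, m_frames):
    """Check the correlation condition over a window of recent frames.

    target_coh, alt_coh: per-frame cross-correlation coefficients of the
    target and alternative channels with the reference signal.
    Returns True if, in at least m_frames frames, the target coefficient is
    below first_threshold and the alternative coefficient is above
    second_threshold.
    """
    hits = sum(
        1
        for t, a in zip(target_coh, alt_coh)
        if t < first_threshold and a > second_threshold
    )
    return hits >= m_frames
```

In a two-stage decision, this check would run only after the energy condition has been met, so that a channel is not switched on signal energy alone.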

The autocorrelation coefficient of an audio signal represents the smoothness of the signal. The cross-correlation coefficient between the audio signal acquired by an audio acquisition channel and the reference signal can be calculated based on the first autocorrelation coefficient of that channel and the second autocorrelation coefficient of the reference signal. Here, the audio acquisition channels include the current target channel and the alternative channel.

The first autocorrelation coefficients of the audio acquisition channels may be determined by the following equation (1):

S1(k,l) = gamma*S1(k,l-1) + (1-gamma)*real(X1(k,l).*conj(X1(k,l))) (1)

where S1(k,l) represents the first autocorrelation coefficient of the audio signal acquired by any audio acquisition channel at the k-th frequency point of the l-th frame; gamma is a smoothing factor taking a value between 0 and 1, for example, gamma = 0.8; X1(k,l) is the frequency domain signal of the audio acquisition channel at the k-th frequency point of the l-th frame; conj denotes taking the complex conjugate; and real denotes taking the real part.

The second autocorrelation coefficient of the reference signal may be determined by the following equation (2):

Sfar(k,l) = gamma*Sfar(k,l-1) + (1-gamma)*real(Xfar(k,l).*conj(Xfar(k,l))) (2)

where Sfar(k,l) represents the second autocorrelation coefficient of the reference signal at the k-th frequency point of the l-th frame, and Xfar(k,l) is the frequency domain signal of the reference signal at the k-th frequency point of the l-th frame.

Based on the first autocorrelation coefficient and the second autocorrelation coefficient, the cross-correlation coefficient between the audio signal acquired by the audio acquisition channel and the reference signal may be calculated according to the following formula (3) and formula (4).

Sfar1(k,l) = gamma*Sfar1(k,l-1) + (1-gamma)*real(Xfar(k,l).*conj(X1(k,l))) (3)

cohed(k,l) = real(Sfar1(k,l)*conj(Sfar1(k,l)))/(Sfar(k,l)*S1(k,l)) (4)

where Sfar1(k,l) is an intermediate variable, and cohed(k,l) is the cross-correlation coefficient.
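Formulas (1) to (4) amount to a per-frame recursive update, which the following sketch illustrates in plain Python. The function name `update_coherence`, the dictionary layout of `state`, and the small constant `eps` (added only to avoid division by zero) are assumptions of this sketch and do not appear in the source.

```python
def update_coherence(state, x1, xfar, gamma=0.8, eps=1e-12):
    """Update the smoothed spectra for one frame and return cohed(k, l).

    state: dict with lists "S1", "Sfar", "Sfar1", one entry per frequency
           point, holding the values from the previous frame (l - 1).
    x1, xfar: complex spectra X1(k, l) and Xfar(k, l) of the current frame.
    eps: hypothetical guard against division by zero (not in the source).
    """
    cohed = []
    for k, (a, b) in enumerate(zip(x1, xfar)):
        # Formulas (1) and (2): smoothed auto-power of each signal.
        state["S1"][k] = gamma * state["S1"][k] + (1 - gamma) * (a * a.conjugate()).real
        state["Sfar"][k] = gamma * state["Sfar"][k] + (1 - gamma) * (b * b.conjugate()).real
        # Formula (3): smoothed real part of the cross term.
        state["Sfar1"][k] = gamma * state["Sfar1"][k] + (1 - gamma) * (b * a.conjugate()).real
        # Formula (4): Sfar1 is real, so Sfar1 * conj(Sfar1) is its square.
        numerator = state["Sfar1"][k] ** 2
        cohed.append(numerator / (state["Sfar"][k] * state["S1"][k] + eps))
    return cohed
```

When the channel spectrum equals the reference spectrum, the coefficient is close to 1 at every frequency point; when the cross term has no real part, the coefficient is 0.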

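In some embodiments, the method further comprises:

acquiring frequency domain noisy signals corresponding to the audio signals acquired by the at least two audio acquisition channels;

determining the power spectral densities of the audio signals acquired by the at least two audio acquisition channels at each frequency point according to the frequency domain noisy signals;

and determining the signal energy as the weighted sum, according to a preset weight, of the power spectral density of the audio signal at each frequency point.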

In the embodiments of the present disclosure, the audio signal collected by an audio acquisition channel may be subjected to time-frequency conversion, converting the time-domain signal into a frequency domain noisy signal. The power spectral density of the frequency domain noisy signal is then calculated. The power spectral density describes the power of the frequency domain noisy signal at each frequency point, so the signal energy can be obtained by weighting the values of the power spectral density at the frequency points according to the preset weight. The weight here may be a smoothing factor taking a value between 0 and 1, for example, 0.8.
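One possible reading of this energy computation is sketched below. The text does not spell out exactly how the weight enters the calculation, so the sketch makes two assumptions and labels them as such: the power spectral density is estimated by recursive smoothing with a factor `alpha`, and a separate list of per-frequency-point `weights` selects the preset frequency band. The function name `frame_energy` is hypothetical.

```python
def frame_energy(spectrum, prev_psd, weights, alpha=0.8):
    """Return (energy, psd) for one frame.

    spectrum: complex values X(k, l) of the frequency domain noisy signal.
    prev_psd: smoothed power spectral density from the previous frame.
    weights:  preset weight per frequency point (assumed; zero outside the
              frequency band of interest).
    alpha:    recursive smoothing factor (assumed; 0.8 follows the example
              value mentioned in the text).
    """
    # Recursively smoothed power at each frequency point.
    psd = [
        alpha * p + (1 - alpha) * abs(x) ** 2
        for p, x in zip(prev_psd, spectrum)
    ]
    # Signal energy as the weighted sum over frequency points.
    energy = sum(w * p for w, p in zip(weights, psd))
    return energy, psd
```

The returned `psd` would be passed back in as `prev_psd` for the next frame, and the resulting energies of the target and alternative channels are the values compared in the energy condition.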

By the method, the signal energy and the data of the correlation can be obtained by performing operation processing on the audio signals acquired by the audio acquisition channel, and then the comparison is performed to judge whether the target channel needs to be switched or not, so that in the audio acquisition process, the appropriate audio acquisition channel can be automatically switched in real time to serve as the target channel, and the acquired signal quality is improved.

The disclosed embodiments also provide the following examples:

When the mobile terminal performs audio acquisition, for example during a mobile phone call, sound breaking or distortion may occur if the user speaks loudly. In the embodiment of the present disclosure, the mobile terminal has at least two audio acquisition channels; for example, the mobile phone shown in fig. 3 uses the audio acquisition channels mic1 and mic2 to capture audio signals. During acquisition, the microphone that provides the audio to be processed is switched on the basis of energy detection, the correlation of the individual channel signals, and the correlation of the far-end signal, so that the clarity of the audio signal is improved and sound breaking is reduced.

In the embodiment of the present disclosure, the processing may be performed by the steps shown in fig. 4:

Step S401, collecting audio signals by using two audio collecting channels;

Here, the audio acquisition channel mic1 may be used as the main channel, i.e., the target channel, whose acquired audio signal undergoes noise reduction processing to obtain the target audio signal. The audio acquisition channel mic2 is used as the alternative channel, which assists the target channel in noise cancellation, that is, assists in the noise reduction of the audio signal acquired by mic1.

The audio signal collected by mic1 is X1 and the audio signal collected by mic2 is X2. After framing, windowing, and Fourier transform are performed on each signal, the frequency spectra X1(k, l) and X2(k, l) at the k-th frequency point of the l-th frame are obtained.
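The framing, windowing, and Fourier transform step can be sketched as follows. This is only an illustration under stated assumptions: the frame length, hop size, and Hann window are hypothetical choices that the disclosure does not specify, and a naive DFT is used so that the sketch needs nothing beyond the standard library.

```python
import cmath
import math

def hann(n):
    """Periodic Hann window of length n (an assumed window choice)."""
    return [0.5 - 0.5 * math.cos(2 * math.pi * i / n) for i in range(n)]

def dft(frame):
    """Naive O(N^2) DFT, returning the non-negative-frequency bins only."""
    n = len(frame)
    return [
        sum(frame[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
        for k in range(n // 2 + 1)
    ]

def stft(signal, frame_len=64, hop=32):
    """Return spectra[l][k] = X(k, l) for each frame l and frequency point k."""
    window = hann(frame_len)
    spectra = []
    for start in range(0, len(signal) - frame_len + 1, hop):
        frame = [signal[start + i] * window[i] for i in range(frame_len)]
        spectra.append(dft(frame))
    return spectra

# Usage: a pure tone that falls exactly on frequency point k = 4.
x1 = [math.sin(2 * math.pi * 4 * t / 64) for t in range(256)]
X1 = stft(x1)
peak_bin = max(range(len(X1[0])), key=lambda k: abs(X1[0][k]))
```

In this sketch X1[l][k] plays the role of X1(k, l); the same function would be applied to the signal of mic2 to obtain X2(k, l).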

When the signal energy of mic1 is too large, sound breaking easily occurs and sounds harsh to the human ear. The quality of the final audio signal can therefore be improved by switching the target channel, that is, by using mic2 as the target channel and mic1 as the alternative channel.

Step S402, carrying out energy detection on the audio signals acquired by the two audio acquisition channels;

The energy detection may use the following equations (5) and (6):

Mic1_sp＝∑a*Mic1_sp(k,l-1)+(1-a)*P1(k,l) (5)

Mic2_sp＝∑a*Mic2_sp(k,l-1)+(1-a)*P2(k,l) (6)

where Mic1_sp and Mic2_sp are the signal energies of mic1 and mic2, respectively, P1(k, l) and P2(k, l) are the power spectra of mic1 and mic2, respectively, and a is a smoothing factor, which may be set to 0.8. The signal energy is obtained by a weighted summation of the power spectrum over the frequency points. It should be noted that the summation may be performed over only part of the frequency band. For example, the two audio acquisition channels differ considerably in the frequency band from 1 kHz (kilohertz) to 4 kHz, so the summation may be performed on the power spectrum of the frequency points k belonging to 1 kHz to 4 kHz.
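A minimal sketch of the band-limited energy of equations (5) and (6) is given below. It assumes one reading of the formulas, namely that the power of each frequency point is smoothed recursively and the smoothed values are then summed over the selected band. The sampling rate, the frame length, and the function names band_bins and update_energy are hypothetical.

```python
def band_bins(sample_rate, frame_len, f_lo=1000.0, f_hi=4000.0):
    """Indices k whose centre frequency k*sample_rate/frame_len lies in [f_lo, f_hi]."""
    return [k for k in range(frame_len // 2 + 1)
            if f_lo <= k * sample_rate / frame_len <= f_hi]

def update_energy(prev_sp, spectrum, bins, a=0.8):
    """One frame of equation (5) or (6) for a single channel.

    prev_sp  -- per-bin smoothed power from frame l-1 (dict k -> value)
    spectrum -- complex spectrum X(k, l) of the current frame
    Returns (new per-bin smoothed power, band energy summed over bins).
    """
    new_sp = {}
    for k in bins:
        power = abs(spectrum[k]) ** 2            # P(k, l)
        new_sp[k] = a * prev_sp.get(k, 0.0) + (1 - a) * power
    return new_sp, sum(new_sp.values())

# Usage with made-up numbers: 16 kHz sampling, 64-point frames.
bins = band_bins(16000, 64)        # bins 4..16 cover 1 kHz to 4 kHz
flat = [1 + 0j] * 33               # toy spectrum with unit power in every bin
sp, energy = update_energy({}, flat, bins)
sp, energy2 = update_energy(sp, flat, bins)
```

Calling update_energy once per frame for mic1 and once for mic2 yields the two values that play the roles of Mic1_sp and Mic2_sp.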

Here, a detection threshold Eth may be set as the threshold at which the energy satisfies the switching condition. If the signal energy satisfies the detection threshold condition, the correlation coefficient is then calculated.

Here, whether the signal energy satisfies the condition of the detection threshold may be judged using the following equation (7):

Mic1_sp>Eth*Mic2_sp (7)

That is, when the signal energy of mic1 is greater than the product of the signal energy of mic2 and the detection threshold, the detection threshold condition is satisfied and the calculation of the correlation coefficient may proceed. Alternatively, the correlation coefficient may be calculated directly, without first determining whether the signal energy satisfies the detection threshold condition, and the signal energy and the correlation coefficient are then combined to determine whether to switch the target channel.

If whether to switch the target channel is judged only according to the signal energy, count_choose may be set as the accumulated number of frames whose energy exceeds the threshold. For example, the switching threshold may be set to 5 frames: when count_choose is greater than or equal to 5, the target channel is switched, so that mic2 is used as the target channel and mic1 as the alternative channel.
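The energy-only decision described above can be sketched as a small frame counter. The class name EnergySwitcher, the value Eth = 2.0, and the choice to restart the counter after a switch are assumptions made for illustration; the disclosure gives only the 5-frame example.

```python
class EnergySwitcher:
    """Counts frames satisfying Mic1_sp > Eth * Mic2_sp and flags a switch."""

    def __init__(self, eth=2.0, frames_needed=5):
        self.eth = eth                    # hypothetical value; the source gives none
        self.frames_needed = frames_needed
        self.count_choose = 0
        self.target = "mic1"

    def update(self, mic1_sp, mic2_sp):
        """Process one frame of band energies; return the current target channel."""
        if self.target == "mic1" and mic1_sp > self.eth * mic2_sp:   # equation (7)
            self.count_choose += 1
        if self.count_choose >= self.frames_needed:
            self.target = "mic2"          # mic1 becomes the alternative channel
            self.count_choose = 0         # assumed: counter restarts after a switch
        return self.target

# Usage: mic1 is ten times louder than mic2 for six consecutive frames.
switcher = EnergySwitcher()
history = [switcher.update(10.0, 1.0) for _ in range(6)]
```

Requiring several frames before switching keeps a single loud frame from toggling the target channel.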

Step S403, calculating a correlation coefficient of the audio signal of the audio acquisition channel;

The correlation coefficient is considered because the selection of the target channel is also related to echo cancellation. The echo is the audio that is emitted by the terminal itself and collected by the audio acquisition channel, for example, the voice of the other party during a call, i.e., the far-end speech. In the disclosed embodiments, the audio signal of the far-end speech is represented by a reference signal.

Here, the cross-correlation coefficient between each audio signal and the reference signal is calculated in consideration of the correlation between the audio signal acquired by each audio acquisition channel and the reference signal.

The autocorrelation coefficients of each audio acquisition channel can be determined by equation (1) in the above embodiment:

S1(k,l)＝gamma*S1(k,l-1)+(1-gamma)*real(X1(k,l).*conj(X1(k,l))) (1)

S1(k, l) represents the autocorrelation coefficient of the audio signal acquired by any audio acquisition channel at the k-th frequency point of the l-th frame; gamma is a smoothing factor that takes a value between 0 and 1, for example, gamma = 0.8; X1(k, l) is the frequency-domain signal of the audio acquisition channel at the k-th frequency point of the l-th frame; conj denotes taking the conjugate; and real denotes taking the real part.

The autocorrelation coefficient of the reference signal can be determined by equation (2) in the above embodiment:

Sfar(k,l)＝gamma*Sfar(k,l-1)+(1-gamma)*real(Xfar(k,l).*conj(Xfar(k,l))) (2)

The cross-correlation coefficient between the audio signal collected by the audio collection channel and the reference signal is calculated according to formula (3) and formula (4) in the above embodiments.

Sfar1(k,l)＝gamma*Sfar1(k,l-1)+(1-gamma)*real(Xfar(k,l).*conj(X1(k,l))) (3)

cohed(k,l)＝real(Sfar1(k,l)*conj(Sfar1(k,l)))/(Sfar(k,l)*S1(k,l)) (4)

It should be noted that the calculation of the correlation coefficient may be performed at all frequency points, that is, the correlation coefficient is expressed at all frequency points of k.
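The recursions of equations (1) to (4) can be sketched per frame as follows. The function names and the small constant eps, which avoids division by zero, are assumptions that do not appear in the disclosure. Because equation (3) keeps only the real part, Sfar1(k, l) is real and the numerator of equation (4) reduces to its square.

```python
def update_coherence(state, x, xfar, gamma=0.8, eps=1e-12):
    """Advance equations (1)-(4) by one frame for every frequency point k.

    state -- dict with lists "S1", "Sfar", "Sfar1" from frame l-1
    x     -- complex spectrum X1(k, l) of one audio acquisition channel
    xfar  -- complex spectrum Xfar(k, l) of the reference signal
    Returns the list cohed(k, l).
    """
    cohed = []
    for k in range(len(x)):
        auto = (x[k] * x[k].conjugate()).real             # |X1|^2
        auto_far = (xfar[k] * xfar[k].conjugate()).real   # |Xfar|^2
        cross = (xfar[k] * x[k].conjugate()).real
        state["S1"][k] = gamma * state["S1"][k] + (1 - gamma) * auto            # (1)
        state["Sfar"][k] = gamma * state["Sfar"][k] + (1 - gamma) * auto_far    # (2)
        state["Sfar1"][k] = gamma * state["Sfar1"][k] + (1 - gamma) * cross     # (3)
        denom = state["Sfar"][k] * state["S1"][k] + eps   # eps guards against 0/0
        cohed.append(state["Sfar1"][k] ** 2 / denom)                            # (4)
    return cohed

def new_state(n_bins):
    """Zero-initialised smoothing state for n_bins frequency points."""
    return {name: [0.0] * n_bins for name in ("S1", "Sfar", "Sfar1")}

# Usage: a channel that is a scaled copy of the reference is fully coherent,
# while a channel in quadrature with the reference has zero real cross term.
ref = [1 + 1j, 2 - 1j]
echo_state, quad_state = new_state(2), new_state(2)
coh_echo = update_coherence(echo_state, [0.5 * v for v in ref], ref)
coh_quad = update_coherence(quad_state, [1j * v for v in ref], ref)
```

A separate state would be kept for each audio acquisition channel, so that mic1 and mic2 each obtain their own coefficient with respect to the reference signal.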

Step S404, judging whether to switch the target channel by combining the energy detection and the correlation coefficient.

Here, it may be determined whether the correlation coefficient satisfies the threshold condition using the following equation (8):

cohed1<Ethrod1&&cohed2>Ethrod2 (8)

where cohed1 and cohed2 are the cross-correlation coefficients of mic1 and mic2 with the reference signal, respectively. Ethrod1 is the threshold for the cross-correlation coefficient of mic1 with the reference signal, and Ethrod2 is the threshold for the cross-correlation coefficient of mic2 with the reference signal. When a frame satisfies the above condition, a counter may be incremented, and when the count value reaches the preset threshold, the target channel is switched. That is, count_choose may be set as the accumulated number of frames that satisfy the above condition. For example, the switching threshold may be set to 5 frames: when count_choose is greater than or equal to 5, the target channel is switched, so that mic2 is used as the target channel and mic1 as the alternative channel.
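One possible reading of step S404, in which a frame counts toward switching only when it satisfies both equation (7) and equation (8), can be sketched as follows. The threshold values, the function names, and the averaging of cohed(k, l) over the frequency points to obtain the scalars cohed1 and cohed2 are all assumptions; the disclosure does not state how the per-frequency coefficients are reduced.

```python
def mean(values):
    """Arithmetic mean of a non-empty list."""
    return sum(values) / len(values)

def frame_votes_for_switch(mic1_sp, mic2_sp, cohed1_bins, cohed2_bins,
                           eth=2.0, ethrod1=0.3, ethrod2=0.6):
    """True when one frame satisfies both equation (7) and equation (8).

    All three thresholds are placeholder values, not taken from the source.
    """
    energy_ok = mic1_sp > eth * mic2_sp                       # equation (7)
    cohed1, cohed2 = mean(cohed1_bins), mean(cohed2_bins)     # assumed reduction over k
    correlation_ok = cohed1 < ethrod1 and cohed2 > ethrod2    # equation (8)
    return energy_ok and correlation_ok

def choose_target(frames, frames_needed=5):
    """Return the target channel after processing a sequence of frames."""
    count_choose = 0
    for frame in frames:
        if frame_votes_for_switch(*frame):
            count_choose += 1
        if count_choose >= frames_needed:
            return "mic2"        # mic1 becomes the alternative channel
    return "mic1"

# Usage: each frame is (Mic1_sp, Mic2_sp, cohed1 per bin, cohed2 per bin).
loud = (10.0, 1.0, [0.1, 0.2], [0.7, 0.9])    # meets (7) and (8)
quiet = (1.0, 1.0, [0.1, 0.2], [0.7, 0.9])    # fails (7)
result_switch = choose_target([loud] * 5)
result_stay = choose_target([loud] * 4 + [quiet] * 3)
```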

Through the technical scheme of the embodiment of the disclosure, the target channel is automatically switched by utilizing the energy detection and the calculation of the correlation coefficient, so that the sound breaking phenomenon caused by overlarge sound can be well reduced, and the definition of the voice signal is improved.

Fig. 5 is a block diagram illustrating a configuration of an apparatus for processing an audio signal according to an exemplary embodiment, and as shown in fig. 5, the apparatus 500 includes:

an acquisition module 501, configured to acquire an audio signal by using at least two audio acquisition channels; wherein the audio acquisition channels comprise: a target channel and an alternate channel; the audio signal collected by the alternative channel is used for carrying out noise filtering on the audio signal collected by the target channel;

a first determining module 502, configured to determine, according to signal energy of an acquired audio signal, that signal energy of a target channel meets a preset energy condition;

the switching module 503 is configured to switch the audio acquisition channel corresponding to the target channel when the signal energy of the target channel meets a preset energy condition.

In some embodiments, the switching module comprises:

In some embodiments, the determining sub-module comprises:

In some embodiments, the apparatus further comprises:

a third determining module, configured to determine the signal energy according to the weighted sum of the power spectral densities of the audio signal at each frequency point according to the preset weight.

With regard to the apparatus in the above-described embodiment, the specific manner in which each module performs its operation has been described in detail in the embodiment related to the method, and will not be elaborated here.

Fig. 6 is a block diagram illustrating an apparatus 600 for processing an audio signal according to an exemplary embodiment. For example, the apparatus 600 may be a mobile phone, a computer, a digital broadcast terminal, a messaging device, a game console, a tablet device, a medical device, an exercise device, a personal digital assistant, and so forth.

Referring to fig. 6, apparatus 600 may include one or more of the following components: a processing component 601, a memory 602, a power component 603, a multimedia component 604, an audio component 605, an input/output (I/O) interface 606, a sensor component 607, and a communication component 608.

The processing component 601 generally controls the overall operation of the device 600, such as operations associated with display, telephone calls, data communications, camera operations, and recording operations. The processing component 601 may include one or more processors 610 to execute instructions to perform all or a portion of the steps of the methods described above. Further, the processing component 601 may also include one or more modules that facilitate interaction between the processing component 601 and other components. For example, the processing component 601 may include a multimedia module to facilitate interaction between the multimedia component 604 and the processing component 601.

The memory 602 is configured to store various types of data to support operations at the apparatus 600. Examples of such data include instructions for any application or method operating on the apparatus 600, contact data, phonebook data, messages, pictures, videos, and so forth. The memory 602 may be implemented by any type or combination of volatile or non-volatile storage devices, such as Static Random Access Memory (SRAM), electrically erasable programmable read-only memory (EEPROM), erasable programmable read-only memory (EPROM), programmable read-only memory (PROM), read-only memory (ROM), magnetic memory, flash memory, magnetic or optical disks.

The power supply component 603 provides power to the various components of the device 600. The power supply component 603 may include: a power management system, one or more power sources, and other components associated with generating, managing, and distributing power for the device 600.

The multimedia component 604 includes a screen that provides an output interface between the device 600 and a user. In some embodiments, the screen may include a Liquid Crystal Display (LCD) and a Touch Panel (TP). If the screen includes a touch panel, the screen may be implemented as a touch screen to receive an input signal from a user. The touch panel includes one or more touch sensors to sense touch, slide, and gestures on the touch panel. The touch sensor may not only sense the boundary of a touch or slide action, but also detect the duration and pressure associated with the touch or slide operation. In some embodiments, the multimedia component 604 includes a front facing camera and/or a rear facing camera. The front camera and/or the rear camera may receive external multimedia data when the device 600 is in an operating mode, such as a shooting mode or a video mode. Each front camera and/or rear camera may be a fixed optical lens system or have a focal length and optical zoom capability.

The audio component 605 is configured to output and/or input audio signals. For example, the audio component 605 includes a Microphone (MIC) configured to receive external audio signals when the apparatus 600 is in an operational mode, such as a call mode, a recording mode, or a voice recognition mode. The received audio signals may further be stored in the memory 602 or transmitted via the communication component 608. In some embodiments, the audio component 605 also includes a speaker for outputting audio signals.

The I/O interface 606 provides an interface between the processing component 601 and peripheral interface modules, which may be keyboards, click wheels, buttons, etc. These buttons may include, but are not limited to: a home button, a volume button, a start button, and a lock button.

The sensor component 607 includes one or more sensors for providing various aspects of status assessment for the apparatus 600. For example, the sensor component 607 may detect the open/closed state of the apparatus 600 and the relative positioning of components, such as a display and keypad of the apparatus 600. The sensor component 607 may also detect a change in the position of the apparatus 600 or a component of the apparatus 600, the presence or absence of user contact with the apparatus 600, the orientation or acceleration/deceleration of the apparatus 600, and a change in the temperature of the apparatus 600. The sensor component 607 may include a proximity sensor configured to detect the presence of nearby objects in the absence of any physical contact. The sensor component 607 may also include a light sensor, such as a CMOS or CCD image sensor, for use in imaging applications. In some embodiments, the sensor component 607 may also include an acceleration sensor, a gyroscope sensor, a magnetic sensor, a pressure sensor, or a temperature sensor.

The communication component 608 is configured to facilitate wired or wireless communication between the apparatus 600 and other devices. The apparatus 600 may access a wireless network based on a communication standard, such as WiFi, 2G, or 3G, or a combination thereof. In an exemplary embodiment, the communication component 608 receives a broadcast signal or broadcast related information from an external broadcast management system via a broadcast channel. In an exemplary embodiment, the communication component 608 further includes a Near Field Communication (NFC) module to facilitate short-range communications. For example, the NFC module may be implemented based on Radio Frequency Identification (RFID) technology, infrared data association (IrDA) technology, Ultra Wideband (UWB) technology, Bluetooth (BT) technology, or other technologies.

In an exemplary embodiment, the apparatus 600 may be implemented by one or more Application Specific Integrated Circuits (ASICs), Digital Signal Processors (DSPs), Digital Signal Processing Devices (DSPDs), Programmable Logic Devices (PLDs), Field Programmable Gate Arrays (FPGAs), controllers, micro-controllers, microprocessors or other electronic components for performing the above-described methods.

In an exemplary embodiment, a non-transitory computer readable storage medium comprising instructions, such as the memory 602 comprising instructions, executable by the processor 610 of the apparatus 600 to perform the above-described method is also provided. For example, the non-transitory computer readable storage medium may be a ROM, a Random Access Memory (RAM), a CD-ROM, a magnetic tape, a floppy disk, an optical data storage device, and the like.

The embodiments of the present disclosure also provide a non-transitory computer-readable storage medium, where instructions in the storage medium, when executed by a processor of a mobile terminal, enable the mobile terminal to perform the method provided in any of the embodiments.

Other embodiments of the invention will be apparent to those skilled in the art from consideration of the specification and practice of the invention disclosed herein. This application is intended to cover any variations, uses, or adaptations of the invention following, in general, the principles of the invention and including such departures from the present disclosure as come within known or customary practice within the art to which the invention pertains. It is intended that the specification and examples be considered as exemplary only, with a true scope and spirit of the invention being indicated by the following claims.

It will be understood that the invention is not limited to the precise arrangements described above and shown in the drawings and that various modifications and changes may be made without departing from the scope thereof. The scope of the invention is limited only by the appended claims.

Claims

1. A method of processing an audio signal, comprising:

2. The method according to claim 1, wherein switching the audio acquisition channel corresponding to the target channel when the signal energy of the target channel satisfies a preset energy condition comprises:

3. The method according to claim 2, wherein switching the audio acquisition channel corresponding to the target channel if the signal energy of the target channel exceeds a product of the signal energy of the candidate channel and a preset reference threshold comprises:

4. The method according to any one of claims 1 to 3, wherein switching the audio capture channel corresponding to the target channel when the signal energy of the target channel satisfies a preset energy condition comprises:

5. The method according to claim 4, wherein the switching the audio acquisition channel corresponding to the target channel if the cross-correlation coefficient between the audio signals acquired by the at least two audio acquisition channels and the reference signal satisfies a preset correlation condition comprises:

6. The method of claim 4, wherein the determining cross-correlation coefficients of the audio signals acquired by the at least two audio acquisition channels with a preset reference signal comprises:

7. The method of claim 1, further comprising:

8. An apparatus for processing an audio signal, comprising:

9. The apparatus of claim 8, wherein the switching module comprises:

10. The apparatus of claim 9, wherein the first switching submodule is specifically configured to:

11. The apparatus according to any one of claims 8 to 10, wherein the switching module comprises:

12. The apparatus according to claim 11, wherein the second switching submodule is specifically configured to:

13. The apparatus of claim 12, wherein the determining sub-module comprises:

14. The apparatus of claim 8, further comprising:

15. Communication device of a terminal, characterized in that it comprises at least: a processor and a memory for storing executable instructions operable on the processor, wherein:

the processor is configured to execute the executable instructions, and the executable instructions perform the steps in the communication method of the terminal as provided in any one of the preceding claims 1 to 7.

16. A non-transitory computer-readable storage medium, wherein computer-executable instructions are stored in the computer-readable storage medium, and when executed by a processor, implement the steps in the communication method of the terminal provided in any one of claims 1 to 7.