CN110769352A

CN110769352A - Signal processing method and device and computer storage medium

Info

Publication number: CN110769352A
Application number: CN201810826906.7A
Authority: CN
Inventors: 崔腾飞
Original assignee: Xian Zhongxing New Software Co Ltd
Current assignee: Xian Zhongxing New Software Co Ltd
Priority date: 2018-07-25
Filing date: 2018-07-25
Publication date: 2020-02-07
Anticipated expiration: 2038-07-25
Also published as: WO2020020247A1; CN110769352B

Abstract

The embodiment of the invention discloses a signal processing method, a signal processing device and a computer storage medium, wherein a first audio signal is received and played by at least one loudspeaker; acquiring at least one echo estimation signal corresponding to the first audio signal according to at least one loudspeaker model, at least one microphone model and the first audio signal; wherein the at least one speaker model is derived based on at least one speaker and the at least one microphone model is derived based on at least one microphone; receiving a second audio signal with at least one microphone; wherein the second audio signal comprises an echo signal produced by the first audio signal output by the at least one speaker and received by the at least one microphone; and subtracting at least one echo estimation signal corresponding to the first audio signal from the second audio signal to obtain an audio signal after echo processing.

Description

Signal processing method and device and computer storage medium

Technical Field

The present invention relates to the field of audio signal processing technologies, and in particular, to a signal processing method and apparatus, and a computer storage medium.

Background

During a call, people sometimes hear their own voice. This happens mainly because the voice played by the speaker at the other party's end is picked up by the microphone (MIC) there and transmitted back; this is the so-called echo, and its cause is acoustic. Echo therefore generally occurs in any duplex scenario in which a speaker and a MIC operate together, such as terminal calls, Personal Computer (PC) network calls, Personal Digital Assistant (PDA) network calls, video-while-broadcasting scenarios, and so on.

As the name implies, echo cancellation is a technology that processes the speaker-played sound received by the MIC so that only sound not played by the speaker is retained. At present the most common echo cancellation technology targets a single loudspeaker (receiver); stereo echo cancellation schemes differ considerably from one another, and multi-channel echo cancellation is rarely used and its solutions vary widely. However, in these prior-art solutions, when the high-frequency resonance peak FH of the speaker or MIC is relatively low (e.g., around 4 kHz), the signal received at the MIC is higher than the reference signal at FH and cannot be processed cleanly, which easily causes loop amplification and thereby produces echo and howling.

Disclosure of Invention

In view of the above, the main objective of the present invention is to provide a signal processing method, device and computer storage medium, which effectively solve the echo and whistling phenomena easily generated when the high-frequency resonance peak of the speaker or microphone is low, and also reduce the calculation workload in the multi-microphone design.

In order to achieve the purpose, the technical scheme of the invention is realized as follows:

in a first aspect, an embodiment of the present invention provides a signal processing method, which is applied to a signal processing apparatus having at least one speaker and at least one microphone, and includes:

receiving a first audio signal and playing the first audio signal by the at least one speaker;

acquiring at least one echo estimation signal corresponding to the first audio signal according to at least one loudspeaker model, at least one microphone model and the first audio signal; wherein the at least one speaker model is derived based on the at least one speaker and the at least one microphone model is derived based on the at least one microphone;

receiving a second audio signal with the at least one microphone; wherein the second audio signal comprises an echo signal produced by the first audio signal output by the at least one speaker and received by the at least one microphone;

and subtracting at least one echo estimation signal corresponding to the first audio signal from the second audio signal to obtain an audio signal after echo processing.

In a second aspect, an embodiment of the present invention provides a signal processing apparatus, including: at least one speaker, at least one microphone, a first receiving section, a first acquiring section, a second receiving section, and a second acquiring section, wherein,

the first receiving part is configured to receive a first audio signal and play the first audio signal by the at least one loudspeaker;

the first obtaining part is configured to obtain at least one echo estimation signal corresponding to the first audio signal according to at least one loudspeaker model, at least one microphone model and the first audio signal; wherein the at least one speaker model is derived based on the at least one speaker and the at least one microphone model is derived based on the at least one microphone;

the second receiving portion configured to receive a second audio signal with the at least one microphone; wherein the second audio signal comprises an echo signal produced by the first audio signal output by the at least one speaker and received by the at least one microphone;

the second obtaining part is configured to subtract at least one echo estimation signal corresponding to the first audio signal from the second audio signal to obtain an audio signal after echo processing.

In a third aspect, an embodiment of the present invention provides a signal processing apparatus, including: a network interface, a memory, and a processor; wherein,

the network interface is used for receiving and sending signals in the process of receiving and sending information with other external network elements;

the memory for storing a computer program operable on the processor;

the processor is configured to, when running the computer program, perform the steps of the method of signal processing of the first aspect.

In a fourth aspect, an embodiment of the present invention provides a computer storage medium storing a signal processing program, where the signal processing program, when executed by at least one processor, implements the steps of the method for signal processing according to the first aspect.

The embodiment of the invention provides a signal processing method, a signal processing device and a computer storage medium, which are applied to a signal processing device with at least one loudspeaker and at least one microphone, wherein a first audio signal is received and played by the at least one loudspeaker; acquiring at least one echo estimation signal corresponding to the first audio signal according to at least one loudspeaker model, at least one microphone model and the first audio signal; wherein the at least one speaker model is derived based on the at least one speaker and the at least one microphone model is derived based on the at least one microphone; receiving a second audio signal with the at least one microphone; wherein the second audio signal comprises an echo signal produced by the first audio signal output by the at least one speaker and received by the at least one microphone; subtracting at least one echo estimation signal corresponding to the first audio signal from the second audio signal to obtain an echo-processed audio signal; therefore, the echo howling phenomenon easily generated when the high-frequency resonance peak value of the loudspeaker or the microphone is low is effectively solved, and the calculation workload in the multi-microphone design is reduced.

Drawings

Fig. 1 is a schematic circuit diagram of a single speaker and a single microphone according to the related art;

fig. 2 is a schematic circuit diagram of a single speaker and a dual microphone according to the related art;

fig. 3 is a schematic circuit diagram of a dual speaker and a single microphone according to the related art;

fig. 4 is a graph comparing an echo reference signal and a microphone recording signal according to a related art;

fig. 5 is a graph comparing an echo reference signal and a microphone recording signal according to another related art;

fig. 6 is a schematic circuit diagram of a dual speaker and a single microphone according to the related art;

fig. 7 is a schematic circuit diagram of a dual speaker and dual microphones according to the related art;

fig. 8 is a schematic flowchart of a signal processing method according to an embodiment of the present invention;

fig. 9 is a schematic circuit diagram of a four-speaker and single-microphone circuit according to an embodiment of the present invention;

fig. 10 is a schematic circuit diagram of another four-speaker and single-microphone circuit according to an embodiment of the present invention;

fig. 11 is a schematic circuit diagram of a four-speaker and two-microphone system according to an embodiment of the present invention;

fig. 12 is a schematic circuit diagram of another four-speaker and single-microphone circuit according to an embodiment of the present invention;

fig. 13 is a detailed flowchart of a signal processing method according to an embodiment of the present invention;

fig. 14 is a detailed flowchart of another signal processing method according to an embodiment of the present invention;

fig. 15 is a detailed flowchart of another signal processing method according to an embodiment of the present invention;

fig. 16 is a detailed flowchart of another signal processing method according to an embodiment of the present invention;

fig. 17 is a detailed flowchart of another signal processing method according to an embodiment of the present invention;

fig. 18 is a schematic circuit diagram of a single speaker and a single microphone according to an embodiment of the present invention;

fig. 19 is a detailed flowchart of another signal processing method according to an embodiment of the present invention;

fig. 20 is a schematic circuit diagram of a dual speaker and a single microphone according to an embodiment of the present invention;

fig. 21 is a schematic structural diagram of a signal processing apparatus according to an embodiment of the present invention;

fig. 22 is a schematic structural diagram of another signal processing apparatus according to an embodiment of the present invention;

fig. 23 is a schematic structural diagram of another signal processing apparatus according to an embodiment of the present invention;

fig. 24 is a schematic structural diagram of a further signal processing apparatus according to an embodiment of the present invention;

fig. 25 is a schematic structural diagram of a further signal processing apparatus according to an embodiment of the present invention;

fig. 26 is a schematic diagram of a specific hardware structure of a signal processing apparatus according to an embodiment of the present invention.

Detailed Description

The technical solution in the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings in the embodiments of the present invention.

Echo is mainly the repetition of sound caused by the reflection of sound waves; that is, sound emitted by a sound source is reflected back to the source position. Electronic devices generally use both a microphone and a speaker. The microphone transmits voice or other sound data from the near end to the far end, while the speaker plays the sound data received from the far end. In a typical hands-free system, the speaker is placed in close proximity to the microphone, and the sound emitted by the speaker is immediately picked up by the microphone; this is the so-called echo. Without treatment, the echo is heard by the user at the far end, creating an unexpectedly loud noise and an unpleasant psychoacoustic experience. Therefore, echo cancellation techniques are introduced to process the echo picked up by the microphone.

Currently, single-speaker echo cancellation is widely used, that is, only one speaker plays sound during a call. For example, referring to fig. 1, an application example of a circuit structure 10 of a single speaker and a single microphone according to the related art is shown; as shown in fig. 1, the circuit structure 10 includes: a speaker 101, a microphone 102, an adder 103, an Adaptive Filter (AF) module 104, a speech processing module 105, a noise reduction and other processing module 106, a decoder 107, an encoder 108, and a radio frequency terminal 109; the radio frequency terminal 109 sends the received audio signal to the decoder 107, the decoder 107 demodulates the audio signal, and the demodulated audio signal enters the speech processing module 105 for voice processing such as noise reduction and filtering; the voice signal is then played by the speaker 101; the audio signal recorded by the microphone 102 includes an echo signal generated by the audio signal played by the speaker 101; in order to eliminate the echo signal as far as possible, an audio signal is extracted at the front end of the speaker 101 as a reference signal, the reference signal is input to the adder 103 after passing through the AF module 104, the audio signal recorded by the microphone 102 is also input to the adder 103, and the two signals are subtracted in the adder 103, so that the audio signal recorded by the microphone 102 is subjected to echo processing; the echo-processed audio signal is further processed by the noise reduction and other processing module 106, modulated in the encoder 108, and finally sent out over the transmission line by the radio frequency terminal 109.
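The related-art structure of fig. 1 can be illustrated with a minimal sketch. The source only states that an Adaptive Filter (AF) module and an adder are used; the normalized LMS (NLMS) update, the tap count, the step size `mu` and the synthetic echo path below are all assumptions chosen for illustration, not details taken from the related art.

```python
import random


def nlms_echo_cancel(reference, mic, taps=8, mu=0.5, eps=1e-8):
    """Subtract an adaptively filtered copy of `reference` from `mic`.

    NLMS is an assumed choice of adaptation rule; the source only
    refers to an "AF module" followed by an adder.
    """
    w = [0.0] * taps        # adaptive filter coefficients (AF module)
    history = [0.0] * taps  # most recent reference samples, newest first
    out = []
    for x, d in zip(reference, mic):
        history = [x] + history[:-1]
        echo_est = sum(wi * hi for wi, hi in zip(w, history))
        e = d - echo_est    # adder output: mic signal minus estimated echo
        norm = sum(h * h for h in history) + eps
        w = [wi + mu * e * hi / norm for wi, hi in zip(w, history)]
        out.append(e)
    return out, w


random.seed(0)
far_end = [random.uniform(-1.0, 1.0) for _ in range(4000)]
# Hypothetical echo path: a two-sample delay with gain 0.6 plus a weaker tap.
echo_path = [0.0, 0.0, 0.6, 0.3]
mic = [
    sum(echo_path[k] * far_end[n - k] for k in range(len(echo_path)) if n - k >= 0)
    for n in range(len(far_end))
]
residual, weights = nlms_echo_cancel(far_end, mic)
tail_power = sum(e * e for e in residual[-500:]) / 500
```

In this idealized case the echo path is purely linear and fits within the filter length, so the residual decays toward zero. The sketch does not model the resonance-peak behaviour discussed for fig. 5.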

Referring to fig. 2, an application example of a circuit configuration 20 of a single speaker and a dual microphone of the related art is shown; as shown in fig. 2, the circuit structure 20 includes: a loudspeaker 201, a first microphone 202a, a second microphone 202b, a first adder 203a, a second adder 203b, a first AF module 204a, a second AF module 204b, a voice processing module 205, a noise reduction and other processing module 206, a decoder 207, an encoder 208 and a transmitting terminal 209; the audio signal at the front end of the speaker 201 is still selected as a reference signal, and the reference signal performs echo processing on the audio signal received by the first microphone 202a through the first AF module 204a and the first adder 203a on one hand, and performs echo processing on the audio signal received by the second microphone 202b through the second AF module 204b and the second adder 203b on the other hand, and the audio signal after the echo processing is input to the noise reduction and other processing module 206 for noise reduction processing, modulated in the encoder 208, and finally transmitted out through the transmission line from the transmitting terminal 209.

Referring to fig. 3, an application example of a circuit structure 30 of a dual speaker and single microphone of the related art is shown; as shown in fig. 3, the circuit structure 30 includes: a first speaker 301a, a second speaker 301b, a microphone 302, an adder 303, an AF module 304, a voice processing module 305, a noise reduction and other processing module 306, a decoder 307, an encoder 308, and a transmitting terminal 309; the audio signals at the front ends of the first speaker 301a and the second speaker 301b are still selected as reference signals for echo processing, the reference signals perform echo processing on the audio signals received by the microphone 302 through the AF module 304 and the adder 303, the audio signals after echo processing are input to the noise reduction and other processing module 306 for noise reduction processing, then modulated in the encoder 308, and finally sent out over the transmission line by the transmitting terminal 309.

In the process of implementing the present invention, it was found that the echo processing methods in the above echo cancellation technology have an obvious defect. Referring to fig. 4 and fig. 5, when the high-frequency resonance peak of the frequency response of the speaker or the microphone is relatively low, for example when the resonance peak of the speaker is near 4 kHz, the echo signal cannot be processed cleanly, and if the hands-free loudness of the speaker is relatively high, echo easily occurs. This is because the reference signal is only the digital signal at the front end of the speaker, and the acoustic influence introduced by the speaker and the microphone is not considered; the signal near the resonance peak is amplified, which causes loop amplification of the sound signal near the resonance peak, so that the echo cannot be removed. As can be seen from fig. 4, the reference signal of the echo is larger than the microphone recording signal (i.e., the echo signal to be processed) and the linearity between the two signals is relatively good, so the echo in this case is easy to process; however, as can be seen from fig. 5, near 4 kHz the microphone recording signal (i.e., the echo signal to be processed) is larger than the reference signal of the echo, and the echo is difficult to process.

In addition, the circuit structures in fig. 1 to fig. 2 are suitable for a single speaker application scenario, and for a multi-speaker application scenario, for example, as shown in fig. 3, if only one path of signal at the front end of a certain speaker is used as a reference signal of an echo to perform echo processing, the influence of overlapping of sounds of multiple speakers is not considered, so that the effect of echo processing is poor; especially, when the loudness of sound played by a loudspeaker is larger, the echo processing effect is worse.

In multi-channel echo cancellation, because every speaker contributes to the echo signal, some existing schemes add a plurality of echo processing modules to the algorithm of the microphone input channel; generally, as many echo processing modules are used as there are speakers in the electronic device, and each echo processing module takes the audio signal of one speaker as its reference signal; referring to fig. 6, an application example of a circuit structure 60 of a dual speaker and single microphone of the related art is shown; as shown in fig. 6, the circuit structure 60 includes: a first speaker 601a, a second speaker 601b, a microphone 602, a first adder 603a, a second adder 603b, a first AF module 604a, a second AF module 604b, a speech processing module 605, a noise reduction and other processing module 606, a decoder 607, an encoder 608, and a transmitting terminal 609; the audio signals at the front ends of the first speaker 601a and the second speaker 601b are still selected as reference signals for echo processing, the audio signal at the front end of the first speaker 601a is input to the first adder 603a after passing through the first AF module 604a, the audio signal received by the microphone 602 is also input to the first adder 603a, and the first adder 603a performs echo processing on the echo generated by the audio signal played by the first speaker 601a in the audio signal recorded by the microphone 602; the audio signal at the front end of the second speaker 601b passes through the second AF module 604b and is then input to the second adder 603b, the audio signal received by the microphone 602 is also input to the second adder 603b, and the second adder 603b performs echo processing on the echo generated by the audio signal played by the second speaker 601b in the audio signal recorded by the microphone 602; the audio signal after echo processing is input to the noise reduction and other processing module 606 for processing, then modulated in the encoder 608, and finally sent out over the transmission line by the transmitting terminal 609. This processing method considers the influence of the superposition of a plurality of loudspeakers on the echo signal, and achieves a better result than the method of fig. 3, which introduces only the digital signal at the front end of one loudspeaker as the reference signal of the echo; however, the defect described above remains: since the acoustic influence introduced by the speaker and the microphone is not considered, relatively obvious echo and howling still occur when the resonance peaks of the speaker and the microphone are low; meanwhile, because a plurality of echo processing modules are introduced, the greater the number of loudspeakers and microphones, the greater the number of echo processing modules, and the complexity increases markedly; in addition, since a plurality of echo processing modules may exist for one microphone, the mutual influence between the echo processing modules can significantly increase the debugging complexity.

In multi-channel echo cancellation, another prior-art solution performs its calculation using the transfer functions between the signals played by the loudspeakers and the signals recorded by the microphones. Referring to fig. 7, an application example of a dual-speaker and dual-microphone circuit structure 70 of the related art is shown; as can be seen from fig. 7, the left/right two-channel stereo signals X_L and X_R input to the line input terminals LI(L) and LI(R) do not pass through the sum/difference signal generating means 52; they are output through the sound output terminals SO(L) and SO(R), reproduced at the speakers SP(L) and SP(R), then collected by the microphones MC(L) and MC(R) and input to the sound input terminals SI(L) and SI(R); the filters 40-1, 40-2, 40-3 and 40-4 are formed, for example, by FIR filters, and the impulse responses set in the filters 40-1, 40-2, 40-3 and 40-4 correspond to the transfer functions between the speakers SP(L), SP(R) and the microphones MC(L), MC(R), respectively, thereby generating the echo processing signals EC1, EC2, EC3 and EC4, respectively; the adders 44 and 46 and the subtracters 48 and 50 perform the echo processing, and the resulting signals are output from the line output terminals LO(L) and LO(R), respectively; here, the sum/difference signal generating means 52 comprises an adder 54 and a subtractor 56 for generating the sum signal X_L + X_R and the difference signal X_L - X_R; the correlation detection means 60 detects, based on correlation value calculation or the like, the correlation between the sum signal X_L + X_R and the difference signal X_L - X_R; the transfer function calculation means 58 calculates the transfer functions of the four audio transfer systems between the speakers SP(L), SP(R) and the microphones MC(L), MC(R).
This technical scheme uses the sum signal and the difference signal of the stereo sound signal as reference signals, and obtains the transfer functions of the four audio transmission systems between the two loudspeakers and the two microphones from cross-spectrum calculations between the reference signals and the sound signals recorded by the microphones; the obtained transfer functions are subjected to inverse Fourier transform to obtain impulse responses, and the impulse responses are set in the filter means to generate the reference signals for echo processing and to carry out the echo processing. The scheme takes into account the influence of the acoustic path, the acoustic structure and the devices on the echo signal, and its analysis of the transfer functions draws on the sum and difference signals of the two loudspeaker signals together with the echo signals recorded by the microphones, so its treatment is relatively comprehensive. However, obtaining the transfer functions requires the loudspeakers to play an audio signal while the microphones record it, so the result is strongly affected by environmental fluctuations, and the working states of the loudspeakers and microphones are not consistent across different audio signals and environments; as a result, echo processing is unsatisfactory under certain conditions. Moreover, as the number of speakers and microphones increases, the complexity of the transfer functions increases markedly.
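The cross-spectrum step described for fig. 7 can be sketched for a single loudspeaker-to-microphone path. This is a simplified illustration under stated assumptions: one channel instead of the sum and difference signals, a single noise-free analysis frame, circular convolution so that the discrete Fourier transform is exact, and a hypothetical impulse response `true_h`. The function names are illustrative only.

```python
import cmath
import random


def dft(x):
    """Naive discrete Fourier transform (adequate for short frames)."""
    n = len(x)
    return [
        sum(x[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
        for k in range(n)
    ]


def idft(spectrum):
    """Naive inverse discrete Fourier transform."""
    n = len(spectrum)
    return [
        sum(spectrum[k] * cmath.exp(2j * cmath.pi * k * t / n) for k in range(n)) / n
        for t in range(n)
    ]


def estimate_impulse_response(played, recorded, eps=1e-12):
    """Estimate H = S_xy / S_xx per bin, then return its inverse transform."""
    X, Y = dft(played), dft(recorded)
    # Cross-spectrum conj(X) * Y divided by the auto-spectrum |X|^2.
    H = [(xk.conjugate() * yk) / (abs(xk) ** 2 + eps) for xk, yk in zip(X, Y)]
    return [h.real for h in idft(H)]


random.seed(1)
n = 32
played = [random.uniform(-1.0, 1.0) for _ in range(n)]
# Hypothetical loudspeaker-to-microphone impulse response.
true_h = [0.0, 0.5, 0.25] + [0.0] * (n - 3)
# Circular convolution stands in for the acoustic path in this sketch.
recorded = [sum(true_h[k] * played[(t - k) % n] for k in range(n)) for t in range(n)]
estimated_h = estimate_impulse_response(played, recorded)
```

With a real recording the estimate would be averaged over many frames and would vary with the environment, which is the sensitivity the paragraph above points out.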

In the application examples of the circuit structures shown in fig. 1 to 7, the prior art hardly considers introducing an acoustic response model of a speaker and a microphone, so that when a high-frequency resonance peak FH of the speaker or the microphone is relatively low (such as near 4 kHz), a howling phenomenon is easily generated because a signal recorded by the microphone has a higher amplitude at FH than a reference signal; in order to effectively solve the echo howling phenomenon generated when the high frequency resonance peak FH of the speaker or the microphone is relatively low, the following describes the embodiments of the present invention in detail with reference to the accompanying drawings.

Example one

Referring to fig. 8, which illustrates a method for signal processing provided by an embodiment of the present invention, the method is applied to a signal processing apparatus having at least one speaker and at least one microphone, and the method may include:

s801: receiving a first audio signal and playing the first audio signal by the at least one speaker;

s802: acquiring at least one echo estimation signal corresponding to the first audio signal according to at least one loudspeaker model, at least one microphone model and the first audio signal; wherein the at least one speaker model is derived based on the at least one speaker and the at least one microphone model is derived based on the at least one microphone;

s803: receiving a second audio signal with the at least one microphone; wherein the second audio signal comprises an echo signal produced by the first audio signal output by the at least one speaker and received by the at least one microphone;

s804: and subtracting at least one echo estimation signal corresponding to the first audio signal from the second audio signal to obtain an audio signal after echo processing.

Based on the technical solution shown in fig. 8, the method is applied to a signal processing apparatus having at least one speaker and at least one microphone, by receiving a first audio signal and playing the first audio signal by the at least one speaker; acquiring at least one echo estimation signal corresponding to the first audio signal according to at least one loudspeaker model, at least one microphone model and the first audio signal; wherein the at least one speaker model is derived based on the at least one speaker and the at least one microphone model is derived based on the at least one microphone; receiving a second audio signal with the at least one microphone; wherein the second audio signal comprises an echo signal produced by the first audio signal output by the at least one speaker and received by the at least one microphone; subtracting at least one echo estimation signal corresponding to the first audio signal from the second audio signal to obtain an echo-processed audio signal; therefore, the echo howling phenomenon easily generated when the high-frequency resonance peak value of the loudspeaker or the microphone is low is effectively solved, and the calculation workload in the multi-microphone design is reduced.
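Steps S801 to S804 can be sketched as follows for one speaker and one microphone. The source does not specify how the speaker model and microphone model are represented; this sketch assumes each is a short FIR impulse response, and the tap values and sample values are hypothetical. The function names `fir` and `echo_process` are illustrative only.

```python
def fir(signal, taps):
    """Convolve `signal` with FIR `taps`, truncated to the signal length."""
    return [
        sum(taps[k] * signal[n - k] for k in range(len(taps)) if n - k >= 0)
        for n in range(len(signal))
    ]


def echo_process(first_audio, second_audio, speaker_model, microphone_model):
    """Return the echo-processed signal for one speaker and one microphone."""
    # S802: echo estimate from the speaker model, then the microphone model.
    echo_estimate = fir(fir(first_audio, speaker_model), microphone_model)
    # S804: subtract the echo estimate from the microphone recording.
    return [m - e for m, e in zip(second_audio, echo_estimate)]


# S801: first audio signal sent to the speaker (hypothetical samples).
first_audio = [1.0, -0.5, 0.25, 0.0, 0.8, -0.3, 0.0, 0.0]
near_end = [0.0, 0.1, -0.2, 0.05, 0.0, 0.3, -0.1, 0.02]
speaker_model = [0.9, 0.2]      # assumed FIR model of the speaker
microphone_model = [0.7, 0.1]   # assumed FIR model of the microphone

# S803: second audio signal = echo of the first signal + near-end sound.
# The echo is idealized here as matching the two models exactly.
echo = fir(fir(first_audio, speaker_model), microphone_model)
second_audio = [e + s for e, s in zip(echo, near_end)]

processed = echo_process(first_audio, second_audio, speaker_model, microphone_model)
```

Because the simulated echo matches the models exactly, `processed` recovers the near-end sound; in practice the quality of the result depends on how closely the models track the real devices.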

For the technical solution shown in fig. 8, in a possible implementation manner, before the receiving the first audio signal and playing the first audio signal by the at least one speaker, the method further includes:

demodulating and voice preprocessing the first audio signal; wherein the first audio signal is generated and transmitted by a remote device.

It should be noted that, in general, a remote device generates and sends a first audio signal, and after receiving the first audio signal, the signal processing device performs demodulation and voice preprocessing, and the processed first audio signal enters a speaker and is played by the speaker.

For the technical solution shown in fig. 8, in a possible implementation manner, before the obtaining, according to at least one speaker model, at least one microphone model and the first audio signal, at least one echo estimation signal corresponding to the first audio signal, the method further includes:

correspondingly establishing at least one loudspeaker model according to the characteristic information of the at least one loudspeaker; wherein the characteristic information of the at least one speaker includes circuit information and structure information corresponding to the at least one speaker;

correspondingly establishing at least one microphone model according to the characteristic information of the at least one microphone; wherein the characteristic information of the at least one microphone includes circuit information and structure information corresponding to the at least one microphone.

It should be noted that, in the embodiment of the present invention, a speaker model and a microphone model are introduced; the loudspeaker model is established based on circuit information and structural information corresponding to the loudspeaker, and acoustic response of the loudspeaker can be simulated; the microphone model is established based on circuit information and structural information corresponding to the microphone, and acoustic response of the microphone can be simulated; the acquired reference signal of the first audio signal can be more accurate and is closer to the echo signal of the first audio signal played by the loudspeaker, so that the processing effect of the echo signal is better.

It is to be understood that the number of the microphones may be one or more, and is not particularly limited in the embodiment of the present invention. When the number of the microphones is one, the number of the correspondingly established microphone models is one; at this time, in the foregoing implementation manner, specifically, the acquiring at least one echo estimation signal corresponding to the first audio signal according to at least one speaker model, at least one microphone model and the first audio signal includes:

inputting the first audio signal into each loudspeaker model for acoustic response processing to obtain an acoustic response signal of each loudspeaker;

obtaining a first reference signal of each loudspeaker according to the acoustic response signal of each loudspeaker and first delay and attenuation information corresponding to the acoustic response signal; the first delay and attenuation information is obtained correspondingly based on the distance between each loudspeaker and the microphone and the position information;

performing superposition and bit-width expansion on the first reference signals of the loudspeakers to obtain a first superimposed reference signal;

based on the microphone model, performing acoustic response processing on the first superimposed reference signal to obtain a first echo estimation signal of the first audio signal.

It should be noted that, in the embodiment of the present invention, the echo estimation signal of the first audio signal is obtained through a series of signal processing steps, such as a speaker model, a delay and attenuation step, a signal superposition step, and a microphone model, so that the echo estimation signal is closer to the echo signal of the first audio signal played by the speaker, and the processing effect of the subsequent echo signal is better.
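The chain described above (speaker model, delay and attenuation, superposition with bit-width expansion, microphone model) can be sketched for several speakers and one microphone. Everything numeric here is an assumption for illustration: the FIR representation of the models, the speed of sound, the attenuation law in `delay_and_attenuate`, the distances, the 16 kHz sample rate, and the 16-bit sample format.

```python
SPEED_OF_SOUND_M_S = 343.0  # assumed value for air at room temperature


def fir(signal, taps):
    """Convolve `signal` with FIR `taps`, truncated to the signal length."""
    return [
        sum(taps[k] * signal[n - k] for k in range(len(taps)) if n - k >= 0)
        for n in range(len(signal))
    ]


def delay_and_attenuate(signal, distance_m, sample_rate_hz):
    """Delay by the propagation time and scale by a hypothetical gain law."""
    delay = round(distance_m / SPEED_OF_SOUND_M_S * sample_rate_hz)
    gain = min(1.0, 0.1 / distance_m)  # hypothetical: unity gain up to 0.1 m
    delayed = [0.0] * delay + list(signal)
    return [gain * s for s in delayed[: len(signal)]]


def to_int16(signal):
    """Round and saturate to the signed 16-bit range."""
    return [max(-32768, min(32767, round(s))) for s in signal]


def echo_estimate(first_audio, speaker_models, distances_m, mic_model, sample_rate_hz):
    references = []
    for taps, distance in zip(speaker_models, distances_m):
        response = fir(first_audio, taps)  # acoustic response of one speaker
        reference = delay_and_attenuate(response, distance, sample_rate_hz)
        references.append(to_int16(reference))  # 16-bit first reference signal
    # Superposition in a wider accumulator: four 16-bit signals need 18 bits.
    superposed = [sum(samples) for samples in zip(*references)]
    return fir(superposed, mic_model), superposed


first_audio = [32767] * 32                 # worst case: full-scale input
speaker_models = [[1.0]] * 4               # assumed pass-through speaker models
distances_m = [0.03, 0.05, 0.08, 0.10]     # hypothetical speaker-to-mic distances
estimate, superposed = echo_estimate(
    first_audio, speaker_models, distances_m, mic_model=[1.0], sample_rate_hz=16000
)
```

In this worst case the superposed signal exceeds the 16-bit range but still fits in 18 signed bits, which is why a sum of four 16-bit reference signals calls for two extra bits of headroom.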

For example, referring to fig. 9, an application example of a circuit structure 90 with four speakers and a single microphone according to an embodiment of the present invention is shown. As shown in fig. 9, the circuit structure 90 includes: a speaker group 901, a microphone 902, a speaker model group 903, a delay attenuation module group 904, a summation module 905, a microphone model 906, an echo processing module 907, a voice processing module 908, a noise reduction and other processing module 909, a decoder 910, an encoder 911 and a radio frequency end 912. The speaker group 901 comprises a first speaker 901a, a second speaker 901b, a third speaker 901c and a fourth speaker 901d; the speaker model group 903 comprises a first speaker model 903a, a second speaker model 903b, a third speaker model 903c and a fourth speaker model 903d; and the delay attenuation module group 904 comprises a first delay attenuation module 904a, a second delay attenuation module 904b, a third delay attenuation module 904c and a fourth delay attenuation module 904d. Here, the speaker model group 903 is established in correspondence with the speaker group 901, and the delay attenuation module group 904 applies the corresponding delay and attenuation to the audio signals output by the speaker model group 903.

Specifically, the radio frequency end 912 sends the received first audio signal to the decoder 910, the decoder 910 demodulates the first audio signal, and the demodulated first audio signal enters the voice processing module 908 for preprocessing such as noise reduction and filtering; the first audio signal is then played by the speaker group 901. The second audio signal recorded by the microphone 902 includes an echo signal generated by the first audio signal played by the speaker group 901. In order to process the echo signal, the first audio signal is extracted at the front end of the speaker group 901 and input into the first speaker model 903a, the second speaker model 903b, the third speaker model 903c and the fourth speaker model 903d respectively, where acoustic response processing yields an acoustic response signal for each speaker in the speaker group 901. Because sound takes time to propagate and loses energy along the way, the delay and attenuation of the sound travelling from each speaker to the microphone position, namely the sound transmission time and the attenuation amount, must be represented; the acoustic response signals are therefore input into the first delay attenuation module 904a, the second delay attenuation module 904b, the third delay attenuation module 904c and the fourth delay attenuation module 904d respectively, yielding four delayed and attenuated first reference signals. These are superposed by the summation module 905 to obtain a superposed reference signal. Because overflow may occur after superposition, bit number expansion processing is also performed, expanding the word length by at least two binary bits (that is, generally a fourfold range). The superposed reference signal is input to the microphone model 906, and acoustic response processing is performed according to the sound pressure excitation of the microphone model 906, so that the echo reference signal as recorded by the microphone 902, namely the first echo estimation signal of the first audio signal, is obtained. The first echo estimation signal and the second audio signal recorded by the microphone 902 are input together to the echo processing module 907 for echo processing; the echo-processed audio signal is further processed by the noise reduction and other processing module 909, then modulated in the encoder 911, and finally sent out over the transmission line by the radio frequency end 912.
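As a rough illustration of how delay and attenuation might be derived from the speaker-to-microphone distance, the sketch below assumes free-field propagation: the delay is the distance divided by the speed of sound, and the amplitude falls off inversely with distance. Neither formula is given in the description, which also refers to position information that this sketch ignores; the function name `delay_and_gain`, the 343 m/s constant and the reference distance are all assumptions.

```python
from typing import Tuple

SPEED_OF_SOUND_M_S = 343.0  # assumption: dry air at about 20 degrees Celsius


def delay_and_gain(distance_m: float,
                   sample_rate_hz: int,
                   ref_distance_m: float = 0.1) -> Tuple[int, float]:
    """Return (delay in samples, linear gain) for one speaker-to-microphone path.

    Assumed model: delay = distance / speed of sound; gain = ref / distance,
    capped at 1.0 so that very short paths are not amplified.
    """
    if distance_m <= 0:
        raise ValueError("distance_m must be positive")
    delay_samples = round(distance_m / SPEED_OF_SOUND_M_S * sample_rate_hz)
    gain = min(1.0, ref_distance_m / distance_m)
    return delay_samples, gain
```

With these assumptions, a path of 0.343 m at a 16 kHz sample rate corresponds to a delay of 16 samples.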

It should be noted that the sound signals of the plurality of speakers (for example, the four speakers 901a, 901b, 901c, and 901d in fig. 9) undergo superposition and bit number expansion processing; the bit number expansion is required mainly because overflow may occur after the superposition. The superposed sound signal is input to the echo processing module as the echo estimation signal and is used mainly to process, within the second audio signal recorded by the microphone, the echo signal generated by the speaker playing the first audio signal. Since the amplitude of the original audio signal is not increased by passing through the echo processing module, bit number expansion processing does not need to be considered in the echo processing module for the second audio signal recorded by the microphone.
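The overflow concern can be illustrated with integer arithmetic. The sketch below assumes 16-bit two's-complement input samples, which the description does not state; the helper names `wrap_to_bits`, `extra_bits_needed` and `superpose` are hypothetical. It shows that four near-full-scale samples wrap around in a 16-bit word but are represented correctly once the word is widened by two bits.

```python
from typing import Sequence


def wrap_to_bits(value: int, bits: int) -> int:
    """Two's-complement wrap of `value` into a signed word of `bits` bits."""
    span = 1 << bits
    half = span >> 1
    return (value + half) % span - half


def extra_bits_needed(num_signals: int) -> int:
    """Smallest k with 2**k >= num_signals, i.e. ceil(log2(num_signals))."""
    return (num_signals - 1).bit_length()


def superpose(samples: Sequence[int], in_bits: int = 16) -> int:
    """Sum one sample per speaker into a word widened by the extra bits."""
    out_bits = in_bits + extra_bits_needed(len(samples))
    return wrap_to_bits(sum(samples), out_bits)


samples = [30000, 30000, 30000, 30000]   # one sample from each of four speakers
overflowed = wrap_to_bits(sum(samples), 16)  # what a 16-bit accumulator would hold
widened = superpose(samples)                 # 18-bit accumulator keeps the true sum
```

For four speakers, `extra_bits_needed(4)` is 2, matching the two-bit (fourfold) expansion mentioned in the description.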

It will be appreciated that when the high-frequency resonance peak of the microphone is relatively high (e.g. greater than 8 kHz), the acoustic response processing performed by the microphone model on the first superposed reference signal may be omitted; therefore, in the above specific implementation, preferably, after the first superposed reference signal is obtained, the method further includes:

acquiring a high-frequency resonance peak value of the microphone;

comparing the high-frequency resonance peak value of the microphone with a preset high-frequency resonance peak value;

and if the high-frequency resonance peak value of the microphone is higher than the preset high-frequency resonance peak value, canceling the acoustic response processing performed by the microphone model on the first superposed reference signal.
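The comparison steps above amount to a simple branch. In the minimal sketch below, the function name `select_echo_estimate` and the callable `mic_model` are hypothetical, and the 8 kHz default is only the example value mentioned in the description.

```python
from typing import Callable, List, Sequence

PRESET_PEAK_HZ = 8000.0  # example preset value; an actual device may use another


def select_echo_estimate(summed_ref: Sequence[float],
                         mic_peak_hz: float,
                         mic_model: Callable[[Sequence[float]], List[float]],
                         preset_peak_hz: float = PRESET_PEAK_HZ) -> List[float]:
    """Skip the microphone model when the microphone's peak exceeds the preset."""
    if mic_peak_hz > preset_peak_hz:
        # Peak is high enough: use the superposed reference signal directly.
        return list(summed_ref)
    # Otherwise correct the superposed reference signal with the microphone model.
    return mic_model(summed_ref)
```

A peak exactly equal to the preset is treated here as "not higher", so the microphone model is still applied.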

It should be noted that if the frequency bandwidth of the microphone (such as the microphone 902 in fig. 9) is relatively wide, that is, the high-frequency resonance peak of the microphone 902 is relatively high, the first superposed reference signal can be used directly as the echo estimation signal without acoustic response processing by the microphone model 906; that is, the echo estimation signal does not need to be corrected by the microphone model 906. For example, with reference to the circuit structure 90 shown in fig. 9, after the first superposed reference signal is obtained by the summation module 905, a determination is made according to the frequency bandwidth of the microphone 902, that is, the high-frequency resonance peak of the microphone 902 is compared with the preset high-frequency resonance peak. Assuming that the preset high-frequency resonance peak is 8 kHz: if the high-frequency resonance peak of the microphone 902 is 5 kHz, that is, lower than the preset high-frequency resonance peak, the microphone model 906 is required to perform acoustic response processing on the first superposed reference signal, so that a corrected echo estimation signal is obtained; if the high-frequency resonance peak of the microphone 902 is 9 kHz, that is, higher than the preset high-frequency resonance peak, the microphone model 906 is not required to perform acoustic response processing on the first superposed reference signal.

The latter case is shown in fig. 10, which illustrates an application example of another circuit structure 100 with four speakers and a single microphone provided by the embodiment of the present invention. As shown in fig. 10, the circuit structure 100 includes: a speaker group 1001, a microphone 1002, a speaker model group 1003, a delay attenuation module group 1004, a summation module 1005, an echo processing module 1006, a voice processing module 1007, a noise reduction and other processing module 1008, a decoder 1009, an encoder 1010 and a radio frequency end 1011. The speaker group 1001 includes a first speaker 1001a, a second speaker 1001b, a third speaker 1001c, and a fourth speaker 1001d; the speaker model group 1003 includes a first speaker model 1003a, a second speaker model 1003b, a third speaker model 1003c, and a fourth speaker model 1003d; and the delay attenuation module group 1004 includes a first delay attenuation module 1004a, a second delay attenuation module 1004b, a third delay attenuation module 1004c, and a fourth delay attenuation module 1004d. Specifically, compared with the circuit structure 90 shown in fig. 9, the circuit structure 100 shown in fig. 10 only omits the microphone model. That is, in fig. 10, the step in which the microphone model performs acoustic response processing on the first superposed reference signal is eliminated; after the superposed reference signal is obtained, it is input directly into the echo processing module 1006 as the echo estimation signal to perform echo processing on the second audio signal recorded by the microphone 1002. The remaining operation is the same as that of the circuit structure 90 shown in fig. 9 and will not be described in detail here.

It is to be understood that the number of microphones may be one or more and is not particularly limited in the embodiment of the present invention. When there are two microphones, two microphone models are correspondingly established; in this case, in the foregoing implementation, the obtaining of at least one echo estimation signal corresponding to the first audio signal according to the at least one speaker model, the at least one microphone model and the first audio signal specifically includes:

obtaining a first reference signal and a second reference signal of each loudspeaker according to the acoustic response signal of each loudspeaker, first delay and attenuation information corresponding to the acoustic response signal and second delay and attenuation information corresponding to the acoustic response signal; the first delay and attenuation information is obtained based on the distance and position information between each loudspeaker and a first microphone of the microphones, and the second delay and attenuation information is obtained based on the distance and position information between each loudspeaker and a second microphone of the microphones;

performing superposition and bit number expansion processing on the first reference signals and on the second reference signals of the loudspeakers respectively, to correspondingly obtain a first superposed reference signal and a second superposed reference signal;

performing acoustic response processing on the first superimposed reference signal based on a first microphone model of the microphone models to obtain a first echo estimation signal of the first audio signal;

and performing acoustic response processing on the second superposed reference signal based on a second microphone model in the microphone models to obtain a second echo estimation signal of the first audio signal.

For example, referring to fig. 11, an application example of a circuit structure 110 with four speakers and two microphones provided by the embodiment of the present invention is shown. As shown in fig. 11, the circuit structure 110 includes: a speaker group 1101, a first microphone 1102a, a second microphone 1102b, a speaker model group 1103, a first delay attenuation module group 1104-1, a second delay attenuation module group 1104-2, a first summation module 1105a, a second summation module 1105b, a first microphone model 1106a, a second microphone model 1106b, a first echo processing module 1107a, a second echo processing module 1107b, a voice processing module 1108, a noise reduction and other processing module 1109, a decoder 1110, an encoder 1111 and a radio frequency end 1112. The speaker group 1101 comprises a first speaker 1101a, a second speaker 1101b, a third speaker 1101c and a fourth speaker 1101d; the speaker model group 1103 comprises a first speaker model 1103a, a second speaker model 1103b, a third speaker model 1103c and a fourth speaker model 1103d; the first delay attenuation module group 1104-1 comprises a first delay attenuation module 1104-1a, a second delay attenuation module 1104-1b, a third delay attenuation module 1104-1c and a fourth delay attenuation module 1104-1d; and the second delay attenuation module group 1104-2 comprises a fifth delay attenuation module 1104-2a, a sixth delay attenuation module 1104-2b, a seventh delay attenuation module 1104-2c and an eighth delay attenuation module 1104-2d. Here, the speaker model group 1103 is established in correspondence with the speaker group 1101, and the first delay attenuation module group 1104-1 and the second delay attenuation module group 1104-2 apply the corresponding delay and attenuation to the audio signals output by the speaker model group 1103.

Specifically, the radio frequency end 1112 sends the received first audio signal to the decoder 1110, the decoder 1110 demodulates the first audio signal, and the demodulated first audio signal enters the voice processing module 1108 for preprocessing such as noise reduction and filtering; the first audio signal is then played by the speaker group 1101. The second audio signal recorded by the first microphone 1102a includes an echo signal generated by the first audio signal played by the speaker group 1101, and the second audio signal recorded by the second microphone 1102b also includes an echo signal generated by the first audio signal played by the speaker group 1101. In order to eliminate the echo signals as far as possible, the first audio signal is extracted at the front end of the speaker group 1101; if the first audio signals input to the four speakers are different, the extracted first audio signals are input to the corresponding ones of the first speaker model 1103a, the second speaker model 1103b, the third speaker model 1103c and the fourth speaker model 1103d. Acoustic response processing is then performed, so that an acoustic response signal of each speaker in the speaker group 1101 is obtained. Because sound takes time to propagate and loses energy along the way, the delay and attenuation of the sound travelling from each speaker to each microphone position must be represented: the delays and attenuations in the first delay attenuation module group 1104-1 are obtained according to the distance and position information between the first microphone 1102a and each speaker in the speaker group 1101, and the delays and attenuations in the second delay attenuation module group 1104-2 are obtained according to the distance and position information between the second microphone 1102b and each speaker in the speaker group 1101. The acoustic response signals are input into the first delay attenuation module group 1104-1 and the second delay attenuation module group 1104-2 respectively for delay attenuation processing, and eight delayed and attenuated sound signals are obtained.

The four delayed and attenuated first reference signals obtained through the first delay attenuation module group 1104-1 are input to the first summation module 1105a for superposition and bit number expansion processing, so as to obtain the first superposed reference signal; the first superposed reference signal is input to the first microphone model 1106a, and acoustic response processing is performed according to the sound pressure excitation of the first microphone model 1106a, so that the first echo estimation signal as recorded by the first microphone 1102a is obtained. The four delayed and attenuated second reference signals obtained through the second delay attenuation module group 1104-2 are input to the second summation module 1105b for superposition and bit number expansion processing, so as to obtain the second superposed reference signal; the second superposed reference signal is input to the second microphone model 1106b, and acoustic response processing is performed according to the sound pressure excitation of the second microphone model 1106b, so that the second echo estimation signal as recorded by the second microphone 1102b is obtained. The first echo estimation signal and the second audio signal recorded by the first microphone 1102a are input together to the first echo processing module 1107a for echo processing, so that the echo signal generated by the first audio signal played by the speaker group 1101 in the second audio signal recorded by the first microphone 1102a is eliminated as far as possible; likewise, the second echo estimation signal and the second audio signal recorded by the second microphone 1102b are input together to the second echo processing module 1107b for echo processing, so that the echo signal generated by the first audio signal played by the speaker group 1101 in the second audio signal recorded by the second microphone 1102b is eliminated as far as possible. The echo-processed audio signal is further processed by the noise reduction and other processing module 1109, then modulated in the encoder 1111, and finally sent out over the transmission line by the radio frequency end 1112.

It should be noted that, in the first echo processing module or the second echo processing module, bit number expansion processing does not need to be considered for the second audio signals recorded by the first microphone and the second microphone.

It should be further noted that, when there are multiple microphones (such as the first microphone 1102a and the second microphone 1102b in fig. 11), the reference signal (i.e., the echo estimation signal) for the echo signal in each microphone needs to be obtained separately, and echo processing needs to be performed separately (such as by the first echo processing module 1107a and the second echo processing module 1107b in fig. 11). However, because the speaker group is shared and the speaker models are arranged separately from the delay attenuation modules, the speaker model is applied first and the position and distance information between the speakers and each microphone is applied afterwards; the audio signal at the front end of each speaker therefore needs to undergo the acoustic response processing of the speaker model only once, which helps to reduce the computational workload in multi-microphone designs.
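The saving described above can be sketched as follows: the acoustic response signal of each speaker is computed once and then reused for every microphone, with only the delay and attenuation differing per microphone. The names `delay_attenuate` and `superposed_refs_per_mic` and the `(delay, gain)` path representation are hypothetical, and floating-point samples are assumed, so bit number expansion is not shown.

```python
from typing import List, Sequence, Tuple

Path = Tuple[int, float]  # hypothetical (delay in samples, linear gain)


def delay_attenuate(signal: Sequence[float], delay: int, gain: float) -> List[float]:
    """Delay by `delay` samples (zero-filled) and scale by `gain`."""
    n = len(signal)
    shifted = [0.0] * min(delay, n) + list(signal[: max(n - delay, 0)])
    return [gain * x for x in shifted]


def superposed_refs_per_mic(acoustic: Sequence[Sequence[float]],
                            paths_per_mic: Sequence[Sequence[Path]]) -> List[List[float]]:
    """Build one superposed reference signal per microphone.

    acoustic[i] is the acoustic response signal of speaker i, computed once.
    paths_per_mic[m][i] describes the path from speaker i to microphone m.
    """
    out = []
    for paths in paths_per_mic:
        # Reuse the same acoustic response signals for every microphone.
        refs = [delay_attenuate(sig, d, g) for sig, (d, g) in zip(acoustic, paths)]
        out.append([sum(col) for col in zip(*refs)])
    return out
```

With four speakers and two microphones, the speaker models run four times rather than eight; each superposed reference signal would then pass through its own microphone model where required.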

It is understood that when the high-frequency resonance peak of the first microphone or the second microphone is relatively high (for example, greater than 8 kHz), the acoustic response processing performed by the corresponding microphone model on the first superposed reference signal or the second superposed reference signal may also be omitted; therefore, in the above specific implementation, preferably, after the first superposed reference signal and the second superposed reference signal are correspondingly obtained, the method further includes:

acquiring a high-frequency resonance peak value of the first microphone and a high-frequency resonance peak value of the second microphone;

comparing the high-frequency resonance peak value of the first microphone and the high-frequency resonance peak value of the second microphone with preset high-frequency resonance peak values respectively;

if the high-frequency resonance peak value of the first microphone is higher than the preset high-frequency resonance peak value, canceling the acoustic response processing performed by the first microphone model on the first superposed reference signal;

and if the high-frequency resonance peak value of the second microphone is higher than the preset high-frequency resonance peak value, canceling the acoustic response processing performed by the second microphone model on the second superposed reference signal.

It should be noted that if the frequency bandwidth of the first microphone or the second microphone is relatively wide, that is, the high-frequency resonance peak of the microphone is relatively high, the corresponding superposed reference signal does not need to undergo acoustic response processing by the first microphone model or the second microphone model; that is, the echo estimation signal does not need to be corrected by the first microphone model or the second microphone model. For example, with reference to the circuit structure 110 shown in fig. 11, after the first summation module 1105a obtains the first superposed reference signal, a determination is made according to the bandwidth of the first microphone 1102a, that is, the high-frequency resonance peak of the first microphone 1102a is compared with the preset high-frequency resonance peak. Assuming that the preset high-frequency resonance peak is 8 kHz: if the high-frequency resonance peak of the first microphone 1102a is 5 kHz, that is, lower than the preset high-frequency resonance peak, the first microphone model 1106a is required to perform acoustic response processing on the first superposed reference signal, so that a corrected first echo estimation signal is obtained; if the high-frequency resonance peak of the first microphone 1102a is 9 kHz, that is, higher than the preset high-frequency resonance peak, the first microphone model 1106a is not required to perform acoustic response processing on the first superposed reference signal, that is, the first echo estimation signal does not need to be corrected by the first microphone model 1106a, and the step in which the first microphone model 1106a performs acoustic response processing on the first superposed reference signal is cancelled.

In the same way, a determination may be made according to the bandwidth of the second microphone 1102b, that is, the high-frequency resonance peak of the second microphone 1102b is compared with the preset high-frequency resonance peak, so as to determine whether the second echo estimation signal needs to be corrected by the second microphone model 1106b; when the high-frequency resonance peak of the second microphone 1102b is higher than the preset high-frequency resonance peak, the second microphone model 1106b is not required to perform acoustic response processing on the second superposed reference signal, and the step in which the second microphone model 1106b performs acoustic response processing on the second superposed reference signal is cancelled.

For the technical solution shown in fig. 8, in a possible implementation manner, the inputting the first audio signal into each speaker model for acoustic response processing to obtain an acoustic response signal of each speaker specifically includes:

if the first audio signals received by the at least one loudspeaker are the same, respectively inputting the first audio signals into the at least one loudspeaker model and carrying out acoustic response processing to obtain an acoustic response signal of each loudspeaker;

and if the first audio signals received by the at least one loudspeaker are different, correspondingly inputting the first audio signals into the at least one loudspeaker model and carrying out acoustic response processing to obtain an acoustic response signal of each loudspeaker.
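The two cases above differ only in how the first audio signal is routed to the speaker models. A minimal sketch, with the hypothetical helper `route_to_models`, in which a single shared signal is passed to every model and distinct signals are passed one-to-one:

```python
from typing import List, Sequence


def route_to_models(first_audio_signals: Sequence[Sequence[float]],
                    num_speakers: int) -> List[Sequence[float]]:
    """Return the input signal for each speaker model, in speaker order.

    One signal: the speakers all receive the same first audio signal, so it is
    fed to every speaker model. Otherwise: one signal per speaker, fed to the
    corresponding speaker model.
    """
    if len(first_audio_signals) == 1:
        return [first_audio_signals[0]] * num_speakers
    if len(first_audio_signals) != num_speakers:
        raise ValueError("expected one signal, or one signal per speaker")
    return list(first_audio_signals)
```

Each returned signal would then be processed by the speaker model at the same index.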

It should be noted that the first audio signals received by the multiple speakers may be the same; in this case, the same audio signal is input to each of the speakers and, at the same time, to each of the speaker models, as in the circuit structure 90 shown in fig. 9. Alternatively, the first audio signals received by the multiple speakers may be different; in this case, the audio signals are input to the corresponding speakers and also to the corresponding speaker models, as in the circuit structure 120 shown in fig. 12.

Referring to fig. 12, an application example of a circuit structure 120 with four speakers and a single microphone according to another embodiment of the present invention is shown. As shown in fig. 12, the circuit structure 120 includes: a speaker group 1201, a microphone 1202, a speaker model group 1203, a delay attenuation module group 1204, a summation module 1205, a microphone model 1206, an echo processing module 1207, a voice processing module 1208, a noise reduction and other processing module 1209, a decoder 1210, an encoder 1211 and a radio frequency end 1212. The speaker group 1201 comprises a first speaker 1201a, a second speaker 1201b, a third speaker 1201c and a fourth speaker 1201d; the speaker model group 1203 comprises a first speaker model 1203a, a second speaker model 1203b, a third speaker model 1203c and a fourth speaker model 1203d; and the delay attenuation module group 1204 comprises a first delay attenuation module 1204a, a second delay attenuation module 1204b, a third delay attenuation module 1204c and a fourth delay attenuation module 1204d. It should be further noted that the circuit structure 120 shown in fig. 12 is similar in function to the circuit structure 90 shown in fig. 9; the only difference is that the first audio signal is divided into four first audio signals after passing through the voice processing module 1208. These four first audio signals enter the corresponding speakers of the speaker group 1201 and are played by the speaker group 1201; meanwhile, the first audio signals received by the first speaker 1201a, the second speaker 1201b, the third speaker 1201c and the fourth speaker 1201d are input into the first speaker model 1203a, the second speaker model 1203b, the third speaker model 1203c and the fourth speaker model 1203d, respectively. The remaining operation is the same as that of the circuit structure 90 shown in fig. 9 and will not be described in detail here.

This embodiment provides a signal processing method applied to a signal processing device having at least one speaker and at least one microphone. The method includes: receiving a first audio signal and playing the first audio signal by the at least one speaker; acquiring at least one echo estimation signal corresponding to the first audio signal according to at least one speaker model, at least one microphone model and the first audio signal, wherein the at least one speaker model is derived based on the at least one speaker and the at least one microphone model is derived based on the at least one microphone; receiving a second audio signal with the at least one microphone, wherein the second audio signal comprises an echo signal produced by the first audio signal output by the at least one speaker and received by the at least one microphone; and subtracting the at least one echo estimation signal corresponding to the first audio signal from the second audio signal to obtain an echo-processed audio signal. In this way, the echo howling that easily arises when the high-frequency resonance peak of the speaker or the microphone is low is effectively solved, and the computational workload in multi-microphone designs is reduced.
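The final subtraction step can be illustrated in a few lines. This sketch assumes that the echo estimation signal is already time-aligned with the second audio signal and has the same length; the function name `subtract_echo` is hypothetical.

```python
from typing import List, Sequence


def subtract_echo(second_audio: Sequence[float],
                  echo_estimate: Sequence[float]) -> List[float]:
    """Subtract the echo estimation signal from the microphone signal, sample by sample."""
    if len(second_audio) != len(echo_estimate):
        raise ValueError("signals must have the same length")
    return [m - e for m, e in zip(second_audio, echo_estimate)]


# Toy data: the microphone records near-end speech plus the echo.
near_end = [0.25, -0.5, 0.125]
echo = [0.5, 0.25, -0.25]
recorded = [s + e for s, e in zip(near_end, echo)]
cleaned = subtract_echo(recorded, echo)
```

In this toy case the estimate equals the true echo, so `cleaned` reproduces `near_end`; with an imperfect estimate a residual echo would remain.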

Example two

Based on the same inventive concept as the foregoing embodiment, fig. 13 shows a detailed flow of a signal processing method provided by the embodiment of the present invention. In conjunction with the application example of the four-speaker, single-microphone circuit structure 100 shown in fig. 10, the detailed flow may include:

S1301: extracting one digital signal from the front end of the speaker group 1001 to obtain a first audio signal;

S1302: inputting the first audio signal into a first speaker model 1003a, a second speaker model 1003b, a third speaker model 1003c and a fourth speaker model 1003d respectively for acoustic response processing, to obtain an acoustic response signal of each speaker;

S1303: obtaining a reference signal of each speaker according to the acoustic response signal of each speaker in the speaker group 1001 and first delay and attenuation information corresponding to the acoustic response signal;

S1304: performing superposition and bit number expansion processing on the reference signal of each speaker to obtain a superposed reference signal, and taking the superposed reference signal as an echo estimation signal;

S1305: inputting the echo estimation signal to an echo processing module 1006 to perform echo processing on the second audio signal recorded by the microphone 1002.

It should be noted that the above detailed procedure does not consider the correction of the echo estimation signal by the microphone model. For example, taking the circuit structure 100 shown in fig. 10 as an example, in combination with the above examples, after obtaining the superimposed reference signal, the superimposed reference signal is directly used as an echo estimation signal, and then the echo estimation signal is used to perform echo processing on the second audio signal recorded by the microphone 1002.

Fig. 14 shows a detailed flow of another signal processing method provided by the embodiment of the present invention. In conjunction with the application example of the four-speaker, single-microphone circuit structure 90 shown in fig. 9, the detailed flow may include:

S1401: extracting one digital signal from the front end of the speaker group 901 to obtain a first audio signal;

S1402: inputting the first audio signal into a first speaker model 903a, a second speaker model 903b, a third speaker model 903c and a fourth speaker model 903d respectively for acoustic response processing, to obtain an acoustic response signal of each speaker;

S1403: obtaining a reference signal of each speaker according to the acoustic response signal of each speaker in the speaker group 901 and first delay and attenuation information corresponding to the acoustic response signal;

S1404: performing superposition and bit number expansion processing on the reference signal of each speaker to obtain a superposed reference signal;

S1405: acquiring the high-frequency resonance peak of the microphone 902;

S1406: comparing the high-frequency resonance peak of the microphone 902 with a preset high-frequency resonance peak;

S1407: if the high-frequency resonance peak of the microphone 902 is higher than the preset high-frequency resonance peak, directly taking the superposed reference signal as an echo estimation signal;

S1408: if the high-frequency resonance peak of the microphone 902 is not higher than the preset high-frequency resonance peak, performing acoustic response processing on the superposed reference signal based on the microphone model 906 to obtain an echo estimation signal;

S1409: inputting the echo estimation signal to an echo processing module 907 to perform echo processing on the second audio signal recorded by the microphone 902.

It should be noted that, compared with the detailed flow shown in fig. 13, this flow mainly adds the microphone model and the determination of whether the microphone model needs to perform acoustic response processing on the superposed reference signal. For example, taking the circuit structure 90 shown in fig. 9 as an example, in combination with the foregoing example, after the superposed reference signal is obtained, the high-frequency resonance peak of the microphone 902 is obtained. Assuming that the preset high-frequency resonance peak is 8 kHz: if the high-frequency resonance peak of the microphone 902 is 5 kHz, that is, lower than the preset high-frequency resonance peak, the microphone model 906 is required to perform acoustic response processing on the superposed reference signal, so that a corrected echo estimation signal is obtained, and the corrected echo estimation signal is used to perform echo processing on the second audio signal recorded by the microphone 902; if the high-frequency resonance peak of the microphone 902 is 9 kHz, that is, higher than the preset high-frequency resonance peak, the microphone model 906 is not required to perform acoustic response processing on the superposed reference signal, that is, the echo estimation signal does not need to be corrected by the microphone model 906, and the superposed reference signal is used directly as the echo estimation signal to perform echo processing on the second audio signal recorded by the microphone 902.

Referring to fig. 15, which shows a detailed flow of still another signal processing method provided by the embodiment of the present invention, in conjunction with an application example of the circuit structure 120 of a four-speaker and single-microphone shown in fig. 12, the detailed flow may include:

s1501: respectively extracting one path of digital signal from the front end of each loudspeaker in the loudspeaker group 1201 to obtain four paths of first audio signals;

s1502: correspondingly inputting the four paths of first audio signals into a first loudspeaker model 1203a, a second loudspeaker model 1203b, a third loudspeaker model 1203c and a fourth loudspeaker model 1203d for acoustic response processing to obtain an acoustic response signal of each loudspeaker;

s1503: obtaining a reference signal of each speaker according to an acoustic response signal of each speaker in the speaker group 1201 and first delay and attenuation information corresponding to the acoustic response signal;

s1504: performing superposition and bit extension processing on the reference signal of each loudspeaker to obtain a superposed reference signal;

s1505: acquiring a high frequency resonance peak of the microphone 1202;

s1506: comparing the high frequency resonance peak of the microphone 1202 with a preset high frequency resonance peak;

s1507: if the high-frequency resonance peak value of the microphone 1202 is higher than the preset high-frequency resonance peak value, the superposed reference signal is used as an echo estimation signal;

s1508: if the high-frequency resonance peak value of the microphone 1202 is not higher than the preset high-frequency resonance peak value, performing acoustic response processing on the superposed reference signal based on the microphone model 1206 to obtain an echo estimation signal;

s1509: the echo estimation signal is input to an echo processing module 1207 to perform echo processing on the second audio signal recorded by the microphone 1202.

It should be noted that, compared with the detailed flow shown in fig. 14, it is mainly considered that the audio signals input by each speaker in the speaker group may be different; the other processing steps are the same as those of the detailed flow shown in fig. 14, and are not described in detail here.
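The multi-speaker, single-microphone flow can be sketched roughly as below. This is an illustration under stated assumptions: each loudspeaker model is represented by FIR filter taps (the source does not say how the model is realized), the function names are hypothetical, bit extension is omitted because Python floats are used, and only the branch in which the superposed reference signal is used directly as the echo estimation signal is shown.

```python
def fir_filter(signal, taps):
    """Causal FIR filtering, truncated to the input length (assumed speaker model)."""
    out = []
    for n in range(len(signal)):
        acc = 0.0
        for k, h in enumerate(taps):
            if n - k >= 0:
                acc += h * signal[n - k]
        out.append(acc)
    return out


def delay_and_attenuate(signal, delay_samples, gain):
    """Shift the signal by delay_samples and scale it by gain."""
    delayed = [0.0] * delay_samples + list(signal)
    return [gain * v for v in delayed[:len(signal)]]


def echo_estimate(first_audio, speaker_taps, delays, gains):
    """Per-speaker model, then delay/attenuation, then superposition."""
    total = [0.0] * len(first_audio[0])
    for sig, taps, d, g in zip(first_audio, speaker_taps, delays, gains):
        ref = delay_and_attenuate(fir_filter(sig, taps), d, g)
        total = [a + b for a, b in zip(total, ref)]
    return total


def cancel_echo(second_audio, estimate):
    """Subtract the echo estimation signal from the microphone signal."""
    return [m - e for m, e in zip(second_audio, estimate)]
```

Each path of the first audio signal has its own filter, delay and gain, which reflects the point that the signals fed to the loudspeakers in the group may differ.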

Referring to fig. 16, a detailed flow of still another signal processing method provided by the embodiment of the invention is shown, which, in conjunction with an application example of the circuit structure 110 of the four-speaker and two-microphone shown in fig. 11, may include:

s1601: extracting a path of digital signal from the front end of each loudspeaker in the loudspeaker group 1101 respectively to obtain four paths of first audio signals;

s1602: correspondingly inputting the four paths of first audio signals into a first loudspeaker model 1103a, a second loudspeaker model 1103b, a third loudspeaker model 1103c and a fourth loudspeaker model 1103d to perform acoustic response processing, so as to obtain an acoustic response signal of each loudspeaker;

s1603: obtaining a first reference signal of each speaker according to an acoustic response signal of each speaker in the speaker group 1101 and first delay and attenuation information corresponding to the acoustic response signal;

s1604: performing superposition and bit extension processing on the first reference signal of each loudspeaker to obtain a first superposed reference signal;

s1605: acquiring a high-frequency resonance peak of the first microphone 1102a;

s1606: comparing the high frequency resonance peak value of the first microphone 1102a with a preset high frequency resonance peak value;

s1607: if the high-frequency resonance peak value of the first microphone 1102a is higher than a preset high-frequency resonance peak value, taking the first superposed reference signal as a first echo estimation signal;

s1608: if the high-frequency resonance peak value of the first microphone 1102a is not higher than the preset high-frequency resonance peak value, performing acoustic response processing on the first superimposed reference signal based on a first microphone model 1106a to obtain a first echo estimation signal;

s1609: inputting the first echo estimation signal to the first echo processing module 1107a to perform first echo processing on the second audio signal recorded by the first microphone 1102a;

s1610: obtaining a second reference signal of each speaker according to the acoustic response signal of each speaker in the speaker group 1101 and second delay and attenuation information corresponding to the acoustic response signal;

s1611: performing superposition and bit extension processing on the second reference signal of each loudspeaker to obtain a second superposed reference signal;

s1612: acquiring a high-frequency resonance peak of the second microphone 1102b;

s1613: comparing the high frequency resonance peak value of the second microphone 1102b with a preset high frequency resonance peak value;

s1614: if the high-frequency resonance peak value of the second microphone 1102b is higher than the preset high-frequency resonance peak value, taking the second superposed reference signal as a second echo estimation signal;

s1615: if the high-frequency resonance peak value of the second microphone 1102b is not higher than the preset high-frequency resonance peak value, performing acoustic response processing on the second superposed reference signal based on a second microphone model 1106b to obtain a second echo estimation signal;

s1616: inputting the second echo estimation signal to the second echo processing module 1107b to perform second echo processing on the second audio signal recorded by the second microphone 1102b;

s1617: the audio signal after the first echo processing and the second echo processing is input to the noise reduction and other processing module 1109.

It should be noted that, compared with the detailed flow shown in fig. 15, this flow mainly considers the echo processing of two microphones; the other processing steps are substantially the same as those of the detailed flow shown in fig. 15 and are not described in detail here. For an application with two microphones, the reference signal (i.e., the echo estimation signal) for the echo signal in each microphone needs to be obtained separately, and echo processing needs to be performed separately (see the first echo processing module 1107a and the second echo processing module 1107b in fig. 11). The acoustic response processing performed by the speaker models, however, is shared: because the speaker group is common to both microphones and the speaker models are arranged separately from the time delay attenuation modules, the speaker model is applied first, and the position and distance information between each speaker and each microphone is applied afterwards. The audio signal at the front end of each speaker therefore only needs to undergo the acoustic response processing of the speaker model once, which helps to reduce the calculation workload in a multi-microphone design.
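The saving described above can be illustrated with a short sketch. It is not the implementation of the embodiment: the function name, the representation of each speaker model as a callable, and the `paths` table of per-microphone delay and gain values are all hypothetical.

```python
def estimates_for_microphones(first_audio, speaker_models, paths):
    """Build one superposed reference signal per microphone.

    speaker_models[s] is a hypothetical callable for speaker s.
    paths[m][s] is an assumed (delay_samples, gain) pair describing the
    path from speaker s to microphone m.
    """
    # The speaker models run once per speaker, however many microphones exist.
    responses = [list(model(sig)) for model, sig in zip(speaker_models, first_audio)]

    estimates = []
    for mic_paths in paths:
        total = [0.0] * len(responses[0])
        for resp, (delay, gain) in zip(responses, mic_paths):
            # Only the cheap delay/attenuation step is repeated per microphone.
            shifted = ([0.0] * delay + resp)[:len(resp)]
            total = [t + gain * v for t, v in zip(total, shifted)]
        estimates.append(total)
    return estimates
```

With four speakers and two microphones, this arrangement calls the speaker models four times rather than eight; only the delay and attenuation step scales with the number of microphones.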

The embodiment of the invention is described mainly for echo processing of the echo signals of four loudspeakers, but it is also applicable to echo processing with other numbers of loudspeakers and microphones, and even to echo processing with a single loudspeaker; this is not particularly limited in the embodiment of the present invention.

Referring to fig. 17, a detailed flow of still another signal processing method provided by the embodiment of the invention is shown, which, in conjunction with an application example of the circuit structure 180 of a single speaker and a single microphone shown in fig. 18, may include:

s1701: extracting a path of digital signal from the front end of a loudspeaker 1801 to obtain a first audio signal;

s1702: inputting the first audio signal into a loudspeaker model 1803 for acoustic response processing to obtain an acoustic response signal of a loudspeaker 1801;

s1703: obtaining a reference signal of the loudspeaker 1801 according to the acoustic response signal of the loudspeaker 1801 and the delay and attenuation information corresponding to the acoustic response signal;

s1704: acquiring a high-frequency resonance peak of a microphone 1802;

s1705: comparing the high frequency resonance peak value of the microphone 1802 with a preset high frequency resonance peak value;

s1706: if the high-frequency resonance peak value of the microphone 1802 is higher than a preset high-frequency resonance peak value, taking the reference signal as an echo estimation signal;

s1707: if the high-frequency resonance peak value of the microphone 1802 is not higher than the preset high-frequency resonance peak value, performing acoustic response processing on the reference signal based on a microphone model 1805 to obtain an echo estimation signal;

s1708: the echo estimation signal is input to the adder 1806 to perform echo processing on the second audio signal recorded by the microphone 1802.

For example, as shown in fig. 18, the circuit structure 180 includes: a speaker 1801, a microphone 1802, a speaker model 1803, a delay attenuation module 1804, a microphone model 1805, an adder 1806, a speech processing module 1807, a noise reduction and other processing module 1808, a decoder 1809, an encoder 1810, and a transmitting end 1811. The speaker model 1803 is established based on the speaker 1801; the delay attenuation module 1804 delays and attenuates the audio signal that has passed through the speaker model 1803, and the delay and attenuation information is obtained according to the distance and position information between the speaker 1801 and the microphone 1802. The adder 1806 has the same function as the echo processing module, and is mainly used for performing echo processing on the echo signal, contained in the second audio signal recorded by the microphone 1802, that is generated by the first audio signal played by the speaker 1801. The specific operation of the circuit structure shown in fig. 18 is similar to that of the multi-speaker circuit structures described above and is not described in detail here.
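One possible way to derive delay and attenuation information from the speaker-to-microphone distance is sketched below. The source only states that this information is obtained from distance and position information; the speed of sound value, the inverse-distance gain law, the reference distance and the function names are all assumptions made for illustration.

```python
SPEED_OF_SOUND_M_PER_S = 343.0  # assumed value for air at roughly 20 degrees C


def delay_and_gain(distance_m, sample_rate_hz, reference_distance_m=0.05):
    """Return (delay in samples, linear gain) for a direct acoustic path.

    Assumes free-field propagation with an inverse-distance gain law,
    normalized to reference_distance_m and capped at 1.0.
    """
    delay_samples = round(distance_m / SPEED_OF_SOUND_M_PER_S * sample_rate_hz)
    gain = min(1.0, reference_distance_m / distance_m)
    return delay_samples, gain


def reference_signal(acoustic_response, delay_samples, gain):
    """Apply the delay and attenuation to the speaker model output."""
    delayed = [0.0] * delay_samples + list(acoustic_response)
    return [gain * v for v in delayed[:len(acoustic_response)]]


# A speaker 0.1 m from the microphone, sampled at 48 kHz.
delay, gain = delay_and_gain(0.1, 48000)
ref = reference_signal([1.0, 2.0, 3.0], 1, 0.5)
```

In a real device the values would more likely be measured or calibrated than computed from geometry alone, since the housing and reflections also shape the path.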

Referring to fig. 19, which shows a detailed flow of still another signal processing method provided by the embodiment of the present invention, in conjunction with an application example of a circuit structure 200 of a dual speaker and single microphone shown in fig. 20, the detailed flow may include:

s1901: respectively extracting one path of digital signal from the front ends of the first loudspeaker 2001a and the second loudspeaker 2001b to obtain two paths of first audio signals;

s1902: correspondingly inputting the two paths of first audio signals into a first loudspeaker model 2003a and a second loudspeaker model 2003b for acoustic response processing to obtain an acoustic response signal of each loudspeaker;

s1903: processing the acoustic response signal of each speaker with the first time delay attenuation module 2004a and the second time delay attenuation module 2004b, respectively, to obtain a reference signal of each speaker;

s1904: performing superposition and bit extension processing on the reference signal of each loudspeaker to obtain a superposed reference signal;

s1905: acquiring a high-frequency resonance peak of the microphone 2002;

s1906: comparing the high frequency resonance peak of the microphone 2002 with a preset high frequency resonance peak;

s1907: if the high-frequency resonance peak value of the microphone 2002 is higher than the preset high-frequency resonance peak value, taking the superposed reference signal as an echo estimation signal;

s1908: if the high-frequency resonance peak value of the microphone 2002 is not higher than the preset high-frequency resonance peak value, performing acoustic response processing on the superposed reference signal based on the microphone model 2006 to obtain an echo estimation signal;

s1909: the echo estimation signal is input to an echo processing module 2007 to perform echo processing on the second audio signal recorded by the microphone 2002.

It should be noted that, as shown in fig. 20, the circuit structure 200 includes: a first speaker 2001a, a second speaker 2001b, a microphone 2002, a first speaker model 2003a, a second speaker model 2003b, a first time delay attenuation module 2004a, a second time delay attenuation module 2004b, a summation module 2005, a microphone model 2006, an adder 2007, a speech processing module 2008, a noise reduction and other processing module 2009, a decoder 2010, an encoder 2011 and a transmitting terminal 2012. The adder 2007 has the same function as the echo processing module, and the specific operation of the circuit structure shown in fig. 20 is similar to that of the multi-speaker circuit structures described above (for example, fig. 11) and is not described in detail here.
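The superposition step performed by the summation module can be illustrated with integer samples. The sketch assumes that the bit extension processing refers to widening the word length so that the sum of the reference signals cannot overflow; this section of the source does not define the term, and the function names are hypothetical.

```python
def extended_bit_width(sample_bits, num_signals):
    """Word length needed to sum num_signals signed samples of sample_bits bits.

    Adds ceil(log2(num_signals)) guard bits, computed with integer arithmetic.
    """
    return sample_bits + (num_signals - 1).bit_length()


def superimpose(reference_signals):
    """Sample-wise sum of the reference signals (Python ints do not overflow)."""
    return [sum(samples) for samples in zip(*reference_signals)]


# Two 16-bit reference signals at their extreme values.
summed = superimpose([[32767, -32768], [32767, -32768]])
width_two_speakers = extended_bit_width(16, 2)
width_four_speakers = extended_bit_width(16, 4)
```

Under this assumption, two 16-bit reference signals need a 17-bit sum and four need an 18-bit sum.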

This embodiment elaborates the specific implementation of the foregoing embodiment in detail. It can be seen that the technical solutions of this embodiment effectively address the echo howling phenomenon that easily occurs when the high-frequency resonance peak of a speaker or a microphone is low, and also reduce the calculation workload in a multi-microphone design.

EXAMPLE III

Based on the same inventive concept as the foregoing embodiments, referring to fig. 21, which shows the composition of a signal processing apparatus 210 provided by an embodiment of the present invention, the signal processing apparatus 210 may include: at least one speaker 2101, at least one microphone 2102, a first receiving portion 2103, a first acquiring portion 2104, a second receiving portion 2105 and a second acquiring portion 2106, wherein,

the first receiving portion 2103 is configured to receive a first audio signal and play the first audio signal by the at least one speaker;

the first acquiring portion 2104 is configured to obtain at least one echo estimation signal corresponding to the first audio signal according to at least one speaker model, at least one microphone model and the first audio signal; wherein the at least one speaker model is derived based on the at least one speaker and the at least one microphone model is derived based on the at least one microphone;

the second receiving portion 2105 is configured to receive a second audio signal with the at least one microphone; wherein the second audio signal comprises an echo signal produced by the first audio signal output by the at least one speaker and received by the at least one microphone;

the second acquiring portion 2106 is configured to subtract the at least one echo estimation signal corresponding to the first audio signal from the second audio signal to obtain an echo processed audio signal.

In the above scheme, referring to fig. 22, the signal processing apparatus 210 further includes a preprocessing section 2107 configured to:

In the above scheme, referring to fig. 23, the signal processing apparatus 210 further includes a modeling section 2108 configured to:

In the above scheme, the first acquiring portion 2104 is specifically configured to:

In the above scheme, referring to fig. 24, the signal processing apparatus 210 further includes a first comparing portion 2109 configured to:

acquiring a high-frequency resonance peak value of the microphone;

In the above scheme, referring to fig. 25, the signal processing apparatus 210 further includes a second comparing part 2110 configured to:

It is understood that in this embodiment, a "part" may be part of a circuit, part of a processor, part of a program or software, and so on; it may also be a unit, and it may be modular or non-modular.

In addition, each component in the embodiment may be integrated in one processing unit, or each unit may exist alone physically, or two or more units are integrated in one unit. The integrated unit can be realized in a form of hardware or a form of a software functional module.

Based on this understanding, the technical solution of the present embodiment, in essence, or the part contributing to the prior art, or all or part of the technical solution, may be embodied in the form of a software product. The software product is stored in a storage medium and includes several instructions for causing a computer device (which may be a personal computer, a server, a network device, etc.) or a processor to execute all or part of the steps of the method of the present embodiment. The aforementioned storage medium includes various media capable of storing program code, such as a USB flash drive, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, or an optical disk.

Accordingly, the present embodiment provides a computer storage medium storing a signal processing program, which when executed by at least one processor implements the steps of the method of signal processing described in the first embodiment above.

Based on the above-mentioned composition of the signal processing apparatus 210 and the computer storage medium, referring to fig. 26, which shows a specific hardware structure of the signal processing apparatus 210 provided by the embodiment of the present invention, the specific hardware structure may include: a network interface 2601, memory 2602, and a processor 2603; the various components are coupled together by a bus system 2604. It is understood that the bus system 2604 is used to enable connected communication between these components. The bus system 2604 includes a power bus, a control bus, and a status signal bus in addition to a data bus. For clarity of illustration, however, the various buses are labeled as bus system 2604 in FIG. 26. The network interface 2601 is configured to receive and transmit signals during information transmission and reception with other external network elements;

a memory 2602 for storing a computer program operable on the processor 2603;

a processor 2603 configured to, when running the computer program, perform:

It is to be understood that the memory 2602 in embodiments of the present invention can be either volatile memory or nonvolatile memory, or can include both volatile and nonvolatile memory. The nonvolatile memory may be a Read-Only Memory (ROM), a Programmable ROM (PROM), an Erasable PROM (EPROM), an Electrically Erasable PROM (EEPROM), or a flash memory. The volatile memory may be a Random Access Memory (RAM), which acts as an external cache. By way of example, but not limitation, many forms of RAM are available, such as Static RAM (SRAM), Dynamic RAM (DRAM), Synchronous DRAM (SDRAM), Double Data Rate SDRAM (DDR SDRAM), Enhanced SDRAM (ESDRAM), Synchlink DRAM (SLDRAM), and Direct Rambus RAM (DRRAM). The memory 2602 described herein is intended to comprise, without being limited to, these and any other suitable types of memory.

The processor 2603 may be an integrated circuit chip having signal processing capabilities. In implementation, the steps of the above method may be completed by integrated logic circuits of hardware in the processor 2603 or by instructions in the form of software. The processor 2603 may be a general-purpose processor, a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field Programmable Gate Array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component, and may implement or perform the various methods, steps and logic blocks disclosed in the embodiments of the present invention. A general-purpose processor may be a microprocessor or any conventional processor. The steps of the method disclosed in connection with the embodiments of the present invention may be directly performed by a hardware decoding processor, or performed by a combination of hardware and software modules in the decoding processor. The software module may be located in a storage medium well known in the art, such as a RAM, a flash memory, a ROM, a PROM, an EPROM, or a register. The storage medium is located in the memory 2602; the processor 2603 reads the information in the memory 2602 and completes the steps of the method in combination with its hardware.

It is to be understood that the embodiments described herein may be implemented in hardware, software, firmware, middleware, microcode, or any combination thereof. For a hardware implementation, the Processing units may be implemented within one or more Application Specific Integrated Circuits (ASICs), Digital Signal Processors (DSPs), Digital Signal Processing Devices (DSPDs), Programmable Logic Devices (PLDs), Field Programmable Gate Arrays (FPGAs), general purpose processors, controllers, micro-controllers, microprocessors, other electronic units designed to perform the functions described herein, or a combination thereof.

For a software implementation, the techniques described herein may be implemented with modules (e.g., procedures, functions, and so on) that perform the functions described herein. The software codes may be stored in a memory and executed by a processor. The memory may be implemented within the processor or external to the processor.

Optionally, as another embodiment, the processor 2603 is further configured to execute the steps of the signal processing method according to the first embodiment when the computer program is executed.

It should be noted that: the technical schemes described in the embodiments of the present invention can be combined arbitrarily without conflict.

The above description is only for the specific embodiments of the present invention, but the scope of the present invention is not limited thereto, and any person skilled in the art can easily conceive of the changes or substitutions within the technical scope of the present invention, and all the changes or substitutions should be covered within the scope of the present invention. Therefore, the protection scope of the present invention shall be subject to the protection scope of the appended claims.

Claims

1. A signal processing method applied to a signal processing apparatus having at least one speaker and at least one microphone, the method comprising:

2. The method of claim 1, wherein prior to said receiving and playing by said at least one speaker a first audio signal, said method further comprises:

3. The method of claim 1, wherein before the obtaining at least one echo estimation signal corresponding to the first audio signal based on at least one speaker model, at least one microphone model, and the first audio signal, the method further comprises:

4. The method of claim 3, wherein the number of microphones is one, the number of corresponding microphone models is one, and the obtaining at least one echo estimation signal corresponding to the first audio signal according to the at least one speaker model, the at least one microphone model, and the first audio signal comprises:

5. The method of claim 4, wherein after the obtaining the first superimposed reference signal, the method further comprises:

acquiring a high-frequency resonance peak value of the microphone;

6. The method of claim 3, wherein the number of microphones is two, the number of corresponding microphone models is two, and the obtaining at least one echo estimation signal corresponding to the first audio signal according to the at least one speaker model, the at least one microphone model, and the first audio signal comprises:

7. The method of claim 6, wherein after the corresponding obtaining the first superimposed reference signal and the second superimposed reference signal, the method further comprises:

8. The method according to any one of claims 4 to 7, wherein the inputting the first audio signal into each speaker model for acoustic response processing to obtain the acoustic response signal of each speaker specifically comprises:

9. A signal processing apparatus, characterized in that the signal processing apparatus comprises: at least one speaker, at least one microphone, a first receiving section, a first acquiring section, a second receiving section, and a second acquiring section, wherein,

10. A signal processing apparatus, characterized in that the signal processing apparatus comprises: a network interface, a memory, and a processor; wherein,

the memory for storing a computer program operable on the processor;

the processor, when executing the computer program, is configured to perform the steps of the method of signal processing according to any of claims 1 to 8.

11. A computer storage medium, characterized in that it stores a signal processing program which, when executed by at least one processor, implements the steps of the method of signal processing according to any one of claims 1 to 8.